---
title: "Preprocessing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preprocessing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
Preprocessing cleans and prepares text for analysis.
## Workflow
```{r}
library(TextAnalysisR)
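# Example data available once TextAnalysisR is loaded
mydata <- SpecialEduTech
# Combine the title, keyword, and abstract columns into one text column
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
# Tokenize the united text, dropping punctuation and numbers
tokens <- prep_texts(
  united_tbl,
  text_field = "united_texts",
  remove_punct = TRUE,
  remove_numbers = TRUE
)
# Remove English stopwords, then build the document-feature matrix
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))
dfm_object <- quanteda::dfm(tokens_clean)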
```
**Unite Text Columns**
Uniting combines multiple text columns into a single column for analysis. This is useful when text content is spread across several fields that should be analyzed together.
**Examples:**
- **Survey Data:** Combine multiple open-ended response columns
- **Multi-field Text:** Merge title, abstract, and body fields
- **Comments:** Concatenate multiple comment or note columns
**Usage:** Select one or more text columns to combine. Columns are concatenated with spaces between them. The united column becomes the text source for all subsequent preprocessing and analysis steps.
**Learn More:** [tidyr Unite Function](https://tidyr.tidyverse.org/reference/unite.html)
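As a small illustration of what uniting does, the sketch below applies `tidyr::unite()` directly to a toy data frame. The toy data and the column name `united_texts` are made up for this example, and `unite_cols()` may differ from it in details such as argument names.

```{r, eval = FALSE}
library(tidyr)

# Toy data (hypothetical): two text fields per document
toy <- data.frame(
  title = c("Math apps", "Reading tools"),
  abstract = c("Tablets in class", "Audio books help"),
  stringsAsFactors = FALSE
)

# Concatenate the two columns with a space; keep the originals
united <- unite(toy, col = "united_texts", title, abstract,
                sep = " ", remove = FALSE)

united$united_texts
# "Math apps Tablets in class" "Reading tools Audio books help"
```

Setting `sep = " "` matters here because `tidyr::unite()` joins values with an underscore by default.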
**Tokenization Options**
Tokenization segments continuous text into individual units (tokens), typically words, converting unstructured text into a structured format for computational analysis.
**Options:**
- **Lowercase:** Convert all text to lowercase to treat "Text" and "text" as identical
- **Remove Punctuation:** Strip punctuation marks like periods, commas, quotes
- **Remove Numbers:** Drop tokens that consist only of numbers (keep them for technical texts where figures matter)
- **Remove Symbols:** Remove special characters (@, #, $, etc.)
- **Remove URLs:** Identify and remove web addresses
| Parameter | Default | When to Change |
|-----------|---------|----------|
| `remove_punct` | TRUE | FALSE for sentiment analysis |
| `remove_numbers` | TRUE | FALSE for quantitative text |
| `lowercase` | TRUE | FALSE to preserve case |
**Usage:** Select preprocessing options based on the analysis goals. When sentence structure matters (e.g., sentiment analysis), use sentence segmentation to split the text into sentences before tokenization.
**Learn More:** [quanteda Tokens Documentation](https://quanteda.io/reference/tokens.html)
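The options above map onto arguments of `quanteda::tokens()`. The following is a minimal sketch using quanteda directly on a made-up sentence; it assumes `prep_texts()` passes similar options through, which the source does not spell out.

```{r, eval = FALSE}
library(quanteda)

# Made-up example text
txt <- c(d1 = "The 3 students read 12 books, quickly!")

# Tokenize and drop punctuation, number-only tokens, symbols, and URLs
toks <- tokens(
  txt,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
)

# Lowercasing is a separate step in quanteda
toks <- tokens_tolower(toks)

as.character(toks)
# "the" "students" "read" "books" "quickly"
```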
**Stopword Removal**
Stopwords are common words (e.g., "the", "is", "and") that appear frequently but carry little meaningful content for analysis. Removing them reduces noise and improves focus on content-bearing words.
**When to Remove:**
- **Topic Modeling:** Helps identify content themes by removing function words
- **Keyword Extraction:** Ensures meaningful terms rise to the top
- **Content Analysis:** Focuses on substantive vocabulary
**Usage:** Use predefined stopword lists (e.g., Snowball) or add custom words. For sentiment analysis or syntactic studies, consider keeping stopwords: the Snowball English list includes negations such as "not" and "no", which can change the meaning of a sentence.
**Learn More:** [stopwords Package Documentation](https://search.r-project.org/CRAN/refmans/stopwords/html/stopwords.html)
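A short sketch of combining a predefined list with custom words, using quanteda directly. The sentence and the custom word `"use"` are hypothetical choices for illustration.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens("The students and the teacher use the tablet")
toks <- tokens_tolower(toks)

# Predefined Snowball list plus a hypothetical custom stopword
custom_words <- c("use")
stop_list <- c(stopwords("en", source = "snowball"), custom_words)

toks_clean <- tokens_remove(toks, stop_list)

as.character(toks_clean)
# "students" "teacher" "tablet"
```

Custom words are typically terms that are frequent in a particular corpus but uninformative for the question at hand.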
**Lemmatization**
Lemmatization reduces words to their base or dictionary form (lemma). For example, "running", "ran", and "runs" all become "run". This groups related word forms together for more meaningful analysis.
**Comparison:**
- **Lemmatization:** Uses linguistic knowledge to produce valid dictionary words (studies → study)
- **Stemming:** Uses simple rules to chop word endings (studies → studi)
- **Advantage:** Lemmatization produces readable, meaningful base forms
**Usage:** Apply lemmatization after tokenization to consolidate word variants. It is particularly useful for topic modeling and keyword extraction, where grouping related forms improves interpretability. Lemmatization requires Python with spaCy installed.
**Learn More:** [spaCy Lemmatization Guide](https://spacy.io/usage/linguistic-features#lemmatization)
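As a rough illustration of the comparison above, the sketch below stems a few tokens with `quanteda::tokens_wordstem()` and, separately, requests lemmas from spaCy through the `spacyr` package. Using `spacyr` and the `en_core_web_sm` model is an assumption made for this example; the source does not state how TextAnalysisR calls spaCy internally.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens("studies running runs")

# Stemming: rule-based suffix stripping (Snowball stemmer)
stemmed <- tokens_wordstem(toks, language = "english")
as.character(stemmed)
# "studies" is reduced to the non-word "studi"

# Lemmatization (assumption: spacyr with a working Python spaCy
# installation and the en_core_web_sm model)
library(spacyr)
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse("studies running runs",
                      lemma = TRUE, pos = FALSE, entity = FALSE)
parsed[, c("token", "lemma")]
spacy_finalize()
```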
**Document-Feature Matrix (DFM)**
A Document-Feature Matrix (DFM) is a mathematical representation where rows are documents, columns are unique tokens (features), and cells contain frequency counts. It converts unstructured text into a structured numerical format for computational analysis.
**Process:**
- **Tokenization:** Text is split into individual tokens (words)
- **Vocabulary:** All unique tokens form the matrix columns
- **Counting:** Each document-token pair is counted
- **Sparse Matrix:** Efficient storage format for large corpora
**Usage:** The DFM is the foundation for downstream analyses such as keyword extraction, topic modeling, and semantic analysis. Create it after preprocessing (tokenization, stopword removal, lemmatization).
**Learn More:** [quanteda DFM Documentation](https://quanteda.io/reference/dfm.html)
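To make the rows, columns, and counts concrete, here is a minimal sketch that builds a DFM from two made-up documents with quanteda.

```{r, eval = FALSE}
library(quanteda)

# Two tiny example documents
toks <- tokens(c(d1 = "apple banana apple", d2 = "banana cherry"))

# Rows are documents, columns are unique tokens, cells are counts
m <- dfm(toks)

dim(m)
# 2 3  (2 documents, 3 features)

as.matrix(m)
#     features
# docs apple banana cherry
#   d1     2      1      0
#   d2     0      1      1
```

The zero cells show why a sparse format pays off: in a real corpus most documents contain only a small share of the vocabulary.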
## Multi-word Expressions
Detect phrases like "machine learning" and compound them in the tokens object:
```{r, eval = FALSE}
compounds <- detect_multi_words(tokens, min_count = 10)
tokens <- quanteda::tokens_compound(tokens, compounds)
```
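To show what compounding does to a tokens object, the sketch below joins one hand-picked phrase using quanteda alone. The phrase is supplied manually here rather than detected, so this does not reproduce `detect_multi_words()`.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens("machine learning improves machine learning tools")

# phrase() tells quanteda to treat the string as a token sequence
compounded <- tokens_compound(toks, pattern = phrase("machine learning"))

as.character(compounded)
# "machine_learning" "improves" "machine_learning" "tools"
```

After compounding, "machine_learning" is counted as a single feature in any DFM built from these tokens.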