---
title: "Preprocessing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preprocessing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Preprocessing cleans and prepares text for analysis.

## Workflow

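```{r}
library(TextAnalysisR)

# Example dataset used throughout this vignette
mydata <- SpecialEduTech

# Combine title, keyword, and abstract into a single text column
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))

# Tokenize the united text, dropping punctuation and numbers
tokens <- prep_texts(
  united_tbl,
  text_field = "united_texts",
  remove_punct = TRUE,
  remove_numbers = TRUE
)

# Remove English stopwords
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))

# Build the document-feature matrix from the cleaned tokens
dfm_object <- quanteda::dfm(tokens_clean)
```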

**Unite Text Columns**

Uniting combines multiple text columns into a single column for analysis. This is useful when text content is spread across several fields that should be analyzed together.

**Examples:**

- **Survey Data:** Combine multiple open-ended response columns
- **Multi-field Text:** Merge title, abstract, and body fields
- **Comments:** Concatenate multiple comment or note columns

**Usage:** Select one or more text columns to combine. Columns are concatenated with spaces between them. The united column becomes the text source for all subsequent preprocessing and analysis steps.

**Learn More:** [tidyr Unite Function](https://tidyr.tidyverse.org/reference/unite.html)
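
As a rough illustration of what uniting columns amounts to, the sketch below calls `tidyr::unite()` directly on a small made-up data frame. The toy data, the `sep = " "` separator, and the `na.rm = TRUE` setting are assumptions for illustration; `unite_cols()` may differ in its details.

```{r, eval = FALSE}
library(tidyr)

# Toy data standing in for a real dataset (hypothetical values)
toy <- data.frame(
  title    = c("Reading apps", "Math games"),
  keyword  = c("literacy", NA),
  abstract = c("A study of tablets.", "A study of games."),
  stringsAsFactors = FALSE
)

# Combine the three columns into one, separated by spaces;
# na.rm = TRUE drops missing values instead of pasting "NA"
united <- unite(
  toy,
  col = "united_texts",
  title, keyword, abstract,
  sep = " ",
  remove = FALSE,
  na.rm = TRUE
)

united$united_texts
```

Keeping the original columns (`remove = FALSE`) makes it easy to check the united text against its sources.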

**Tokenization Options**

Tokenization segments continuous text into individual units (tokens), typically words, converting unstructured text into a structured format for computational analysis.

**Options:**

- **Lowercase:** Convert all text to lowercase to treat "Text" and "text" as identical
- **Remove Punctuation:** Strip punctuation marks like periods, commas, quotes
- **Remove Numbers:** Eliminate numeric digits (keep them for technical texts where numbers carry meaning)
- **Remove Symbols:** Remove special characters (@, #, $, etc.)
- **Remove URLs:** Identify and remove web addresses

| Parameter | Default | When to Override |
|-----------|---------|------------------|
| `remove_punct` | TRUE | Set to FALSE for sentiment analysis |
| `remove_numbers` | TRUE | Set to FALSE for quantitative text |
| `lowercase` | TRUE | Set to FALSE to preserve case |

**Usage:** Select preprocessing options based on the analysis goals. When sentence structure matters (e.g., sentiment analysis), sentence segmentation can split the text into sentences before tokenization.

**Learn More:** [quanteda Tokens Documentation](https://quanteda.io/reference/tokens.html)
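
The options above map onto arguments of `quanteda::tokens()`. The following is a minimal sketch that uses quanteda directly rather than `prep_texts()`; the example sentences are made up, and how `prep_texts()` passes these options through is not shown here.

```{r, eval = FALSE}
library(quanteda)

txt <- c(d1 = "Text mining in 2024: visit https://example.com for more!")

# Tokenize into words and drop punctuation, numbers, symbols, and URLs
toks <- tokens(
  txt,
  what = "word",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
)

# In quanteda, lowercasing is a separate step after tokenization
toks <- tokens_tolower(toks)

# Sentence segmentation: one token per sentence instead of per word
sents <- tokens(
  c(d1 = "I liked it. I did not like the ending."),
  what = "sentence"
)
```

Printing `toks` and `sents` side by side shows the difference between word-level and sentence-level units.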

**Stopword Removal**

Stopwords are common words (e.g., "the", "is", "and") that appear frequently but carry little meaningful content for analysis. Removing them reduces noise and improves focus on content-bearing words.

**When to Remove:**

- **Topic Modeling:** Helps identify content themes by removing function words
- **Keyword Extraction:** Ensures meaningful terms rise to the top
- **Content Analysis:** Focuses on substantive vocabulary

**Usage:** Use a predefined stopword list (e.g., Snowball) and add custom words as needed. For sentiment analysis or syntactic studies, consider keeping stopwords, because some of them (for example, negations such as "not") carry important meaning.

**Learn More:** [stopwords Package Documentation](https://search.r-project.org/CRAN/refmans/stopwords/html/stopwords.html)
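
A predefined list can be combined with custom words before removal. The sketch below assumes the Snowball English list and a single hypothetical custom stopword (`"study"`); adapt both to the corpus at hand.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens(
  "The students in the study used the tablet and the app",
  remove_punct = TRUE
)
toks <- tokens_tolower(toks)

# Predefined Snowball list for English plus corpus-specific additions
custom_words <- c("study")  # hypothetical custom stopword
stop_list <- c(stopwords("en", source = "snowball"), custom_words)

# Drop every token that matches an entry in the combined list
toks_nostop <- tokens_remove(toks, pattern = stop_list)
```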

**Lemmatization**

Lemmatization reduces words to their base or dictionary form (lemma). For example, "running", "ran", and "runs" all become "run". This groups related word forms together for more meaningful analysis.

**Comparison:**

- **Lemmatization:** Uses linguistic knowledge to produce valid dictionary words (studies → study)
- **Stemming:** Uses simple rules to chop word endings (studies → studi)
- **Advantage:** Lemmatization produces readable, meaningful base forms

**Usage:** Apply lemmatization after tokenization to consolidate word variants. It is particularly useful for topic modeling and keyword extraction, where grouping related forms improves interpretability. This step requires Python with spaCy.

**Learn More:** [spaCy Lemmatization Guide](https://spacy.io/usage/linguistic-features#lemmatization)
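
The contrast between the two approaches can be sketched without Python. Below, `quanteda::tokens_wordstem()` performs rule-based stemming, while a small hand-made lookup table stands in for a lemmatizer. The lookup table is hypothetical and only illustrates the idea; a spaCy-based lemmatizer derives lemmas from a language model instead.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens("studies running ran runs")

# Stemming: rule-based suffix stripping (Snowball stemmer)
stemmed <- tokens_wordstem(toks, language = "english")

# Lemmatization stand-in: map each form to its dictionary form
# using a hand-made table (hypothetical, for illustration only)
lemma_lookup <- data.frame(
  form  = c("studies", "running", "ran", "runs"),
  lemma = c("study", "run", "run", "run"),
  stringsAsFactors = FALSE
)
lemmatized <- tokens_replace(
  toks,
  pattern = lemma_lookup$form,
  replacement = lemma_lookup$lemma,
  valuetype = "fixed"
)
```

Comparing `stemmed` with `lemmatized` shows why lemmas are easier to read: the irregular form "ran" is only consolidated by the lookup, not by suffix stripping.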

**Document-Feature Matrix (DFM)**

A Document-Feature Matrix (DFM) is a mathematical representation where rows are documents, columns are unique tokens (features), and cells contain frequency counts. It converts unstructured text into a structured numerical format for computational analysis.

**Process:**

- **Tokenization:** Text is split into individual tokens (words)
- **Vocabulary:** All unique tokens form the matrix columns
- **Counting:** Each document-token pair is counted
- **Sparse Matrix:** Efficient storage format for large corpora

**Usage:** The DFM is the foundation for all downstream analyses including keyword extraction, topic modeling, and semantic analysis. Create it after preprocessing (tokenization, stopword removal, lemmatization).

**Learn More:** [quanteda DFM Documentation](https://quanteda.io/reference/dfm.html)
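
The structure described above is easiest to see on a tiny example. This sketch builds a DFM from two made-up documents with quanteda; the texts are illustrative only.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens(c(doc1 = "cat sat cat", doc2 = "dog sat"))

# Rows are documents, columns are unique tokens, cells are counts
d <- dfm(toks)

ndoc(d)          # number of documents (rows)
nfeat(d)         # number of unique features (columns)
as.matrix(d)     # dense view; only sensible for small examples
topfeatures(d)   # most frequent features across all documents
```

For real corpora, keep the object in its sparse form and avoid `as.matrix()`, since most cells are zero.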

## Multi-word Expressions

Detect phrases like "machine learning" and compound them in the tokens object:

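```{r, eval = FALSE}
# Detect candidate multi-word expressions; min_count sets the frequency threshold
compounds <- detect_multi_words(tokens, min_count = 10)

# Join each detected expression into a single token
tokens <- quanteda::tokens_compound(tokens, compounds)
```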
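
To show what compounding does to a tokens object, the sketch below compounds one hand-written phrase with quanteda alone. The phrase and the example sentence are assumptions for illustration; the format returned by `detect_multi_words()` is not shown here.

```{r, eval = FALSE}
library(quanteda)

toks <- tokens("Machine learning supports text analysis")
toks <- tokens_tolower(toks)

# phrase() marks the pattern as a sequence of tokens rather than
# a single token that contains a space
compounded <- tokens_compound(toks, pattern = phrase("machine learning"))
```

With the default concatenator, the two tokens are joined as `machine_learning`, so later steps count the expression as one feature.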