Preprocessing cleans and prepares text for analysis.
library(TextAnalysisR)
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(
  united_tbl,
  text_field = "united_texts",
  remove_punct = TRUE,
  remove_numbers = TRUE
)
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))
dfm_object <- quanteda::dfm(tokens_clean)
Unite Text Columns
Uniting combines multiple text columns into a single column for analysis. This is useful when text content is spread across several fields that should be analyzed together.
Usage: Select one or multiple text columns to combine. Columns are concatenated with spaces between them. The united column becomes the text source for all subsequent preprocessing and analysis steps.
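The concatenation described above can be sketched with `tidyr::unite()`. This is an illustration only: it assumes `unite_cols()` behaves like `tidyr::unite()` with a space separator, and the toy data frame `docs` is hypothetical.

```r
library(tidyr)

# Hypothetical toy data; column names mirror those used in the example above
docs <- data.frame(
  title    = c("Math apps", "Reading tools"),
  keyword  = c("tablet; fractions", "e-books"),
  abstract = c("Students practiced fractions.", "Learners read aloud."),
  stringsAsFactors = FALSE
)

# Concatenate the three columns into one, separated by a single space.
# remove = FALSE keeps the original columns alongside the new one.
united <- unite(docs, col = "united_texts", title, keyword, abstract,
                sep = " ", remove = FALSE)

united$united_texts
```

Keeping the original columns (`remove = FALSE`) makes it easy to check the united text against its sources; drop them once the result looks right.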
Learn More: tidyr Unite Function
Tokenization Options
Tokenization segments continuous text into individual units (tokens), typically words, converting unstructured text into a structured format for computational analysis.
Options:
| Parameter | Default | Use Case |
|---|---|---|
| remove_punct | TRUE | FALSE for sentiment analysis |
| remove_numbers | TRUE | FALSE for quantitative text |
| lowercase | TRUE | FALSE to preserve case |
Usage: Select preprocessing options based on the analysis goals. Sentence segmentation splits text into sentences before tokenization when sentence structure is important (e.g., sentiment analysis).
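To make the effect of these options concrete, here is a minimal sketch that calls quanteda directly. How `prep_texts()` applies its `lowercase` option internally is not shown in this document, so lowercasing is done here with `tokens_tolower()` as an assumption; the sample sentence is made up.

```r
library(quanteda)

txt <- c(d1 = "The 3 students LOVED the app! It was not boring.")

# Default-style preprocessing: drop punctuation and numbers, then lowercase
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
as.list(toks)$d1

# Keep punctuation, numbers, and case (e.g., for sentiment analysis)
toks_raw <- tokens(txt, remove_punct = FALSE, remove_numbers = FALSE)
as.list(toks_raw)$d1

# Sentence segmentation: each token is a whole sentence
sents <- tokens(txt, what = "sentence")
as.list(sents)$d1
```

Comparing `toks` with `toks_raw` shows what is lost when punctuation and case are removed, which is the trade-off the table above summarizes.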
Learn More: quanteda Tokens Documentation
Stopword Removal
Stopwords are common words (e.g., “the”, “is”, “and”) that appear frequently but carry little meaningful content for analysis. Removing them reduces noise and improves focus on content-bearing words.
Usage: Use predefined stopword lists (e.g., Snowball) or add custom words. For sentiment analysis or syntactic studies, consider keeping stopwords as they may carry important meaning.
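A short sketch of combining a predefined list with custom words, using quanteda. The custom words `"students"` and `"tool"` are hypothetical stand-ins for domain-specific terms you might want to drop.

```r
library(quanteda)

toks <- tokens("The tablet is a useful tool and the students like it",
               remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Snowball English list plus hypothetical custom, domain-specific stopwords
custom_words <- c("students", "tool")
stop_list <- c(stopwords("en", source = "snowball"), custom_words)

# Remove every token that matches an entry in the combined list
toks_clean <- tokens_remove(toks, pattern = stop_list)
as.list(toks_clean)[[1]]
```

To keep stopwords for sentiment or syntactic work, skip the `tokens_remove()` step, or pass only `custom_words` as the pattern.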
Learn More: stopwords Package Documentation
Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). For example, “running”, “ran”, and “runs” all become “run”. This groups related word forms together for more meaningful analysis.
Usage: Apply lemmatization after tokenization to consolidate word variants. It is particularly useful for topic modeling and keyword extraction, where grouping related forms improves interpretability. This step requires Python with spaCy installed.
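The idea of mapping word forms to a shared lemma can be illustrated without a Python installation. The sketch below is not the spaCy-based lemmatizer described above: it swaps in a small hand-made lookup table (hypothetical) and applies it with `quanteda::tokens_replace()`, just to show what the consolidated tokens look like.

```r
library(quanteda)

toks <- tokens("She runs daily, ran yesterday, and is running now",
               remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Hypothetical hand-made lemma table; spaCy would derive lemmas automatically
word_forms <- c("running", "ran", "runs")
lemmas     <- c("run", "run", "run")

# Replace each listed word form with its lemma (exact matching)
toks_lemma <- tokens_replace(toks, pattern = word_forms,
                             replacement = lemmas, valuetype = "fixed")
as.list(toks_lemma)[[1]]
```

A lookup table only covers the forms you list; a model-based lemmatizer such as spaCy also uses part-of-speech context, which is why the package relies on it.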
Learn More: spaCy Lemmatization Guide
Document-Feature Matrix (DFM)
A Document-Feature Matrix (DFM) is a mathematical representation in which rows are documents, columns are unique tokens (features), and cells contain frequency counts. It converts unstructured text into a structured numerical format for computational analysis.
Usage: The DFM is the foundation for all downstream analyses including keyword extraction, topic modeling, and semantic analysis. Create it after preprocessing (tokenization, stopword removal, lemmatization).
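The row/column/cell structure is easiest to see on a tiny corpus. This minimal sketch uses two made-up documents and builds the matrix with `quanteda::dfm()`.

```r
library(quanteda)

# Two hypothetical documents
txts <- c(doc1 = "tablet app tablet", doc2 = "app reading")

toks <- tokens(txts)
dfm_small <- dfm(toks)

# Rows are documents, columns are features, cells are frequency counts
mat <- as.matrix(dfm_small)
mat

# Count of "tablet" in doc1
mat["doc1", "tablet"]

# Most frequent features across the whole corpus
topfeatures(dfm_small)
```

Converting to a dense matrix with `as.matrix()` is fine for inspection on small examples; for real corpora the sparse DFM object is usually kept as-is.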
Learn More: quanteda DFM Documentation