---
title: "Topic Modeling"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Topic Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Topic modeling discovers latent themes in a text collection. The Shiny app offers two types:

- **Structural Topic Model (STM):** word-probability topics that incorporate document metadata as covariates. Runs in R.
- **Embedding-based topics:** transformer embeddings reduced and clustered into topics. Best for short texts and multilingual content. Requires Python (BERTopic) or an embedding API key.

The sections below follow the app's **Topic Modeling** tabs in order, first with STM (run live on the bundled data) and then with the embedding-based type.

## Setup

A 150-document subset of `SpecialEduTech` keeps the build fast; the full dataset works the same way.

```{r}
library(TextAnalysisR)

mydata <- SpecialEduTech[1:150, ]
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts", remove_stopwords = TRUE)
dfm_object <- quanteda::dfm(tokens)
```

## Model Configuration

`find_optimal_k()` compares models across a range of K via `searchK` (semantic coherence, exclusivity, held-out likelihood, residuals). The example uses a small document subset and a narrow K range to keep the runtime short.

```{r}
k_search <- find_optimal_k(dfm_object, topic_range = 5:6)
names(k_search)
```

Fit a small STM for the examples below. `prevalence` makes topic proportions depend on document metadata.

```{r}
out <- quanteda::convert(dfm_object, to = "stm")
out$meta$year <- as.numeric(out$meta$year)

model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 5,
  prevalence = ~ reference_type + year,
  data = out$meta,
  max.em.its = 15,
  init.type = "Spectral",
  verbose = FALSE
)
```

See: [Structural Topic Model](https://www.structuraltopicmodel.com/) · [stm on CRAN](https://CRAN.R-project.org/package=stm)

## Word-Topic

`get_topic_terms()` returns the top terms per topic from the word-topic (beta) distribution.

```{r}
terms <- get_topic_terms(model, top_term_n = 10)
head(terms, 15)
```

## Content Generation

`generate_topic_content()` drafts topic-grounded content (survey items, research questions, theme descriptions, policy recommendations, interview questions) from `get_topic_terms()` output. It calls an LLM, so it needs an OpenAI or Gemini API key.

| Content type | Output |
|------|--------|
| `survey_item` | Likert-scale statement |
| `research_question` | Research question |
| `theme_description` | Qualitative theme summary |
| `policy_recommendation` | Action-oriented statement |
| `interview_question` | Open-ended question |

```{r, eval = FALSE}
labels <- generate_topic_labels(terms, provider = "openai")
content <- generate_topic_content(terms, content_type = "research_question", provider = "openai")
```

## Document-Topic

`calculate_topic_probability()` summarizes the corpus-level expected topic proportions (mean of the per-document theta matrix).

```{r}
doc_topic <- calculate_topic_probability(model, top_n = 10, verbose = FALSE)
doc_topic
```

## Quotes

`stm::findThoughts()` returns the documents most representative of a topic.

```{r}
quotes <- stm::findThoughts(model, texts = united_tbl$united_texts, topics = 1, n = 3)
quotes
```

## Estimated Effects

`stm::estimateEffect()` regresses topic proportions on document covariates; `tidytext::tidy()` returns the per-topic coefficient table shown in the app's Estimated Effects tab.

```{r}
prep <- stm::estimateEffect(
  1:5 ~ reference_type + year,
  model,
  metadata = out$meta,
  uncertainty = "Global"
)
head(tidytext::tidy(prep), 10)
```

## Categorical Covariates

`plot_topic_effects_categorical()` plots topic proportions across the levels of a categorical covariate. `stm::plot.estimateEffect(..., method = "pointestimate", omit.plot = TRUE)` returns the model-based estimate and 95% CI per topic and level, reshaped here into the columns the plot expects.

```{r}
ec <- stm::plot.estimateEffect(
  prep, "reference_type",
  method = "pointestimate", model = model, omit.plot = TRUE
)

effects_cat <- do.call(rbind, lapply(seq_along(ec$topics), function(i) {
  data.frame(
    topic = ec$topics[i],
    value = as.character(ec$uvals),
    proportion = as.numeric(ec$means[[i]]),
    lower = as.numeric(ec$cis[[i]][1, ]),
    upper = as.numeric(ec$cis[[i]][2, ])
  )
}))

plot_topic_effects_categorical(effects_cat)
```

## Continuous Covariates

`plot_topic_effects_continuous()` plots topic proportions across a continuous covariate. The same `plot.estimateEffect()` call with `method = "continuous"` returns the estimate and CI over a grid of the covariate.

```{r}
en <- stm::plot.estimateEffect(
  prep, "year",
  method = "continuous", model = model, omit.plot = TRUE
)

effects_cont <- do.call(rbind, lapply(seq_along(en$topics), function(i) {
  data.frame(
    topic = en$topics[i],
    value = en$x,
    proportion = as.numeric(en$means[[i]]),
    lower = as.numeric(en$ci[[i]][1, ]),
    upper = as.numeric(en$ci[[i]][2, ])
  )
}))

plot_topic_effects_continuous(effects_cont)
```

## Embedding-based Topics

The second type embeds documents with a transformer, reduces dimensionality, and clusters the embeddings into topics. `get_best_embeddings()` produces the embeddings and `fit_embedding_model()` clusters them; `generate_topic_labels()` drafts labels. These need Python (BERTopic) or an embedding API key, so they are shown without running.

```{r, eval = FALSE}
embeddings <- get_best_embeddings(united_tbl$united_texts, provider = "openai")

topics <- fit_embedding_model(
  united_tbl$united_texts,
  method = "umap_hdbscan",
  n_topics = 10,
  precomputed_embeddings = embeddings
)

embedding_labels <- generate_topic_labels(topics$topic_terms, provider = "openai")
```

| Provider | Model | Notes |
|----------|-------|-------|
| Sentence Transformers (default) | all-MiniLM-L6-v2, all-mpnet-base-v2 | Requires Python |
| OpenAI | text-embedding-3-small, text-embedding-3-large | API key |
| Gemini | gemini-embedding-001 | API key |

R backend methods follow `{dimred}_{clustering}` (e.g., `"umap_dbscan"`, `"tsne_kmeans"`, `"pca_hierarchical"`).

See: [BERTopic](https://maartengr.github.io/BERTopic/) · [Sentence-BERT](https://www.sbert.net/)

## STM vs Embedding

| Feature | STM | Embedding |
|---------|-----|-----------|
| Speed | Fast | Medium |
| Metadata covariates | Yes | No |
| Short texts | Weaker | Strong |
| Multilingual | No | Yes |
| Dependencies | R only | Python or API |

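
The `{dimred}_{clustering}` naming used for the R backend methods can be read as a two-step pipeline: reduce the embedding matrix, then cluster the reduced coordinates. The chunk below is a minimal base-R sketch of what a `"pca_hierarchical"`-style pipeline might look like on a toy matrix. It is an illustration only and not the implementation inside `fit_embedding_model()`; the simulated matrix `emb` and the names `reduced` and `cluster_id` are hypothetical, and real embeddings would have far more dimensions.

```{r}
# Toy stand-in for document embeddings (assumption: 30 documents x 8 dimensions,
# simulated as three well-separated groups instead of real transformer output).
set.seed(42)
emb <- rbind(
  matrix(rnorm(10 * 8, mean = 0), ncol = 8),
  matrix(rnorm(10 * 8, mean = 5), ncol = 8),
  matrix(rnorm(10 * 8, mean = -5), ncol = 8)
)

# Step 1 (dimred): PCA, keeping the first two principal components.
pca <- stats::prcomp(emb, center = TRUE, scale. = FALSE)
reduced <- pca$x[, 1:2]

# Step 2 (clustering): Ward hierarchical clustering on the reduced
# coordinates, cut into three clusters that play the role of topics.
hc <- stats::hclust(stats::dist(reduced), method = "ward.D2")
cluster_id <- stats::cutree(hc, k = 3)

# Number of documents assigned to each cluster.
table(cluster_id)
```

Swapping either step (for example UMAP for PCA, or k-means for the hierarchical cut) changes the method name but not the overall shape of the pipeline.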