---
title: "Topic Modeling"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Topic Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Topic modeling discovers latent themes in a text collection. The Shiny app offers two model types:

- **Structural Topic Model (STM):** word-probability topics that incorporate document metadata as
  covariates. Runs in R.
- **Embedding-based topics:** transformer embeddings reduced and clustered into topics. Best for short
  texts and multilingual content. Requires Python (BERTopic) or an embedding API key.

The sections below follow the app's **Topic Modeling** tabs in order, first with STM (run live on the
bundled data) and then with the embedding-based type.

## Setup

A 150-document subset of `SpecialEduTech` keeps the build fast; the full dataset works the same way.

```{r}
library(TextAnalysisR)

mydata <- SpecialEduTech[1:150, ]
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts", remove_stopwords = TRUE)
dfm_object <- quanteda::dfm(tokens)
```

## Model Configuration

`find_optimal_k()` compares models across a range of K via `stm::searchK()`, which reports semantic
coherence, exclusivity, held-out likelihood, and residuals for each K. The example uses a small document
subset and a narrow K range (5 to 6) to keep the runtime short.

```{r}
k_search <- find_optimal_k(dfm_object, topic_range = 5:6)
names(k_search)
```
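
One common way to read such diagnostics is to look for a K that balances semantic coherence against
exclusivity. The sketch below illustrates that trade-off on a made-up diagnostics table. The numbers and
the column names `semcoh` and `exclus` are assumptions for illustration, and the equal-weight score is a
simple heuristic, not what `find_optimal_k()` does internally.

```{r}
# Toy diagnostics table: all values are invented for illustration only.
toy_diag <- data.frame(
  K = c(5, 6, 7, 8),
  semcoh = c(-60, -65, -75, -90),  # higher (less negative) is better
  exclus = c(8.5, 9.0, 9.2, 9.3)   # higher is better
)

# Standardize both diagnostics so they are on a comparable scale, then add them.
toy_diag$score <- as.numeric(scale(toy_diag$semcoh)) +
  as.numeric(scale(toy_diag$exclus))

# Candidate K under this equal-weight heuristic.
toy_best_k <- toy_diag$K[which.max(toy_diag$score)]
toy_diag
toy_best_k
```

Treat such a score as a starting point only: held-out likelihood, residuals, and how interpretable the
topics are should also inform the choice.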

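Fit a small STM for the examples below. The `prevalence` formula lets expected topic proportions vary
with document metadata (here `reference_type` and `year`). `max.em.its = 15` caps the number of EM
iterations to keep the build fast, so the model may stop before it converges.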

```{r}
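# Convert the dfm to stm's input format (documents, vocab, meta).
out <- quanteda::convert(dfm_object, to = "stm")
# Coerce year to numeric so it enters the prevalence formula as a continuous covariate.
out$meta$year <- as.numeric(out$meta$year)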

model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 5,
  prevalence = ~ reference_type + year,
  data = out$meta,
  max.em.its = 15,
  init.type = "Spectral",
  verbose = FALSE
)
```

See: [Structural Topic Model](https://www.structuraltopicmodel.com/) · [stm on CRAN](https://CRAN.R-project.org/package=stm)

## Word-Topic

`get_topic_terms()` returns the top terms per topic from the word-topic (beta) distribution.

```{r}
terms <- get_topic_terms(model, top_term_n = 10)
head(terms, 15)
```
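
Conceptually, the top terms are the highest-probability words in each row of the topic-by-word matrix.
A minimal base R sketch on a toy matrix follows. The vocabulary and probabilities are invented for
illustration, and this is not a description of the internals of `get_topic_terms()`.

```{r}
# Toy word-topic matrix: one row per topic, one column per vocabulary term.
# Each row sums to 1, as a word distribution should.
toy_vocab <- c("reading", "math", "software", "tablet", "tutor")
toy_beta <- rbind(
  c(0.40, 0.10, 0.30, 0.15, 0.05),
  c(0.05, 0.45, 0.12, 0.08, 0.30)
)
colnames(toy_beta) <- toy_vocab

# For each topic, keep the top_n terms with the largest probability.
toy_top_n <- 3
toy_top_terms <- do.call(rbind, lapply(seq_len(nrow(toy_beta)), function(k) {
  idx <- order(toy_beta[k, ], decreasing = TRUE)[seq_len(toy_top_n)]
  data.frame(topic = k, term = toy_vocab[idx], beta = unname(toy_beta[k, idx]))
}))
toy_top_terms
```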

## Content Generation

`generate_topic_content()` drafts topic-grounded content (survey items, research questions, theme
descriptions, policy recommendations, interview questions) from `get_topic_terms()` output, and
`generate_topic_labels()` drafts a short label per topic from the same input. Both call an LLM, so they
need an OpenAI or Gemini API key.

| Content type | Output |
|------|--------|
| `survey_item` | Likert-scale statement |
| `research_question` | Research question |
| `theme_description` | Qualitative theme summary |
| `policy_recommendation` | Action-oriented statement |
| `interview_question` | Open-ended question |

```{r, eval = FALSE}
labels <- generate_topic_labels(terms, provider = "openai")

content <- generate_topic_content(terms, content_type = "research_question", provider = "openai")
```

## Document-Topic

`calculate_topic_probability()` summarizes the corpus-level expected topic proportions (mean of the
per-document theta matrix).

```{r}
doc_topic <- calculate_topic_probability(model, top_n = 10, verbose = FALSE)
doc_topic
```
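
The "mean of the per-document theta matrix" is a column mean: each row of theta is one document's topic
mixture, and averaging down the rows gives the expected share of each topic in the corpus. A toy
illustration in base R (the matrix values are invented):

```{r}
# Toy theta: 4 documents (rows) by 3 topics (columns); each row sums to 1.
toy_theta <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.1, 0.8, 0.1),
  c(0.3, 0.3, 0.4),
  c(0.5, 0.1, 0.4)
)
colnames(toy_theta) <- paste0("topic", 1:3)

# Corpus-level expected topic proportions: average each column over documents.
toy_topic_share <- colMeans(toy_theta)
toy_topic_share
```

For a fitted STM the per-document matrix is stored in `model$theta`, so `colMeans(model$theta)` gives
the same kind of summary.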

## Quotes

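`stm::findThoughts()` returns the documents most representative of a topic, that is, those with the
highest estimated proportion for it. `texts` must hold one entry per document in the fitted model, in
the same order.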

```{r}
quotes <- stm::findThoughts(model, texts = united_tbl$united_texts, topics = 1, n = 3)
quotes
```
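
The selection amounts to sorting documents by their proportion for the chosen topic and keeping the
first few. A minimal sketch with invented texts and proportions, which also shows why the texts and the
rows of theta have to be in the same order:

```{r}
# Toy inputs: one text per row of the toy theta matrix, in the same order.
toy_quote_texts <- c("doc A", "doc B", "doc C", "doc D")
toy_quote_theta <- rbind(
  c(0.7, 0.3),
  c(0.2, 0.8),
  c(0.9, 0.1),
  c(0.4, 0.6)
)

# Indices of the 2 documents with the highest proportion for topic 1.
toy_top_docs <- order(toy_quote_theta[, 1], decreasing = TRUE)[1:2]
toy_quote_texts[toy_top_docs]
```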

## Estimated Effects

`stm::estimateEffect()` regresses topic proportions on document covariates; `tidytext::tidy()` returns
the per-topic coefficient table shown in the app's Estimated Effects tab.

```{r}
prep <- stm::estimateEffect(1:5 ~ reference_type + year, model,
                            metadata = out$meta, uncertainty = "Global")
head(tidytext::tidy(prep), 10)
```
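
As a rough intuition for what is being estimated, the sketch below regresses one topic's proportion on
a categorical and a continuous covariate with ordinary least squares. It is a simplification: the data
are invented, the factor levels `type_a` and `type_b` are hypothetical, and unlike `estimateEffect()`
with `uncertainty = "Global"` it treats the topic proportions as known rather than propagating their
estimation uncertainty.

```{r}
# Toy metadata: 8 documents with a two-level type and a publication year.
toy_meta <- data.frame(
  reference_type = factor(rep(c("type_a", "type_b"), times = 4)),
  year = 2010:2017
)

# Invented proportions for one topic: rises 0.01 per year, 0.10 higher for type_b.
toy_meta$topic_prop <- 0.20 + 0.01 * (toy_meta$year - 2010) +
  0.10 * (toy_meta$reference_type == "type_b")

# One regression per topic; here a single topic for brevity.
toy_fit <- stats::lm(topic_prop ~ reference_type + year, data = toy_meta)
round(stats::coef(toy_fit), 3)
```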

## Categorical Covariates

`plot_topic_effects_categorical()` plots topic proportions across the levels of a categorical covariate.
`stm::plot.estimateEffect(..., method = "pointestimate", omit.plot = TRUE)` returns the model-based
estimate and 95% CI per topic and level, reshaped here into the columns the plot expects.

```{r}
ec <- stm::plot.estimateEffect(prep, "reference_type", method = "pointestimate",
                               model = model, omit.plot = TRUE)
effects_cat <- do.call(rbind, lapply(seq_along(ec$topics), function(i) {
  data.frame(topic = ec$topics[i], value = as.character(ec$uvals),
             proportion = as.numeric(ec$means[[i]]),
             lower = as.numeric(ec$cis[[i]][1, ]), upper = as.numeric(ec$cis[[i]][2, ]))
}))
plot_topic_effects_categorical(effects_cat)
```

## Continuous Covariates

`plot_topic_effects_continuous()` plots topic proportions across a continuous covariate. The same
`plot.estimateEffect()` call with `method = "continuous"` returns the estimate and CI over a grid of
covariate values. The interval element is named `ci` here, rather than `cis` as in the point-estimate
output.

```{r}
en <- stm::plot.estimateEffect(prep, "year", method = "continuous",
                               model = model, omit.plot = TRUE)
effects_cont <- do.call(rbind, lapply(seq_along(en$topics), function(i) {
  data.frame(topic = en$topics[i], value = en$x,
             proportion = as.numeric(en$means[[i]]),
             lower = as.numeric(en$ci[[i]][1, ]), upper = as.numeric(en$ci[[i]][2, ]))
}))
plot_topic_effects_continuous(effects_cont)
```

## Embedding-based Topics

The second type embeds documents with a transformer, reduces dimensionality, and clusters the
embeddings into topics. `get_best_embeddings()` produces the embeddings and `fit_embedding_model()`
clusters them; `generate_topic_labels()` drafts labels. These need Python (BERTopic) or an embedding API
key, so they are shown without running.

```{r, eval = FALSE}
embeddings <- get_best_embeddings(united_tbl$united_texts, provider = "openai")

topics <- fit_embedding_model(
  united_tbl$united_texts,
  method = "umap_hdbscan",
  n_topics = 10,
  precomputed_embeddings = embeddings
)

embedding_labels <- generate_topic_labels(topics$topic_terms, provider = "openai")
```

| Provider | Model | Notes |
|----------|-------|-------|
| Sentence Transformers (default) | all-MiniLM-L6-v2, all-mpnet-base-v2 | Requires Python |
| OpenAI | text-embedding-3-small, text-embedding-3-large | API key |
| Gemini | gemini-embedding-001 | API key |

R backend methods follow `{dimred}_{clustering}` (e.g., `"umap_dbscan"`, `"tsne_kmeans"`,
`"pca_hierarchical"`).

See: [BERTopic](https://maartengr.github.io/BERTopic/) · [Sentence-BERT](https://www.sbert.net/)
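
To make the reduce-then-cluster idea concrete, here is a minimal base R sketch that pairs PCA with
k-means on a toy embedding matrix. The embeddings are invented and far smaller than real transformer
output, and the code is an illustration of the general technique under those assumptions, not the
implementation behind `fit_embedding_model()`.

```{r}
# Toy "embeddings": 6 documents by 4 dimensions, forming two obvious groups.
toy_emb <- rbind(
  c(1.0, 0.9, 0.1, 0.0),
  c(0.9, 1.0, 0.0, 0.1),
  c(1.1, 1.0, 0.1, 0.1),
  c(0.0, 0.1, 1.0, 0.9),
  c(0.1, 0.0, 0.9, 1.0),
  c(0.1, 0.1, 1.1, 1.0)
)

# Step 1: reduce dimensionality (PCA, keeping the first two components).
toy_pca <- stats::prcomp(toy_emb, center = TRUE, scale. = FALSE)
toy_reduced <- toy_pca$x[, 1:2]

# Step 2: cluster the reduced points; each cluster plays the role of a topic.
set.seed(1)
toy_km <- stats::kmeans(toy_reduced, centers = 2, nstart = 10)
toy_km$cluster
```

Cluster labels are arbitrary, so only the grouping of documents is meaningful, not which cluster is
numbered 1 or 2.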

## STM vs Embedding

| Feature | STM | Embedding |
|---------|-----|-----------|
| Speed | Fast | Medium |
| Metadata covariates | Yes | No |
| Short texts | Weaker | Strong |
| Multilingual | No | Yes |
| Dependencies | R only | Python or API |