---
title: "Semantic Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Semantic Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Semantic analysis examines relationships of meaning between words and documents. The sections below follow the Shiny app's **Semantic Analysis** tabs in order.

## Setup

A 150-document subset of `SpecialEduTech` keeps the build fast; the full dataset works the same way.

```{r}
library(TextAnalysisR)

mydata <- SpecialEduTech[1:150, ]
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts", remove_stopwords = TRUE)
dfm_object <- quanteda::dfm(tokens)
```

## Word Co-occurrence

`word_co_occurrence_network()` builds a network of words that co-occur across documents, with community detection and centrality metrics.

```{r}
network <- word_co_occurrence_network(
  dfm_object,
  co_occur_n = 50,
  top_node_n = 30,
  node_label_size = 22,
  community_method = "leiden"
)

network$plot
network$table
network$summary
```

## Word Correlation

`word_correlation_network()` connects words by phi correlation of their document co-occurrence patterns.

```{r}
corr_network <- word_correlation_network(
  dfm_object,
  common_term_n = 15,
  corr_n = 0.3,
  community_method = "leiden"
)

corr_network$plot
corr_network$table
```

## Document Similarity

`semantic_similarity_analysis()` compares documents by words, n-grams, or embeddings (embeddings require Python). The example below uses word features on a subset and renders a cosine similarity heatmap.

```{r}
subset_texts <- united_tbl$united_texts[1:10]

similarity <- semantic_similarity_analysis(
  subset_texts,
  document_feature_type = "words",
  similarity_method = "cosine",
  verbose = FALSE
)

plot_similarity_heatmap(similarity$similarity_matrix, method_name = "Cosine")
```

| Method | Description | Requires |
|--------|-------------|----------|
| Words | Word-frequency vectors (bag-of-words) | none |
| N-grams | Word-sequence vectors | none |
| Embeddings | Transformer sentence vectors | Python |

## Comparative Analysis

Comparative analysis scores how similar documents in one category are to a reference category. `extract_cross_category_similarities()` pulls cross-category pairs from a similarity matrix and `analyze_similarity_gaps()` summarizes the differences. The example uses the first 30 documents.

```{r}
term_matrix <- as.matrix(dfm_object)[1:30, ]
normalized <- term_matrix / sqrt(rowSums(term_matrix^2))
sim_matrix <- normalized %*% t(normalized)

docs_data <- data.frame(
  display_name = paste0("doc", 1:30),
  reference_type = quanteda::docvars(dfm_object, "reference_type")[1:30]
)
dimnames(sim_matrix) <- list(docs_data$display_name, docs_data$display_name)

cross <- extract_cross_category_similarities(
  sim_matrix,
  docs_data,
  reference_category = "journal_article",
  category_var = "reference_type",
  id_var = "display_name"
)

gaps <- analyze_similarity_gaps(cross)
gaps$summary_stats
```

## Semantic Search

`run_rag_search()` retrieves the documents most relevant to a query using embedding similarity. It requires an OpenAI or Gemini API key; see [AI Integration](ai_integration.html).

```{r, eval = FALSE}
results <- run_rag_search(
  query = "math intervention for students with disabilities",
  documents = united_tbl$united_texts,
  provider = "openai"
)
```

## Sentiment & Emotion

`sentiment_lexicon_analysis()` scores documents with the Bing, AFINN, or NRC lexicon. The Bing example runs below.

```{r}
sentiment <- sentiment_lexicon_analysis(dfm_object, lexicon = "bing")
plot_sentiment_distribution(sentiment$document_sentiment)
```

NRC also yields discrete emotions for `plot_emotion_radar()`. NRC downloads through `textdata` behind a license-gated prompt, so the emotion example is shown but not run.

```{r, eval = FALSE}
emotion <- sentiment_lexicon_analysis(dfm_object, lexicon = "nrc")
plot_emotion_radar(emotion$emotion_scores)
```

`plot_sentiment_by_category()` compares sentiment across a metadata category after joining it to the scored documents. Transformer (`sentiment_embedding_analysis()`) and LLM (`analyze_sentiment_llm()`) scoring require Python or an API key and are not run here.

```{r}
scored <- sentiment_lexicon_analysis(dfm_object, lexicon = "bing")$document_sentiment
scored$reference_type <- quanteda::docvars(dfm_object, "reference_type")[
  match(scored$document, quanteda::docnames(dfm_object))
]

plot_sentiment_by_category(scored, category_var = "reference_type")
```

## Document Groups

`cluster_embeddings()` groups documents from a feature matrix. K-means and hierarchical clustering run in base R; the app's default UMAP + DBSCAN path requires Python. `generate_cluster_labels_auto()` labels each group with its most distinctive terms.

```{r}
data_matrix <- as.matrix(dfm_object)

groups <- cluster_embeddings(data_matrix, method = "kmeans", n_clusters = 5, verbose = FALSE)
groups$n_clusters

labels <- generate_cluster_labels_auto(data_matrix, groups$clusters, method = "tfidf", n_terms = 3)
labels
```

A 2-D map of the groups uses `reduce_dimensions()` (PCA runs in base R; t-SNE and UMAP need their packages).

```{r}
coords <- reduce_dimensions(data_matrix, method = "PCA", n_components = 2, verbose = FALSE)
head(coords$reduced_data)
```
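
For intuition about the k-means and PCA steps in this section, the sketch below repeats them with base R only (`stats::kmeans()` and `stats::prcomp()`) on a made-up six-document term matrix. It is an illustration under those assumptions, not a description of how `cluster_embeddings()` or `reduce_dimensions()` work internally; the toy counts, term names, and every `toy_*` object are hypothetical.

```{r}
# Hypothetical document-term counts: docs 1-3 lean on math terms,
# docs 4-6 lean on literacy terms. Not taken from SpecialEduTech.
toy_dtm <- rbind(
  doc1 = c(4, 3, 5, 0, 0, 1),
  doc2 = c(5, 4, 3, 1, 0, 0),
  doc3 = c(3, 5, 4, 0, 1, 0),
  doc4 = c(0, 1, 0, 4, 5, 3),
  doc5 = c(1, 0, 0, 5, 3, 4),
  doc6 = c(0, 0, 1, 3, 4, 5)
)
colnames(toy_dtm) <- c("math", "fraction", "algebra",
                       "reading", "phonics", "vocabulary")

# K-means with two groups; nstart reruns the algorithm from several
# random starts and keeps the solution with the lowest within-group sum of squares.
set.seed(123)
toy_clusters <- stats::kmeans(toy_dtm, centers = 2, nstart = 10)
toy_clusters$cluster

# PCA on the centered counts; the first two components give 2-D coordinates.
toy_pca <- stats::prcomp(toy_dtm, center = TRUE, scale. = FALSE)
toy_coords <- toy_pca$x[, 1:2]

# Coordinates alongside the assigned group, ready for a scatter plot.
data.frame(toy_coords, group = factor(toy_clusters$cluster))
```

On data this cleanly separated, the two topic groups fall on opposite sides of the first principal component, which is the pattern a 2-D map is meant to reveal. Real document-term matrices are sparser and noisier, so group boundaries are usually less sharp.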