---
title: "Lexical Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Lexical Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Lexical analysis examines word patterns, distinctiveness, and complexity. The sections below follow the
Shiny app's **Lexical Analysis** tabs in order.

## Setup

A 150-document subset of `SpecialEduTech` keeps the build fast; the full dataset works the same way.

```{r}
library(TextAnalysisR)

mydata <- SpecialEduTech[1:150, ]
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts", remove_stopwords = TRUE)
dfm_object <- quanteda::dfm(tokens)
```

## Linguistic Annotation

Token-level annotation (lemmas, part-of-speech, morphology, dependencies, named entities) uses spaCy
through reticulate, so the examples in this section require Python and are not run here.

### Part-of-Speech Tags

`extract_pos_tags()` returns one row per token with the columns `doc_id`, `token`, `lemma`, `pos`, and
`tag`. Universal POS tags include NOUN, VERB, ADJ, ADV (content words), PROPN (proper nouns), and DET,
ADP, PRON (function words).

```{r, eval = FALSE}
pos <- extract_pos_tags(united_tbl$united_texts)
```

### Morphological Features

`extract_morphology()` extracts grammatical features such as Number (Sing/Plur), Tense (Past/Pres/Fut),
VerbForm, Person, and Case.

```{r, eval = FALSE}
morphology <- extract_morphology(united_tbl$united_texts)
```

### Named Entity Recognition

`extract_named_entities()` tags entities such as PERSON, ORG, GPE/LOC, and DATE/MONEY/PERCENT.

```{r, eval = FALSE}
entities <- extract_named_entities(united_tbl$united_texts)
```

## Frequency Trends

`plot_word_frequency()` shows the most frequent terms in the document-feature matrix.

```{r}
plot_word_frequency(dfm_object, n = 20)
```

## Keywords

### TF-IDF

`extract_keywords_tfidf()` weights terms that are frequent in a document but rare across the corpus,
surfacing distinctive vocabulary.

```{r}
keywords <- extract_keywords_tfidf(dfm_object, top_n = 10)
plot_tfidf_keywords(keywords)
```
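
The weighting idea can be illustrated on a toy count matrix. This is a minimal base-R sketch of one common TF-IDF variant (proportional term frequency, base-10 inverse document frequency); the counts and the `toy_` names are made up for illustration, and the exact scheme used inside `extract_keywords_tfidf()` may differ.

```{r}
# Toy document-term counts: 3 documents, 3 terms (made-up numbers).
toy_counts <- matrix(
  c(2, 1, 0,
    2, 0, 3,
    2, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("student", "tablet", "autism"))
)

# Term frequency: share of each document's tokens taken by the term.
toy_tf <- toy_counts / rowSums(toy_counts)

# Inverse document frequency: log of (documents / documents containing the term).
toy_df <- colSums(toy_counts > 0)
toy_idf <- log10(nrow(toy_counts) / toy_df)

# TF-IDF: multiply each term's column by its IDF weight.
toy_tfidf <- sweep(toy_tf, 2, toy_idf, `*`)
round(toy_tfidf, 3)
```

Because "student" occurs in every toy document, its IDF is zero and it drops out, while "autism" scores highest in the one document where it is concentrated.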

### Statistical Keyness

`extract_keywords_keyness()` identifies terms that distinguish one group from the rest using a
log-likelihood (G^2^) statistic.

```{r}
keyness <- extract_keywords_keyness(
  dfm_object,
  target = quanteda::docvars(dfm_object, "reference_type") == "journal_article"
)
plot_keyness_keywords(keyness)
```
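
For a single term, the log-likelihood statistic compares observed counts in a 2 × 2 table (target vs. reference, term vs. all other tokens) with the counts expected if the term were equally common in both groups. The helper below is a hypothetical base-R sketch of that textbook calculation, not the package's implementation; `toy_g2()` and its arguments are names chosen for illustration.

```{r}
# G-squared for one term.
#   a, b                  : occurrences of the term in target / reference
#   n_target, n_reference : total tokens in target / reference
toy_g2 <- function(a, b, n_target, n_reference) {
  # Rows: target, reference. Columns: the term, all other tokens.
  observed <- matrix(c(a, b, n_target - a, n_reference - b), nrow = 2)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with a zero observed count contribute nothing to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

toy_g2(20, 20, 1000, 1000)  # same rate in both groups
toy_g2(30, 10, 1000, 1000)  # term over-represented in the target
```

The statistic is zero when the term occurs at the same rate in both groups and grows as the rates diverge. It is unsigned as written, so the direction of the difference has to be read from the counts.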

### Comparison

`plot_keyword_comparison()` places TF-IDF scores next to term frequency for the top keywords.

```{r}
plot_keyword_comparison(keywords, top_n = 10)
```

## Lexical Diversity

`lexical_diversity_analysis()` reports vocabulary-richness indices. MTLD and MATTR are stable across text
lengths; TTR is length-sensitive, and CTTR only partly corrects for length.

```{r}
diversity <- lexical_diversity_analysis(dfm_object)
plot_lexical_diversity_distribution(diversity$lexical_diversity, metric = "TTR")
```

| Metric | Description | Note |
|--------|-------------|------|
| TTR | Types / Tokens | Length-sensitive |
| CTTR | Types / sqrt(2 × Tokens) | Partly length-corrected |
| MATTR | Moving-average TTR | Stable across lengths |
| MTLD | Mean length of token runs that keep TTR above a threshold | Largely length-independent |
| Maas | Log-based index | Lower = more diverse |
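
The first three rows of the table can be computed by hand on a short token vector. This is a base-R sketch under the definitions in the table; the sentence, the window size of 5, and the `toy_` names are assumptions made for illustration, and `lexical_diversity_analysis()` may use different defaults.

```{r}
toy_words <- c("the", "student", "used", "the", "tablet", "and",
               "the", "student", "liked", "the", "tablet")

toy_n_tokens <- length(toy_words)          # running words
toy_n_types  <- length(unique(toy_words))  # distinct words

toy_ttr  <- toy_n_types / toy_n_tokens            # Types / Tokens
toy_cttr <- toy_n_types / sqrt(2 * toy_n_tokens)  # Types / sqrt(2 x Tokens)

# MATTR: average the TTR over every window of `window` consecutive tokens.
# Assumes the text has at least `window` tokens.
toy_mattr <- function(words, window = 5) {
  starts <- seq_len(length(words) - window + 1)
  window_ttr <- vapply(starts, function(i) {
    length(unique(words[i:(i + window - 1)])) / window
  }, numeric(1))
  mean(window_ttr)
}

c(TTR = toy_ttr, CTTR = toy_cttr, MATTR = toy_mattr(toy_words, window = 5))
```

Appending more repetitions of the same sentence lowers TTR, because no new types are added, while the windowed average stays close to its original value.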

## Readability

`calculate_text_readability()` computes grade-level and reading-ease indices from sentence and word
structure.

```{r}
readability <- calculate_text_readability(united_tbl$united_texts)
plot_readability_distribution(readability, metric = "flesch")
```

| Metric | Basis | Output |
|--------|-------|--------|
| Flesch Reading Ease | Sentence length + syllables | 0-100 (higher = easier) |
| Flesch-Kincaid | Sentence length + syllables | Grade level |
| Gunning Fog | Sentence length + complex words | Years of education |
| SMOG | Polysyllabic words | Years of education |
| ARI | Characters per word + sentence length | Grade level |
| Coleman-Liau | Letters and sentences per 100 words | Grade level |
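
The two Flesch indices in the table are linear formulas in average sentence length and average syllables per word. The sketch below applies the standard published coefficients to counts supplied by hand; the helper names and the example counts are hypothetical, and how `calculate_text_readability()` counts sentences and syllables is not shown here.

```{r}
# Flesch Reading Ease: higher scores indicate easier text.
toy_flesch <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Flesch-Kincaid: approximate U.S. school grade level.
toy_flesch_kincaid <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# A made-up passage: 100 words, 5 sentences, 150 syllables.
toy_flesch(100, 5, 150)
toy_flesch_kincaid(100, 5, 150)
```

Both formulas use the same two ratios, which is why the two indices move in opposite directions: longer sentences and longer words lower the reading-ease score and raise the grade level.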

## Log Odds Ratio

`calculate_log_odds_ratio()` compares term frequencies between categories to find distinctive vocabulary.

```{r}
log_odds <- calculate_log_odds_ratio(
  dfm_object,
  group_var = "reference_type",
  comparison_mode = "binary",
  top_n = 15
)
plot_log_odds_ratio(log_odds)
```
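
For one term and two groups, the log odds ratio can be computed directly from counts. This is a minimal base-R sketch; the helper name `toy_log_odds()` and the smoothing constant of 0.5 (added so that zero counts do not produce infinite values) are assumptions for illustration, not necessarily what `calculate_log_odds_ratio()` does internally.

```{r}
# Log odds ratio of a term between group A and group B.
#   count_*: occurrences of the term in the group
#   total_*: total tokens in the group
toy_log_odds <- function(count_a, total_a, count_b, total_b, smooth = 0.5) {
  odds_a <- (count_a + smooth) / (total_a - count_a + smooth)
  odds_b <- (count_b + smooth) / (total_b - count_b + smooth)
  log(odds_a / odds_b)
}

toy_log_odds(30, 1000, 10, 1000)  # positive: more typical of group A
toy_log_odds(10, 1000, 30, 1000)  # negative: more typical of group B
toy_log_odds(20, 1000, 20, 1000)  # zero: no difference
```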

`calculate_weighted_log_odds()` uses the tidylo package to scale each log odds ratio by its estimated
uncertainty, giving a z-score (Monroe et al.), so reliably distinctive terms rank above rare terms with
extreme ratios.

```{r}
weighted_odds <- calculate_weighted_log_odds(
  dfm_object,
  group_var = "reference_type",
  top_n = 15
)
plot_weighted_log_odds(weighted_odds)
```
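
The effect of the z-score can be seen by dividing a smoothed log odds ratio by an approximate standard error. The sketch below is a simplified version with a flat prior of 0.5 per count, which is an assumption made for illustration; tidylo estimates its prior from the data, so its values will differ. `toy_z_log_odds()` is a hypothetical helper.

```{r}
toy_z_log_odds <- function(count_a, total_a, count_b, total_b, prior = 0.5) {
  # Smoothed log odds ratio between group A and group B.
  log_odds <- log((count_a + prior) / (total_a - count_a + prior)) -
    log((count_b + prior) / (total_b - count_b + prior))
  # Approximate standard error: large when the term is rare.
  std_error <- sqrt(1 / (count_a + prior) + 1 / (count_b + prior))
  c(log_odds = log_odds, z = log_odds / std_error)
}

toy_rare     <- toy_z_log_odds(3, 1000, 0, 1000)    # rare term, extreme ratio
toy_frequent <- toy_z_log_odds(60, 1000, 30, 1000)  # frequent term, moderate ratio
rbind(rare = toy_rare, frequent = toy_frequent)
```

The rare term has the larger raw log odds ratio, but its standard error is also large, so the frequent term ends up with the higher z-score.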

## Lexical Dispersion

`calculate_lexical_dispersion()` locates each occurrence of the selected terms within the documents, and
`plot_lexical_dispersion()` draws those positions as an X-ray plot.

```{r}
dispersion <- calculate_lexical_dispersion(tokens[1:50], terms = c("education", "technology"))
plot_lexical_dispersion(dispersion)
```
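
An X-ray plot is built from token positions. As a rough base-R illustration, the positions of a term in a single tokenized document can be found with `which()` and rescaled to the 0-1 range so that documents of different lengths are comparable; the toy document and the choice of relative scaling are assumptions here, and the package may report positions differently.

```{r}
toy_doc <- c("education", "and", "technology", "in", "special", "education")

# Absolute positions of the term, counted from the first token.
toy_positions <- which(toy_doc == "education")

# Relative positions: 1 marks the last token of the document.
toy_relative <- toy_positions / length(toy_doc)

data.frame(term = "education", position = toy_positions, relative = toy_relative)
```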

## Multi-Word Expressions

Multi-word (n-gram) detection belongs to the **Preprocess → Multi-Word Dictionary** step in the app.
`detect_multi_words()` returns a collocations table to feed `quanteda::tokens_compound()`.

```{r}
compounds <- detect_multi_words(tokens, min_count = 10)
head(compounds, 10)
```
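
The compounding step itself can be sketched with quanteda alone. The example below joins one hand-written phrase in a toy sentence; in practice the phrases would come from the table returned by `detect_multi_words()`, whose column names are not assumed here.

```{r}
toy_toks <- quanteda::tokens(
  c(d1 = "special education technology supports special education teachers")
)

# Join every occurrence of the two-word phrase into a single token.
toy_compounded <- quanteda::tokens_compound(
  toy_toks,
  pattern = quanteda::phrase("special education")
)
as.list(toy_compounded)
```

With the default concatenator, the joined token is written with an underscore (`special_education`), so it is counted as one feature in any document-feature matrix built afterwards.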