Multimodal Analysis

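library(TextAnalysisR)

# Two short example sentences that refer to a figure and a table
sample_text <- c(
  "Figure 1 shows the distribution of student outcomes.",
  "Table 2 reports the effect sizes for each intervention."
)

# Tokenize the texts; text_field names the column that holds them
toks <- prep_texts(
  data.frame(united_texts = sample_text),
  text_field = "united_texts"
)

# Count the tokens in each document
quanteda::ntoken(toks)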

## text1 text2 
##     7     8

Extract text from PDFs that contain charts, diagrams, and images by combining standard text extraction with a vision LLM. The pipeline is R-native, so no Python is required.

How It Works

1. Extracts text from each page using pdftools::pdf_text() (R-native)
2. Renders each page as a PNG image via pdftools::pdf_render_page()
3. Identifies sparse-text pages (< 500 characters) that likely contain figures
4. Sends only those pages to a vision LLM for description
5. Merges extracted text + image descriptions into a single text corpus
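
The sparse-page check and the final merge can be sketched in base R. This is a minimal illustration, not the package's implementation: find_sparse_pages() and merge_pages() are hypothetical names, the input stands in for the one-string-per-page vector that pdftools::pdf_text() returns, and the vision LLM call is replaced by a placeholder string. It also assumes the 500-character threshold is applied to the raw page text.

```r
# Hypothetical helpers, not part of the TextAnalysisR API.

# Flag pages whose extracted text is shorter than the threshold.
find_sparse_pages <- function(page_text, min_chars = 500) {
  which(nchar(page_text) < min_chars)
}

# Append each image description to its page, then join all pages.
# `descriptions` is a character vector named by page number.
merge_pages <- function(page_text, descriptions) {
  for (page in names(descriptions)) {
    i <- as.integer(page)
    page_text[i] <- paste(page_text[i], descriptions[[page]], sep = "\n")
  }
  paste(page_text, collapse = "\n\n")
}

# Stand-in for pdf_text() output: one dense page and one sparse page.
page_text <- c(
  strrep("Dense body text. ", 40),  # well over 500 characters
  "Figure 1. Outcomes by group"     # far below the threshold
)

sparse <- find_sparse_pages(page_text)

# Placeholder for the vision step; a real run would send the rendered
# PNG of each sparse page to a vision LLM and keep its description.
descriptions <- setNames(
  sprintf("[Image description for page %d]", sparse),
  sparse
)

corpus <- merge_pages(page_text, descriptions)
```

Here only the second page is flagged, so only one page would incur a vision call, which is the point of filtering on text length before contacting the provider.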

Functions

process_pdf_unified() runs the full pipeline with automatic fallback:

Multimodal (pdftools + vision LLM) – extracts text and describes visual content
Text-only (pdftools) – fallback when no vision provider is set
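
The fallback decision can be pictured as a simple check for a configured provider. The sketch below is an assumption about the general shape of that logic, not the package's code: choose_pdf_mode() is a hypothetical helper, and the environment variable names OPENAI_API_KEY and GEMINI_API_KEY are illustrative guesses at where a key might be stored.

```r
# Hypothetical helper: pick the pipeline mode from available API keys.
# The environment variable names are assumptions, not documented settings.
choose_pdf_mode <- function(
    keys = Sys.getenv(c("OPENAI_API_KEY", "GEMINI_API_KEY"))) {
  # Sys.getenv() returns "" for unset variables, so nzchar() detects a key.
  if (any(nzchar(keys))) "multimodal" else "text_only"
}

# No key configured: fall back to text-only extraction.
mode_without_key <- choose_pdf_mode(c(OPENAI_API_KEY = "", GEMINI_API_KEY = ""))

# One key configured: the multimodal path is available.
mode_with_key <- choose_pdf_mode(c(OPENAI_API_KEY = "sk-example", GEMINI_API_KEY = ""))
```

Passing the keys as an argument keeps the check easy to test without changing the real environment.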

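describe_image() describes a single base64-encoded PNG. Image description in either function requires a vision-provider API key (OpenAI or Gemini) and network access; without a provider, process_pdf_unified() falls back to text-only extraction. See the functions' reference pages for usage.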

Provider Comparison

Provider	Cost	Privacy	Accuracy	Setup
OpenAI	Per use	Cloud	Best	API key
Gemini	Free on hosted app (Google Cloud Research); otherwise per use	Cloud	Best	API key