Topic modeling (LDA) — Glossary Aria Research

Extended definition

Topic modeling with LDA (Latent Dirichlet Allocation) is a probabilistic generative model that discovers latent topics in a document corpus. The intuition: each document is generated as a mixture of topics (distribution $\theta_d$ over $K$ topics), and each topic is a distribution over words (distribution $\beta_k$ over the vocabulary). The formal generative model:

$$\theta_d \sim \text{Dir}(\alpha), \quad \beta_k \sim \text{Dir}(\eta), \quad z_{d,n} \sim \text{Cat}(\theta_d), \quad w_{d,n} \sim \text{Cat}(\beta_{z_{d,n}})$$

where $\alpha$ and $\eta$ are hyperparameters of the Dirichlet priors, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the observed word. The latent distributions are inferred with variational Bayes or Gibbs sampling. Blei, Ng, and Jordan (2003, JMLR) formalized the model; Blei (2012, CACM) gave an accessible presentation for a broad audience. LDA dominated topic modeling for over a decade until the arrival of embedding-based methods (Top2Vec, BERTopic), which leverage dense neural representations and often yield better topic quality on small corpora or on corpora with modern vocabulary.
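The generative process above can be simulated directly. The following is a minimal sketch using only the Python standard library; the function names, the toy vocabulary, and the hyperparameter values are illustrative assumptions, and symmetric priors (a single scalar for $\alpha$ and for $\eta$) are assumed for simplicity.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]


def generate_corpus(num_docs, doc_length, vocab, num_topics, alpha, eta, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = random.Random(seed)
    # beta_k ~ Dir(eta): one distribution over the vocabulary per topic
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus, thetas = [], []
    for _ in range(num_docs):
        # theta_d ~ Dir(alpha): topic mixture of this document
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # z_{d,n} ~ Cat(theta_d), then w_{d,n} ~ Cat(beta_{z_{d,n}})
            z = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append(words)
        thetas.append(theta)
    return corpus, thetas, beta


# Hypothetical toy vocabulary and hyperparameters, chosen only for illustration
vocab = ["gene", "cell", "vote", "party", "goal", "match"]
corpus, thetas, beta = generate_corpus(
    num_docs=3, doc_length=8, vocab=vocab, num_topics=2, alpha=0.5, eta=0.3
)
for words, theta in zip(corpus, thetas):
    print([round(p, 2) for p in theta], " ".join(words))
```

Inference runs this process in reverse: given only the observed words, it estimates the $\theta_d$ and $\beta_k$ that plausibly produced them.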

When it applies

LDA applies in exploratory analysis of large text corpora (news collections, scientific abstracts, transcripts, historical corpora) when the goal is to discover thematic structure without prior hypotheses about the topics. It is standard in digital humanities (analysis of historical newspapers, literary corpora), bibliometrics (scientific literature mapping), political science (parliamentary discourse analysis), and social media research (thematic clusters in tweets and posts). In exploratory qualitative research at scale, LDA offers a first approximation that can be refined with human coding. Typical applications use 5 to 50 topics; semantic coherence, commonly measured with PMI-based scores, is the usual quality criterion.
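To make the PMI-based coherence criterion concrete, the sketch below scores a topic by the average pointwise mutual information of its top-word pairs, with probabilities estimated from document-level co-occurrence. It is a simplified illustration: the function name, the toy documents, and the `eps` smoothing are assumptions, and library implementations typically use sliding windows and normalized variants.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words (document co-occurrence)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        # eps (an assumed smoothing constant) avoids log(0) and division by zero
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)


# Hypothetical tokenized documents, for illustration only
documents = [
    ["gene", "cell", "protein"],
    ["gene", "cell", "dna"],
    ["vote", "party", "election"],
    ["vote", "party", "gene"],
]

coherent_topic = ["gene", "cell", "dna"]  # words that tend to co-occur
mixed_topic = ["cell", "vote"]            # words that never co-occur here

print(pmi_coherence(coherent_topic, documents))  # positive score
print(pmi_coherence(mixed_topic, documents))     # strongly negative score
```

A higher score means the topic's top words appear together more often than chance would predict, which is the sense in which a topic is called coherent.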

When it does not apply

It does not apply to small corpora (< 1,000 documents): LDA needs sufficient statistics to stabilize its distributions; alternatives are manual thematic analysis or BERTopic, which relies on pretrained embeddings. It does not apply to very short documents (isolated tweets): the bag-of-words representation discards essential contextual information, and topic models designed for short texts are the better alternative. It does not apply as ground truth of “real topics”: LDA produces statistical groupings, and interpreting them as coherent topics requires human validation. It does not replace embedding-based models in modern corpora with varied vocabulary: Top2Vec and BERTopic often produce more coherent topics on contemporary corpora. It does not apply as a supervised classifier: LDA is unsupervised by construction.

Applications by field

- Digital humanities: analysis of literary, journalistic, and archival corpora at scales impossible for manual reading.
- Bibliometrics: thematic mapping of scientific literature; LDA on Scopus/WoS abstracts.
- Social media analysis: identification of emerging themes in large corpora of tweets, posts, and comments.
- Political analysis: parliamentary speeches, manifestos, and communiqués, with LDA as a first thematic approximation.

Common pitfalls

The first pitfall is choosing the number of topics $K$ arbitrarily: metrics like semantic coherence ($C_V$, UMass) and perplexity help, but the final decision should be informed by the research objective. The second is interpreting topics without human validation: a statistically well-defined topic can turn out to be noise when a human reads its top words. The third is failing to preprocess adequately: stopword removal, stemming, and minimum and maximum term-frequency thresholds strongly affect results. The fourth is trusting LDA on small corpora: BERTopic with pretrained models often produces better topics from the same data. The fifth is failing to document the exact implementation (gensim, sklearn, or MALLET, with its hyperparameters) and the random seed: LDA results are sensitive to initialization and are not fully deterministic without explicit control.
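The effect of preprocessing choices can be seen even on a toy corpus. The sketch below builds a vocabulary by removing stopwords and pruning terms by document frequency; the function names, thresholds, and example documents are illustrative assumptions, and stemming is left out. The thresholds play a role similar to `min_df` and `max_df` in sklearn's `CountVectorizer`.

```python
from collections import Counter


def build_vocabulary(documents, stopwords=frozenset(), min_df=2, max_df_ratio=0.5):
    """Keep non-stopword terms whose document frequency lies within the bounds."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return sorted(
        term
        for term, df in doc_freq.items()
        if term not in stopwords and df >= min_df and df / n_docs <= max_df_ratio
    )


def filter_documents(documents, vocabulary):
    """Drop every token that is not in the retained vocabulary."""
    vocab = set(vocabulary)
    return [[t for t in doc if t in vocab] for doc in documents]


# Hypothetical tokenized corpus, for illustration only
docs = [
    "the gene regulates the cell".split(),
    "the cell expresses a gene".split(),
    "the party won the vote".split(),
    "a vote split the party".split(),
]

with_stopwords = build_vocabulary(docs, stopwords={"the", "a"})
without_stopwords = build_vocabulary(docs)
print(with_stopwords)
print(without_stopwords)
print(filter_documents(docs, with_stopwords))
```

Without the stopword list, the frequency bounds alone remove "the" but keep "a", so the model input already differs before any topic is fitted. Recording these settings alongside the implementation and seed makes a run reproducible.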