AI & MACHINE LEARNING

BERTopic

Topic modeling algorithm that combines contextual embeddings (BERT, Sentence-Transformers), dimensionality reduction (UMAP), clustering (HDBSCAN), and c-TF-IDF term ranking. Grootendorst (2022) consolidated the framework. It often surpasses LDA in semantic coherence on small and medium corpora.

Extended definition

BERTopic is a topic modeling algorithm that combines dense neural representations with classical clustering and term-ranking techniques. The standard pipeline has four steps: (1) Embedding: documents are converted to dense vectors with Sentence-Transformers or compatible models (paraphrase-multilingual-MiniLM-L12-v2 for multilingual text, all-MiniLM-L6-v2 for English); (2) Dimensionality reduction: UMAP reduces the embeddings from hundreds of dimensions to 5-10 while preserving local structure; (3) Clustering: HDBSCAN groups the reduced embeddings into clusters of variable density and explicitly labels outliers (documents with no clear cluster); (4) Term ranking: c-TF-IDF (class-based TF-IDF) scores the importance of each word per cluster by comparing its frequency within the cluster to its frequency across all clusters.

Grootendorst (2022, arXiv 2203.05794) consolidated the framework; the Python bertopic package is the dominant implementation. Egger and Yu (2022, Frontiers in Sociology) compared BERTopic with LDA, NMF, and Top2Vec on tweet corpora and reported that BERTopic often achieves higher semantic coherence under automatic evaluation, especially on small to medium corpora.

Advantages over LDA: contextual embeddings (the same term in different contexts gets distinct representations), modularity (each stage can be replaced), multilingual support through multilingual embedding models, and specialized variants (guided topic modeling, topics per class, dynamic topic modeling).
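The term-ranking step can be illustrated without any library. The sketch below follows the c-TF-IDF weight given in Grootendorst (2022), W(t, c) = tf(t, c) · log(1 + A / f(t)), where tf(t, c) is the count of term t in cluster c, f(t) is its count across all clusters, and A is the average number of tokens per cluster. The function names, the whitespace tokenizer, and the toy documents are assumptions made for illustration; the bertopic package applies its own vectorizer and normalization, so its scores will differ.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_cluster):
    """Return {cluster: {term: weight}} with weight = tf(t, c) * log(1 + A / f(t))."""
    # Treat every cluster as one concatenated "class document" and count its tokens.
    # Lowercasing plus whitespace splitting is a deliberately naive tokenizer.
    class_counts = {
        cluster: Counter(tok for doc in docs for tok in doc.lower().split())
        for cluster, docs in docs_by_cluster.items()
    }

    # f(t): frequency of each term summed over all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of tokens per cluster.
    avg_tokens = sum(total_counts.values()) / len(class_counts)

    return {
        cluster: {
            term: tf * math.log(1 + avg_tokens / total_counts[term])
            for term, tf in counts.items()
        }
        for cluster, counts in class_counts.items()
    }


def top_terms(weights, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]


# Toy clusters (hypothetical data): "the" appears in both, so it is down-weighted.
toy_clusters = {
    0: ["the cats purr", "cats chase the mice"],
    1: ["the stocks fell", "stocks and the bonds rallied"],
}
toy_weights = c_tf_idf(toy_clusters)
for cluster_id, weights in toy_weights.items():
    print(cluster_id, top_terms(weights))
```

In this toy input, a word shared by both clusters ("the") receives a lower weight than a word of equal count that is specific to one cluster ("cats"), which is the behavior the local-versus-global comparison is meant to produce.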

When it applies

BERTopic is suited to topic modeling on small and medium corpora (hundreds to tens of thousands of documents), where traditional LDA suffers from insufficient statistics. It handles multilingual corpora through pretrained models such as paraphrase-multilingual-MiniLM. It works on corpora with modern vocabulary (slang, neologisms, technical terms) that LDA does not capture well, and on short documents (tweets, comments, product descriptions) where a classical bag-of-words representation loses context. It fits exploratory analysis pipelines that need interactive visualization, since BERTopic uses a 2D UMAP projection to draw topic maps. It is used in computational social science, digital media, and digital humanities research on contemporary corpora.

When it does not apply

It does not apply to very large corpora (millions of documents) without adjustment: the cost of embedding, UMAP, and HDBSCAN grows with the number of documents n, so scalable alternatives or sampling are needed. It does not apply to very old corpora whose vocabulary modern pretrained models do not cover (18th-century texts, historical records in obsolete spelling); fine-tuning or period-specific models may be needed. It is not a substitute for LDA when cross-study reproducibility is critical, because LDA is more firmly standardized in the bibliometric literature. It does not apply to problems where the topics are defined in advance; supervised classification is the appropriate tool there. In highly specialized domains (technical medicine, jurisprudence), general pretrained models can be inadequate, and domain fine-tuning is needed.
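One way to apply the sampling option mentioned above is to fit the model on a seeded random sample and then assign the remaining documents to the learned topics. The sketch below is an illustration under assumptions, not a recommended recipe: `split_for_fitting` and `fit_on_sample` are hypothetical helper names, the default BERTopic settings are used unchanged, and the bertopic import is deferred so the sampling helper can be used on its own.

```python
import random


def split_for_fitting(docs, sample_size, seed=0):
    """Return (sample, remainder): a seeded random sample and the documents left over.

    Both lists keep the original document order.
    """
    rng = random.Random(seed)
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    chosen = set(indices[:sample_size])
    sample = [doc for i, doc in enumerate(docs) if i in chosen]
    remainder = [doc for i, doc in enumerate(docs) if i not in chosen]
    return sample, remainder


def fit_on_sample(docs, sample_size, seed=0):
    """Fit BERTopic on a sample, then assign topics to the remaining documents."""
    # Imported here so that split_for_fitting works without bertopic installed.
    from bertopic import BERTopic

    sample, remainder = split_for_fitting(docs, sample_size, seed)
    topic_model = BERTopic()  # default settings; tune per corpus
    sample_topics, _ = topic_model.fit_transform(sample)
    # transform() assigns unseen documents to the topics learned from the sample.
    remainder_topics = topic_model.transform(remainder)[0] if remainder else []
    return topic_model, sample_topics, remainder_topics
```

Topics that are rare in the full corpus may be missing from the sample, so the sample size and seed should be reported together with the results.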

Applications by field

Social media analysis: thematic clustering of Twitter, Reddit, and YouTube comments; Egger & Yu (2022) is a canonical example.

Modern bibliometrics: thematic clustering of scientific corpora as an alternative to traditional LDA; integration with VOSviewer.

Customer experience: analysis of reviews, customer feedback, and call-center transcripts.

Digital humanities: analysis of contemporary literary and historical corpora.

Common pitfalls

The first pitfall is treating BERTopic as a black box: each of the four stages (embedding, UMAP, HDBSCAN, c-TF-IDF) has hyperparameters that affect the results, so documented experimentation is needed. The second is not comparing against LDA as a baseline: on some corpora LDA performs as well or better, and reporting the comparison adds rigor. The third is interpreting topics without human validation: BERTopic produces statistical clusters, and their perceived coherence has to be checked by people. The fourth is assuming that a larger embedding model (BERT-large vs. MiniLM) always improves results; in practice, MiniLM models are often sufficient and much faster. The fifth is failing to document the exact embedding model version and the UMAP/HDBSCAN seed: reproducibility requires a complete specification.
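The first and fifth pitfalls can be reduced by keeping every stage's settings in one explicit record that is stored with the results. The sketch below is a minimal illustration: the hyperparameter values are placeholder assumptions rather than recommendations, and `RUN_CONFIG`, `build_topic_model`, `config_record`, and `installed_versions` are hypothetical names. The seed is set on UMAP through its `random_state` argument; HDBSCAN is configured here without a seed argument.

```python
import json
from importlib import metadata

# Illustrative values only; every setting here is an assumption to be tuned and reported.
RUN_CONFIG = {
    "embedding_model": "all-MiniLM-L6-v2",
    "umap": {
        "n_neighbors": 15,
        "n_components": 5,
        "min_dist": 0.0,
        "metric": "cosine",
        "random_state": 42,  # fixes UMAP's stochastic layout
    },
    "hdbscan": {
        "min_cluster_size": 10,
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,
    },
}


def build_topic_model(config):
    """Build a BERTopic instance from the recorded configuration."""
    # Imported lazily so the configuration can be inspected without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(config["embedding_model"]),
        umap_model=UMAP(**config["umap"]),
        hdbscan_model=HDBSCAN(**config["hdbscan"]),
    )


def config_record(config):
    """Serialize the configuration deterministically for storage next to the results."""
    return json.dumps(config, indent=2, sort_keys=True)


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(config_record(RUN_CONFIG))
print(installed_versions(["bertopic", "sentence-transformers", "umap-learn", "hdbscan"]))
```

A run would then call `build_topic_model(RUN_CONFIG)` and `fit_transform` on the corpus, saving the printed record and package versions together with the topic assignments.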
