AI & MACHINE LEARNING

BERT

Pre-trained language model based on the Transformer architecture, developed by Google in 2018. Trained by *masked language modeling*, BERT established the pre-training + fine-tuning paradigm that dominated natural language processing until the generative LLM era.

Extended definition

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model proposed by Devlin et al. (2018) at Google AI. Its central innovation is bidirectional pre-training via masked language modeling: during training, 15% of the tokens in each input sequence are selected for prediction (most of them replaced by a [MASK] token), and the model learns to recover them from context to the left and right simultaneously. This overcomes a limitation of prior autoregressive models, which condition on context in only one direction. The original model was also trained with a secondary next-sentence-prediction objective. The model is built on the Transformer architecture (Vaswani et al., 2017), using only the encoder component. After pre-training on large corpora (English Wikipedia + BookCorpus in the original version), BERT is fine-tuned for specific tasks (sentence classification, token classification, sentence-pair classification, question answering) with a small additional head trained on task data. This pre-training + fine-tuning paradigm became standard in NLP between 2018 and 2022.
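
The masking step can be illustrated with a minimal sketch. It assumes the 80/10/10 replacement rule described by Devlin et al. (of the selected positions, 80% become [MASK], 10% a random token, 10% stay unchanged) and works on plain string tokens rather than WordPiece ids. The function name `mask_tokens` and the toy vocabulary are hypothetical, and this is not the original implementation.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed on selected positions.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL_TOKENS]
    if not candidates:
        return corrupted, labels

    # Select roughly mask_prob of the candidate positions (at least one).
    n_select = max(1, round(len(candidates) * mask_prob))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Usage with a toy sequence and a hypothetical three-word vocabulary.
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "hat"], rng=random.Random(0))
```

Keeping some selected tokens unchanged or randomized means the model cannot rely on [MASK] being present, which matters because [MASK] never appears at fine-tuning time.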

When it applies

BERT and its variants (RoBERTa, DistilBERT, BERTimbau for Portuguese) remain appropriate for text classification and information extraction tasks, especially when labeled data is available in modest amounts (hundreds to thousands of examples) and efficient inference is needed. They are also a common choice for NER (named entity recognition), domain-specific sentiment classification, sentence similarity, and extractive question answering. Derived models such as Sentence-BERT are standard for generating the sentence embeddings used in semantic search.
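
As an illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query. It assumes the embeddings have already been produced by a Sentence-BERT-style encoder. The three-dimensional vectors are toy values invented for the example, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_index, score) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Toy embeddings standing in for real encoder output (assumption).
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1]]
query = [1.0, 0.0, 0.0]
results = rank_by_similarity(query, docs, top_k=2)
```

With real embeddings of several hundred dimensions and large collections, an approximate nearest-neighbor index would usually replace this exhaustive loop.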

When it does not apply

BERT is not the right choice for free-form text generation: it is encoder-only and is not trained to produce coherent continuations. For generation, GPT and other autoregressive models are appropriate. Nor is it the best option when the problem requires chain-of-thought reasoning or following complex instructions; instruction-tuned LLMs (GPT-4, Claude, Llama) outperform BERT on these tasks by a substantial margin. In domains without labeled data or with highly specific vocabulary, generic BERT degrades noticeably; in those cases, domain-adaptive pre-training or domain-specific models (BioBERT, SciBERT, Legal-BERT) are usually preferable.

Applications by field

- Academic NLP research: corpus analysis, document classification, relation extraction.
- Health: BioBERT, ClinicalBERT for information extraction from medical records and biomedical literature.
- Law and social sciences: case law classification, argument extraction, parliamentary discourse analysis.
- Bibliometrics: automatic article classification, topic detection, citation analysis.

Common pitfalls

The first pitfall is using a generic-domain pre-trained BERT for a specialized task without adaptation; performance can then fall below that of classical models with well-designed features. The second is ignoring computational cost: BERT-base has 110M parameters and BERT-large 340M, and inference is slow without a GPU; distilled alternatives (DistilBERT) or smaller models (TinyBERT) are often preferable in production. The third is treating embeddings extracted from BERT as fixed, universal word vectors: they are contextual, so the same word produces different vectors in different sentences. The fourth is assuming performance equivalent to that reported in papers: benchmarks such as GLUE have saturated, but real tasks with noisy, multilingual, or rare-domain text show significant drops. The fifth is not accounting for the 512-token limit: long documents require a chunking strategy or long-context models such as Longformer.
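
One common chunking strategy for the 512-token limit is a sliding window with overlap. The sketch below is a minimal version under stated assumptions: the input is a list of token ids already produced by a tokenizer, the default `cls_id=101` and `sep_id=102` correspond to the `bert-base-uncased` vocabulary and would differ for other checkpoints, and the function name and `stride` default are hypothetical.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=101, sep_id=102):
    """Split token ids into overlapping windows wrapped in [CLS] ... [SEP].

    stride is the number of tokens shared between consecutive windows.
    """
    body_len = max_len - 2  # reserve two positions for [CLS] and [SEP]
    if body_len <= 0:
        raise ValueError("max_len must be greater than 2")
    if not 0 <= stride < body_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len - 2")

    step = body_len - stride
    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body_len]
        chunks.append([cls_id] + window + [sep_id])
        if start + body_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return chunks


# Usage: a hypothetical 1200-token document becomes three windows.
document_ids = list(range(1000, 2200))
chunks = chunk_token_ids(document_ids)
```

The overlap reduces the chance that an entity or answer span is cut at a window boundary. Per-chunk predictions still have to be aggregated into a document-level result, for example by averaging class scores, and the right aggregation depends on the task.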
