Transformer architecture — Glossary Aria Research

Extended definition

The Transformer architecture was proposed by Vaswani et al. (2017) in the paper “Attention Is All You Need,” as a deliberate break from the recurrent networks (RNNs, LSTMs) that had dominated NLP until then. The central innovation is the self-attention mechanism, which lets the model compute, for each token in a sequence, a weighted combination of the representations of all tokens in that sequence (its own included). This captures long-range dependencies without the sequential bottleneck of RNNs. The fundamental operation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are matrices of queries, keys, and values, and $d_k$ is the key dimension; dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension and saturating the softmax. This operation is replicated across multiple heads (multi-head attention), stacked across multiple layers, and combined with positional encoding to preserve order information, which attention by itself does not see. The original architecture pairs an encoder with a decoder; encoder-only models (such as BERT) and decoder-only models (such as GPT) form the two dominant lineages, geared respectively toward understanding and generation. Because all positions are processed in parallel during training, the architecture maps well onto GPUs and TPUs, which made scaling to billions of parameters practical in a way that was far harder with recurrent architectures.
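As a rough illustration of the formula above, the following minimal sketch computes single-head scaled dot-product attention on small nested lists. It uses only the Python standard library to stay dependency-free; a real implementation would use an array library and batched tensors. The function names `softmax` and `attention` and the toy values of `Q`, `K`, and `V` are assumptions made for this example, not part of the original paper's code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of rows (one row per token). Returns the output rows
    and the attention weights (one row of weights per query token).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy example with 3 tokens and d_k = d_v = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value rows. Multi-head attention would run several such computations in parallel on different learned projections of the input and concatenate the results.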

When it applies

The Transformer is today the standard architecture for most sequence-processing tasks: natural language, code, time series with relevant sequential signal, biological sequences (DNA, and proteins as in AlphaFold), music, and increasingly computer vision (Vision Transformers). For a new NLP project with a reasonable amount of data, the usual starting point is a pre-trained Transformer variant.

When it does not apply

It is not necessarily the best choice for very short sequences with strong local temporal signal, where simple RNNs or classical models may suffice and be cheaper. It is usually not the right choice for purely tabular problems, where gradient boosting (XGBoost, LightGBM) often matches or outperforms neural approaches at lower cost. In production with severe latency or energy constraints (mobile, edge devices), lighter or distilled architectures are usually preferable. The quadratic cost in sequence length ($O(n^2)$ time and memory in standard attention) limits direct use on very long documents without specific variants (Longformer, Performer, sparse attention).
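To make the quadratic term concrete, the sketch below estimates the memory needed to hold the attention score matrices of one layer if they are fully materialized. The function name `attention_matrix_bytes` and the defaults (12 heads, 2 bytes per value as in half precision, batch size 1) are assumptions chosen for illustration; actual memory use depends on the implementation, and some attention kernels avoid storing the full matrix.

```python
def attention_matrix_bytes(seq_len, num_heads=12, bytes_per_value=2):
    """Bytes for the n x n attention score matrices of one layer.

    Assumes one full seq_len x seq_len matrix per head, batch size 1,
    and that the matrices are stored explicitly (illustrative assumption).
    """
    return seq_len * seq_len * num_heads * bytes_per_value

# Memory in MiB for a few sequence lengths (hypothetical configuration).
sizes_mib = {
    n: attention_matrix_bytes(n) / (1024 ** 2)
    for n in (512, 2048, 8192)
}
for n, mib in sizes_mib.items():
    print(f"n={n:5d}: {mib:8.1f} MiB per layer")
```

Under these assumptions the estimate is 6 MiB at 512 tokens, 96 MiB at 2048, and 1536 MiB at 8192: each 4x increase in length costs 16x the memory, which is why long inputs tend to call for the variants named above.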

Applications by field

— NLP in general: the practically universal state of the art since 2018; most relevant tasks have adopted the Transformer.
— Computational biology: AlphaFold 2 and 3, built on attention-based, Transformer-style modules, revolutionized protein structure prediction.
— Computer vision: Vision Transformers (ViT) compete with or surpass CNNs in classification and segmentation.
— Multimodality: models such as CLIP, Flamingo, and GPT-4V combine text and image using Transformer-based architectures.

Common pitfalls

The first pitfall is treating the Transformer as the solution for any supervised learning problem: for tabular data, time series with weak sequential signal, or simple regression, classical models often perform as well or better at a fraction of the cost. The second is ignoring computational and environmental cost: training a large Transformer from scratch can cost millions of dollars in compute and has a documented carbon footprint; fine-tuning a pre-trained model is the viable option in almost all cases. The third is assuming that more parameters always mean better performance: scaling laws show diminishing returns, and small specialized models often outperform large generic ones on the target task. The fourth is neglecting the context limit: standard attention scales quadratically with sequence length, so long documents require sparse-attention variants, chunking, or specialized models. The fifth is trusting generic benchmark metrics (GLUE, SuperGLUE) without evaluating on one’s own domain: a strong score on a saturated benchmark does not guarantee production performance.
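For the fourth pitfall, one simple form of chunking is a sliding window with overlap, sketched below. The function name `chunk_tokens` and the parameters `max_len` and `overlap` are hypothetical names for this example; how the per-chunk outputs are later combined (averaging, voting, taking the maximum) depends on the task and is not shown.

```python
def chunk_tokens(tokens, max_len, overlap):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that context near a
    boundary is not lost entirely. Returns a list of token lists.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once this window has reached the end of the sequence.
        if start + max_len >= len(tokens):
            break
    return chunks

# Example: 10 token ids, windows of 4 tokens sharing 1 token.
chunks = chunk_tokens(list(range(10)), max_len=4, overlap=1)
```

Here `chunks` is `[[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]`. Each window fits within the model's context limit, at the price of the model never attending across windows that do not overlap.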