AI & MACHINE LEARNING

RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation: an architecture combining retrieval over an external document base with a language generation model. The current standard for answering questions with documentary grounding and reducing hallucination in LLMs.

Extended definition

RAG (Retrieval-Augmented Generation) is an architecture combining two components: a retrieval system that finds relevant documents given a query, and a language generation model that produces the answer conditioned on the retrieved documents. The canonical formalization is Lewis et al. (2020, NeurIPS), which showed significant gains in open-domain question-answering tasks by combining a dense retriever with a generation model. Typical operation: the query is converted to an embedding vector and compared by cosine similarity against a pre-indexed document base (FAISS, Qdrant, Weaviate, Pinecone); the top-k most similar documents are then retrieved and injected as context into the prompt of the generator (the LLM). Karpukhin et al. (2020) is a parallel reference for dense passage retrieval, a common component in modern RAG pipelines. The architecture's central motivation is twofold: it lets LLMs answer questions about knowledge that postdates the training cutoff or is specific to a domain, without fine-tuning, and it substantially reduces hallucination by anchoring the answer in retrieved factual text.
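The typical operation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store such as FAISS or Qdrant, and all function names, the sample documents, and the prompt wording are assumptions made for the example. The call to the generator itself is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, passages):
    """Inject the retrieved passages, numbered, as context for the generator."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered passages below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical document base, embedded once ahead of time (the "pre-indexed" step).
DOCS = [
    "FAISS is a library for similarity search over dense vectors.",
    "Fine-tuning updates model weights on domain data.",
    "A retriever returns the passages most similar to the query.",
]
INDEX = [(doc, embed(doc)) for doc in DOCS]

QUERY = "Which library does similarity search over vectors?"
top = retrieve(QUERY, INDEX, k=2)
prompt = build_prompt(QUERY, top)  # this string would be sent to the LLM
```

Numbering the passages in the prompt is one simple way to let the generator refer back to its sources; a production system would replace `embed` with a trained dense encoder and the list scan with an approximate nearest-neighbor index.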

When it applies

RAG is appropriate when the system must answer questions based on an updated or domain-specific knowledge corpus — scientific literature databases, case law, technical documentation, institutional records, product manuals. It is standard in enterprise chatbots, semantic search tools over PDF libraries, research assistants, and Q&A systems over documentation. It is also the choice when fine-tuning cost is prohibitive or when content must be updatable without retraining.

When it does not apply

RAG does not apply when the knowledge is already in the pre-trained model and the retrieval gains do not justify the architectural overhead. Nor does it apply when retrieval degrades quality, as in domains where the generative model already has strong coverage and retrieved context only adds noise. It does not replace pre-training or fine-tuning when the problem is understanding domain-specific vocabulary or structure: RAG adds factual knowledge, not processing capability. At very large scale (millions of queries per day), the combined latency of retrieval and generation can be infeasible without aggressive optimization. Finally, RAG does not solve bias or hallucination when the retrieved corpus itself is flawed: anchoring in poor sources produces confident wrong answers.

Applications by field

Academic research and bibliometrics: Q&A systems over scientific literature, paper summarization, semantic search in specialized corpora.

Law: queries over up-to-date case law, precedent search, contract analysis against a regulatory base.

Health: clinical decision support based on recent biomedical literature, search over updated guidelines.

Enterprise: product assistants based on internal documentation, onboarding tools, technical knowledge bases.

Common pitfalls

The first pitfall is relying on RAG without evaluating retrieval quality: a retriever with 60% recall misses 40% of the relevant documents, and generated answers reflect this gap. Separate evaluation of the retriever component is essential. The second is ignoring the computational and storage cost of vector indexing: dense embeddings for a large corpus have non-trivial costs in FAISS, Qdrant, or Pinecone. The third is trusting that RAG eliminates hallucination: when the retrieved corpus is insufficient or inappropriate, the generation model still produces confident, factually incorrect answers. The fourth is mixing embeddings from different models in the same index: vectors are not comparable across models, and search produces meaningless results. The fifth is failing to cite retrieved sources in the final answer: for academic and legal applications citation is mandatory, and while the RAG architecture facilitates it, doing so requires explicit engineering.
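Two of these pitfalls lend themselves to small safeguards in code. The sketch below shows one way to measure retriever recall separately from generation (recall@k over a labeled query set) and one way to keep vectors from different embedding models out of the same index. Everything here is illustrative: the function names, the `VectorIndex` class, the document IDs, and the model names are hypothetical, and the labeled evaluation set is invented for the example.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document IDs found among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.items = []

    def add(self, doc_id, vector, model_name):
        # Vectors from different models are not comparable, so refuse to mix them.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}"
            )
        self.items.append((doc_id, vector))


# Invented evaluation set: ranked retriever output and the labeled relevant IDs.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant documents retrieved
    (["d4", "d9", "d2"], {"d2", "d5"}),  # one of two relevant documents retrieved
]
score = mean_recall_at_k(runs, k=3)  # (1.0 + 0.5) / 2 = 0.75

index = VectorIndex(model_name="encoder-a")  # hypothetical model name
index.add("d1", [0.1, 0.9], model_name="encoder-a")
```

Storing the model name with the index is a design choice made for this sketch; the same check can equally live in ingestion code or in index metadata, as long as a mismatch fails loudly instead of returning meaningless neighbors.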
