RAG (Retrieval-Augmented Generation)

Giving a language model access to external knowledge at the moment it answers a question, instead of relying solely on what it memorised during training.


What is it?

A large language model knows only what was in its training data. That data has a cutoff date, may contain errors, and cannot include private or proprietary information that was never published. When you ask an LLM a question that requires knowledge beyond its training — yesterday’s news, your company’s internal policies, the contents of a specific document — it has two options: admit it does not know, or guess. The guessing is what produces hallucinations.1

Retrieval-Augmented Generation (RAG) solves this by adding a retrieval step before generation. Instead of answering from memory alone, the system first searches an external knowledge source (a database, a document collection, a knowledge graph, a search engine), retrieves the most relevant information, and then feeds that information to the LLM alongside the original question. The model generates its answer grounded in the retrieved evidence rather than relying on what it memorised.2

The concept was introduced by Lewis et al. at Facebook AI Research in 2020. Their key insight was the distinction between parametric knowledge (what the model has encoded in its neural network weights during training) and non-parametric knowledge (what can be looked up at query time from an external source). RAG combines both: the model uses its parametric knowledge for language understanding and reasoning, and non-parametric knowledge for factual grounding.3

The parent concept, llm-pipelines, frames RAG as the most common tool-use pattern — a pipeline stage that extends the model beyond text-in, text-out by connecting it to external data sources. RAG is not a single tool call; it is a multi-step pipeline in itself: retrieve, augment, generate.

In plain terms

RAG is like an open-book exam. Without RAG, the LLM takes a closed-book exam — it can only use what it memorised. With RAG, the LLM gets to look things up in a reference book before answering. The answer is still in the model’s own words, but the facts come from the reference material.



How does it work?

RAG operates as a three-stage pipeline. Each stage has a distinct job, and the output of each becomes the input of the next.

1. Retrieve — find the relevant information

The retrieval stage takes the user’s question and searches a knowledge base for the most relevant documents, passages, or data points. This is the “R” in RAG, and it is where the system connects to external knowledge.2

The knowledge base can be anything searchable: a vector database of document embeddings, a traditional search index, a knowledge graph, a SQL database accessed via an API, or even a live web search engine. What matters is that the retrieval mechanism returns content that is relevant to the question.4

The most common approach uses vector similarity search. Documents in the knowledge base are converted into numerical representations (embeddings) that capture their meaning. The user’s question is also converted into an embedding. The system then finds the documents whose embeddings are closest to the question’s embedding — the ones most semantically similar.4
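
The idea can be sketched in a few lines. The three-dimensional "embeddings" and document names below are invented for illustration — a real system uses an embedding model producing hundreds of dimensions and a vector database rather than a Python dict:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 = similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" standing in for real model output.
documents = {
    "HR Policy v4.2":   [0.9, 0.1, 0.0],
    "Benefits FAQ":     [0.7, 0.3, 0.1],
    "Engineering Wiki": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of the user's question

# Retrieve the k documents whose embeddings are closest to the query's.
ranked = sorted(documents,
                key=lambda name: cosine_similarity(query, documents[name]),
                reverse=True)
top_k = ranked[:2]
print(top_k)  # the two HR-related documents outrank the unrelated one
```

The ranking falls out of geometry alone: documents about the same topic as the question point in roughly the same direction, so no keyword matching is needed.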

Think of it like...

A librarian who hears your question, goes into the stacks, and comes back with the three most relevant books opened to the right pages. The librarian does not answer the question — they find the material that contains the answer.

2. Augment — combine the question with the evidence

The augmentation stage takes the retrieved documents and combines them with the original question into a single prompt that the LLM will process. This is where the “A” in RAG happens: the model’s context is augmented with external knowledge.2

A typical augmented prompt looks like this:

Based on the following documents, answer the user's question.

[Document 1: HR Policy v4.2, Section 12...]
[Document 2: Employee Handbook, Chapter 8...]
[Document 3: Benefits FAQ, Question 47...]

Question: What is our parental leave policy?

The augmentation step is where context-cascading meets RAG. The retrieved documents become part of the task context — a dynamically assembled layer that changes with every question. The quality of this step depends on how well the retrieved documents are selected, ranked, and formatted.5

Think of it like...

The librarian places the relevant books on your desk, open to the right pages, and says “Here is what I found — now answer your question using these.” The books are the augmentation; your reading and answering is the generation.

Key distinction

Augmentation is not just concatenation. Good augmentation involves ranking documents by relevance, truncating to fit the model’s context window, and formatting the context so the model can easily distinguish source material from the question. Poor augmentation dumps irrelevant text into the prompt and degrades the answer.
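
That ranking-and-truncation logic can be sketched as below. The `max_chars` budget and the document formatting are illustrative choices; production systems budget in tokens rather than characters:

```python
def build_augmented_prompt(question, ranked_docs, max_chars=2000):
    """Assemble the augmented prompt: highest-ranked documents first,
    truncated to a size budget, clearly separated from the question."""
    parts, used = [], 0
    for i, (title, text) in enumerate(ranked_docs, start=1):
        block = f"[Document {i}: {title}]\n{text}\n"
        if used + len(block) > max_chars:
            break  # drop lower-ranked documents rather than overflow the window
        parts.append(block)
        used += len(block)
    return ("Based on the following documents, answer the user's question.\n\n"
            + "\n".join(parts)
            + f"\nQuestion: {question}\n")

prompt = build_augmented_prompt(
    "What is our parental leave policy?",
    [("HR Policy v4.2", "Parental leave is 16 weeks..."),
     ("Employee Handbook", "Leave requests go through your manager...")],
)
```

Because documents are consumed in rank order, running out of budget always sacrifices the least relevant material first.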

3. Generate — answer using the retrieved context

The generation stage is where the LLM produces its response. The model receives the augmented prompt — the question plus the retrieved evidence — and generates an answer grounded in that evidence. This is the “G” in RAG.2

The model is not just regurgitating the retrieved text. It is synthesising, summarising, and reasoning over the evidence to produce a coherent answer in natural language. It uses its parametric knowledge (language understanding, reasoning ability, writing skill) combined with the non-parametric knowledge (the retrieved documents) to produce an answer that is both fluent and factually grounded.3

When RAG works well, the model can cite its sources: “According to HR Policy v4.2, parental leave is 16 weeks…” This traceability is one of RAG’s most valuable properties — the user can verify the answer against the original source.1
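
Putting the three stages together, a minimal end-to-end loop might look like the following sketch. The word-overlap scoring and the echo-style `echo_llm` are stand-ins for a real embedding search and a real model call:

```python
def relevance(question, doc):
    """Toy relevance score: word overlap (a stand-in for embedding similarity)."""
    return len(set(question.lower().split()) & set(doc["text"].lower().split()))

def rag_answer(question, knowledge_base, llm, top_k=2):
    # 1. Retrieve: rank the knowledge base and keep the top-k documents.
    docs = sorted(knowledge_base, key=lambda d: relevance(question, d),
                  reverse=True)[:top_k]
    # 2. Augment: combine the question and the evidence into one prompt.
    context = "\n".join(f"[{d['title']}] {d['text']}" for d in docs)
    prompt = ("Based on the following documents, answer the question.\n\n"
              f"{context}\n\nQuestion: {question}")
    # 3. Generate: the model answers grounded in the retrieved evidence.
    return llm(prompt)

def echo_llm(prompt):
    """Stand-in for a real model call: just returns the first piece of evidence."""
    for line in prompt.splitlines():
        if line.startswith("["):
            return line
    return "I don't know."

kb = [
    {"title": "HR Policy v4.2", "text": "Parental leave is 16 weeks."},
    {"title": "Engineering Wiki", "text": "Deploys happen on Tuesdays."},
]
answer = rag_answer("How many weeks of parental leave do we get?", kb, echo_llm)
```

Swapping `relevance` for an embedding search and `echo_llm` for an actual LLM call turns this skeleton into a working RAG system; the pipeline shape stays the same.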


Parametric vs non-parametric knowledge

This distinction is central to understanding why RAG exists and what it solves.3

|  | Parametric knowledge | Non-parametric knowledge |
| --- | --- | --- |
| Where it lives | Encoded in the model's weights | Stored in an external source |
| When it was learned | During training (has a cutoff date) | Retrieved at query time (always current) |
| Can it be updated? | Only by retraining (expensive) | Yes, by updating the knowledge base (cheap) |
| Scope | Whatever was in the training data | Whatever you put in the knowledge base |
| Example | "Python is a programming language" | "Our Q3 revenue was CHF 2.4 million" |

RAG’s power comes from combining both. The model uses parametric knowledge to understand language, reason, and generate fluent text. It uses non-parametric knowledge to ground its answers in specific, current, verifiable facts.3

Key distinction

Retraining a model to update its knowledge is like reprinting an entire textbook to fix one fact. RAG is like giving the textbook reader access to a search engine — the textbook stays the same, but the reader can always look up the latest information.


Why RAG reduces hallucination

LLMs hallucinate when they generate plausible-sounding text that is not grounded in fact. This happens most often when the model is asked about topics outside its training data or when it lacks confidence and fills the gap with statistically likely but factually wrong text.1

RAG reduces hallucination by giving the model evidence to anchor its response. When the prompt contains the actual text of a policy document, the model is far less likely to invent a policy. The retrieved context acts as a constraint — it narrows the space of plausible answers from “anything that sounds right” to “anything supported by this evidence.”5

RAG does not eliminate hallucination entirely. The model can still misinterpret the retrieved documents, ignore relevant passages, or generate answers that go beyond what the evidence supports. But empirical results consistently show that RAG systems produce significantly fewer factual errors than models operating without retrieval.1


Why do we use it?

Key reasons

1. Access to current information. Training data has a cutoff date. RAG lets the model answer questions about events, documents, and data that did not exist when it was trained, by retrieving current information at query time.1

2. Reduced hallucination. By grounding answers in retrieved evidence, RAG significantly reduces the frequency of the model generating plausible but false information. The evidence acts as a factual constraint.5

3. Verifiable answers. RAG enables citation and source attribution. The user can trace the answer back to the original document, which builds trust and enables fact-checking.2

4. Domain-specific knowledge without retraining. Instead of fine-tuning or retraining a model on proprietary data (expensive and time-consuming), you can give it access to that data through retrieval. Update the knowledge base and the model immediately has access to the new information.4


When do we use it?

  • When the LLM needs to answer questions about private or proprietary data (company policies, internal documentation, customer records)
  • When factual accuracy matters and hallucination is unacceptable (legal, medical, financial contexts)
  • When the required knowledge changes frequently and retraining is impractical
  • When users need to verify answers against original sources
  • When building documentation chatbots, enterprise search, or knowledge assistants
  • When augmenting a general-purpose model with domain-specific expertise

Rule of thumb

If the answer to a question exists in a specific document or database and the LLM does not have that document in its training data, RAG is the right pattern. If the question requires general reasoning or creative generation rather than factual recall, RAG adds complexity without much benefit.


How can I think about it?

The open-book exam

RAG is an open-book exam for an LLM.

  • In a closed-book exam (standard LLM), the student relies entirely on memory. If they studied the right material, they answer well. If not, they guess — and the guesses sound confident but may be wrong.
  • In an open-book exam (RAG), the student can look up information in reference materials during the test. They still need to understand the question, find the right pages, and synthesise an answer — but the facts come from the book, not from memory.
  • The retrieval step is flipping to the right chapter. The augmentation step is reading the relevant passages. The generation step is writing the answer in your own words based on what you read.
  • The quality of the exam depends on both the student’s skill (the model) and the quality of the reference books (the knowledge base). A brilliant student with a bad textbook still struggles.

The research assistant

RAG is like having a research assistant who works alongside you.

  • You ask a question: “What were the key findings of the 2025 climate report?”
  • Your assistant goes to the library (retrieval), finds the report, and brings back the relevant sections
  • They place the excerpts on your desk with the question at the top (augmentation)
  • You read the excerpts and write a summary in your own words (generation)
  • If someone challenges your summary, you can point to the original report as your source (citation and verifiability)
  • Without the assistant, you would have to answer from memory — and if you never read the report, you would be guessing

The assistant does not write the summary. They ensure you have the right materials to write an accurate one. That is exactly what the retrieval step does for the LLM.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| knowledge-graphs | The structured data sources that RAG systems can retrieve from | complete |
| json | The data format commonly used to store and exchange retrieved knowledge | complete |
| apis | The interfaces through which RAG systems query external data sources | complete |
| databases | The storage systems that serve as knowledge bases for retrieval | stub |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.



Where this concept fits

Position in the knowledge graph

graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    KG[Knowledge Graphs] -.->|prerequisite| RAG
    JSON[JSON] -.->|prerequisite| RAG
    style RAG fill:#4a9ede,color:#fff

Related concepts:

  • apis — RAG systems use APIs to query external knowledge sources (search engines, databases, document stores)
  • databases — the storage layer that holds the knowledge base a RAG system retrieves from
  • machine-readable-formats — retrieved data must be in a format the system can parse and inject into the prompt
  • context-cascading — the augmentation step in RAG is a dynamic form of context loading, assembling task-specific context at query time


Footnotes

  1. AWS. (2026). What is RAG? Retrieval-Augmented Generation AI Explained. Amazon Web Services.

  2. AI Agents Plus Editorial. (2026). RAG Explained: Complete Retrieval Augmented Generation Guide 2026. AI Agents Plus.

  3. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

  4. AI Engineer Lab. (2026). RAG Architecture Explained for Engineers. AI Engineer Lab.

  5. PE Collective. (2026). RAG Architecture: How to Build Retrieval-Augmented Generation Systems. PE Collective.