Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Metadata

Reading status: read complete
Year: 2020
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2020-rag_2020.pdf
Extracted text: 2020-rag_2020.txt
PDF URL: https://arxiv.org/pdf/2005.11401.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly states the device split. The authors train all RAG models and BART baselines in Fairseq using mixed-precision floating point arithmetic, distributed across 8 NVIDIA V100 GPUs with 32GB each. They also note that training and inference can be run on one GPU. The neural generator and trainable query encoder live on GPU, while the document memory is not forced into accelerator memory.

The non-parametric memory is a dense vector index over the December 2018 Wikipedia dump. Wikipedia is split into disjoint 100-word chunks, producing 21M documents. The document encoder computes an embedding for each chunk, and FAISS with a Hierarchical Navigable Small World approximation performs Maximum Inner Product Search. The authors report that FAISS MIPS is sufficiently fast on CPU, so they store the document index vectors on CPU, requiring about 100GB of CPU memory for all Wikipedia. After submission, they compress the FAISS index and reduce the CPU memory requirement to 36GB.
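
The chunking and search steps can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: `chunk_words` and `top_k_mips` are hypothetical helper names, the vectors are hand-written stand-ins for DPR embeddings, and exact brute-force inner-product search replaces the approximate FAISS HNSW index the paper uses.

```python
import numpy as np

def chunk_words(text, size=100):
    """Split text into disjoint chunks of at most `size` whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_k_mips(query_vec, doc_matrix, k):
    """Exact maximum inner product search (assumption: brute force, not HNSW).

    Returns the row indices of the k documents with the largest inner product
    against the query, best first.
    """
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k]

# Toy corpus: 250 words become three chunks of 100, 100, and 50 words.
chunks = chunk_words("word " * 250)

# Toy document vectors (one row per chunk) and a toy query vector.
doc_matrix = np.eye(3, dtype=np.float32)
query_vec = np.array([0.1, 0.9, 0.3], dtype=np.float32)

top2 = top_k_mips(query_vec, doc_matrix, k=2)
retrieved = [chunks[i] for i in top2]
```

At the paper's scale the same lookup runs over 21M vectors, which is why an approximate index held in CPU memory is used instead of a full scan.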

The trainable parameter setup is also explicit. BART-large contributes 406M parameters, and the paper reports 626M trainable parameters in total; the remainder comes from the two BERT-base DPR encoders at 110M each. The reported total therefore counts the document encoder, which is used to build the index but is kept fixed during RAG training, avoiding repeated re-embedding of the 21M-document corpus.
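
The totals can be checked with simple arithmetic. The paper describes the DPR query and document encoders as BERT-base models with 110M parameters each. The 768-dimensional float32 vector size below is an assumption based on the standard BERT-base hidden size, and the resulting figure is only a rough lower bound on raw vector storage: the reported 100GB also covers index structure overhead.

```python
# Parameter counts as reported in the paper.
query_encoder = 110_000_000   # DPR query encoder (BERT-base), fine-tuned
doc_encoder = 110_000_000     # DPR document encoder (BERT-base), kept fixed
bart_large = 406_000_000      # generator, fine-tuned

reported_total = query_encoder + doc_encoder + bart_large
updated_during_training = query_encoder + bart_large

# Rough raw-vector storage for the index.
num_chunks = 21_000_000       # 100-word Wikipedia chunks
dim = 768                     # assumption: BERT-base hidden size
bytes_per_value = 4           # assumption: float32 storage
raw_vector_gb = num_chunks * dim * bytes_per_value / 1e9

print(reported_total, updated_during_training, round(raw_vector_gb, 1))
```

Under these assumptions the raw vectors alone come to roughly 64.5GB, which is consistent with an uncompressed index of about 100GB once graph structure is added.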

Bottleneck

The bottleneck is factual memory under accelerator constraints. A closed-book seq2seq model has to compress knowledge into its parameters, so updating facts means more training, and covering more facts means serving a larger model. But storing all of Wikipedia as dense vectors on GPU would be wasteful, and retraining the document encoder would force periodic index rebuilds over 21M chunks. The device tradeoff is clear: use CPU RAM for the large external memory and reserve GPU memory and compute for the query encoder and generator.

There is also a test-time bottleneck. RAG retrieves top-K documents and marginalizes generation over them, either once per sequence or at each token. Increasing K can improve answer recall and factuality, but it adds CPU search, host-device movement, larger encoder inputs, and more generator work. The method therefore sits between parametric scaling and classical retrieval: its performance depends on retrieval quality, but its latency depends on how many retrieved passages the generator conditions on.

Method Adaptation

RAG adapts a seq2seq Transformer to this heterogeneous memory hierarchy. The retriever is DPR-style: a query encoder maps the input into a dense vector, and FAISS searches the fixed document index for top-K Wikipedia chunks. The generator is BART-large. For each retrieved document, the input and document text are concatenated, and BART generates an output conditioned on that pair. RAG-Sequence assumes one latent document explains the whole output; RAG-Token allows different retrieved documents to explain different generated tokens.
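
The difference between the two marginalizations can be made concrete with a small numerical sketch. The array names and toy probabilities are illustrative assumptions: `doc_prior[z]` stands for the retriever's probability of document z given the input, normalized over the top-K, and `token_probs[z, i]` stands for the generator's probability of the i-th target token given the input, document z, and the preceding target tokens.

```python
import numpy as np

def rag_sequence_prob(doc_prior, token_probs):
    """RAG-Sequence: sum over documents of p(z|x) * product over tokens."""
    per_doc_sequence_prob = np.prod(token_probs, axis=1)
    return float(np.sum(doc_prior * per_doc_sequence_prob))

def rag_token_prob(doc_prior, token_probs):
    """RAG-Token: product over tokens of the document-marginalized token probability."""
    per_token_marginal = doc_prior @ token_probs
    return float(np.prod(per_token_marginal))

# Toy case: two documents, two target tokens. Document 0 supports the first
# token well and document 1 supports the second.
doc_prior = np.array([0.5, 0.5])
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

seq = rag_sequence_prob(doc_prior, token_probs)  # 0.5*0.09 + 0.5*0.09
tok = rag_token_prob(doc_prior, token_probs)     # 0.5 * 0.5
```

In this toy case RAG-Token assigns the target a higher probability (0.25 versus 0.09) because each token can draw on a different document. A practical implementation would work with log-probabilities rather than raw products to avoid underflow.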

The most important engineering choice is freezing the document encoder. The authors fine-tune the query encoder and BART generator, but not the document encoder used to construct the index. This keeps the CPU index valid during training. It sacrifices some end-to-end adaptation, but it avoids rebuilding and redistributing a 21M-vector index whenever the document representation changes.

The top-K marginalization is also a batching adaptation. Retrieved documents create multiple generator contexts for the same input, but those contexts can be processed by the same BART model under mixed precision on V100s. Retrieval is fast enough on CPU to be included in the training and inference path, which is what lets the method use much more factual memory than would fit naturally in model weights.
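
The batching pattern can be sketched as flattening B inputs with K retrieved documents each into B*K generator contexts, running the generator once over the flat batch, and regrouping the results per input for marginalization. The function names, the separator string, and the document-then-input ordering are assumptions for illustration; the paper only states that the input and the retrieved text are concatenated.

```python
def build_generator_inputs(queries, retrieved_docs, sep=" // "):
    """Flatten B queries x K documents into B*K concatenated generator inputs.

    `sep` and the document-first ordering are illustrative assumptions.
    """
    flat = []
    for query, docs in zip(queries, retrieved_docs):
        for doc in docs:
            flat.append(doc + sep + query)
    return flat

def regroup(flat_items, k):
    """Regroup a flat list of B*K items into B lists of K items each."""
    return [flat_items[i:i + k] for i in range(0, len(flat_items), k)]

queries = ["q1", "q2"]
retrieved_docs = [["d1a", "d1b", "d1c"], ["d2a", "d2b", "d2c"]]  # K = 3

flat_inputs = build_generator_inputs(queries, retrieved_docs)
grouped = regroup(flat_inputs, k=3)
```

Generator work therefore scales with B*K, which is the source of the latency cost of raising K.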

Evidence

The benchmark evidence ties directly to the memory split. On open-domain QA, RAG-Sequence reaches 44.5 exact match on Natural Questions, compared with DPR at 41.5 and REALM at 40.4 in the same table. RAG-Token reaches 44.1 on NQ and 45.5 on WebQuestions, while RAG-Sequence reaches 52.2 on CuratedTrec. These gains are achieved with far fewer trainable parameters than closed-book T5-11B systems; the source text contrasts RAG's 626M trainable parameters with an 11B-parameter parametric-only model.

For generation, RAG improves factuality rather than only extractive accuracy. On Jeopardy question generation, human evaluators judged BART more factual than RAG in only 7.1% of paired cases, while RAG was more factual in 42.7%; both were judged factual in a further 11.7% and both poor in 17.7%. RAG also improves diversity: the ratio of distinct to total trigrams for Jeopardy generation is 46.8% for RAG-Token and 53.8% for RAG-Sequence, versus 32.4% for BART.

The clearest evidence for CPU memory as an editable device primitive is the world-leader index swap. With matching index and query year, RAG answers 70% correctly for 2016 leaders and 68% for 2018 leaders. With mismatched indices, accuracy drops to 12% (2018 index, 2016 leaders) and 4% (2016 index, 2018 leaders). Knowledge can therefore be updated by replacing the non-parametric memory, without retraining BART.

Historical Effect

RAG made retrieval an explicit compute-device primitive for neural generation. Instead of treating factual knowledge as something that must be stored in accelerator-resident parameters, it separated a large CPU memory from a GPU generator and trained the model to condition on retrieved text. That separation shaped later LLM systems: vector databases, retrieval pipelines, and generator reranking became ordinary ways to spend CPU memory and inference latency instead of only increasing parameter count.

The historical effect is not just "retrieval improves QA." The paper demonstrated a practical architecture where a medium-size generator plus a huge external memory could compete with much larger closed-book models on knowledge-intensive tasks. It also showed that knowledge freshness is a systems property: swapping an index can change answers without changing neural weights.

Limits

The device split creates its own limits. CPU FAISS search is reported as sufficiently fast, but top-K retrieval still adds latency and an additional memory subsystem to every query. The 100GB uncompressed index, or 36GB compressed index, is smaller than a huge GPU-resident model but still a substantial serving object. Keeping the document encoder fixed avoids rebuilds, but it also prevents the document representation from adapting to downstream generation losses.

Quality remains retrieval-bound. If the relevant passage is absent, misranked, stale, or not in Wikipedia, the generator cannot reliably recover. The paper also notes that some MS-MARCO questions cannot be answered from Wikipedia alone. RAG reduces hallucination relative to BART in the reported tasks, but it does not eliminate it, and it moves some failure modes from neural generation into corpus coverage, chunking, indexing, and retrieval calibration.