Design a RAG System at Scale — System Design

Retrieval-Augmented Generation (RAG) is how you make an LLM answer questions over your data — private docs, a knowledge base, last week's tickets — without retraining it. The idea is simple: before calling the model, retrieve the most relevant text and put it in the prompt so the answer is grounded in real sources instead of the model's frozen, fuzzy memory. But "simple" hides a real distributed system: an offline pipeline that turns documents into a searchable vector index, and an online pipeline that, per query, embeds, searches, reranks, assembles a prompt, and generates a cited answer — all fast, fresh, and cheap. This piece designs both ends and the tradeoffs an interviewer will probe.

⚡ Quick Takeaways

RAG is two pipelines: an offline indexing pipeline (load → chunk → embed → index) and an online query pipeline (embed → retrieve → rerank → generate). Design them separately.
Chunking is the highest-leverage knob. Too big and retrieval is noisy; too small and you lose context. Split on structure with overlap, and keep a pointer back to the source.
The vector index uses approximate nearest neighbor (ANN) — HNSW or IVF — trading a little recall for sub-linear search over millions of vectors.
Hybrid retrieval beats pure vector. Combine semantic (embeddings) with keyword (BM25) and fuse the scores; add metadata filters for tenancy and freshness.
Rerank the candidates. A cross-encoder reranks the top ~100 down to the top ~5 you actually put in the prompt — the cheapest big quality win.
Generation must be grounded and cited. Instruct the model to answer only from the retrieved context and to cite chunks, so you can show sources and catch hallucination.
Quality is measured, not assumed. Evaluate retrieval (recall@k) and answer faithfulness/relevance — the "RAG triad" — continuously.

tldr

Offline, you load documents, split them into overlapping chunks, embed each chunk, and upsert the vectors into an ANN index alongside metadata. Online, you embed the query, run hybrid (vector + keyword) search with metadata filters, rerank the candidates with a cross-encoder, assemble the top chunks into a grounded prompt, and have the LLM answer with citations. The hard parts are chunking, hybrid + rerank quality, freshness/incremental indexing, and evaluating that retrieval actually helped.

RAG = two pipelines over one index — offline, documents are chunked, embedded, and upserted; online, a query is embedded, hybrid-retrieved, reranked, and answered with grounded citations

Step 1 — Clarify the Requirements

Scope it: are we building a Q&A assistant over a corpus (support docs, internal wiki, legal contracts)? How big is the corpus, how fresh must it be, and is it multi-tenant? A focused set:

Functional requirements

Ingest documents of mixed formats (PDF, HTML, Markdown, database rows) and keep the index updated as they change.
Answer a natural-language query with a response grounded in retrieved sources, returning citations.
Support metadata filters (tenant, document type, date, access control) on retrieval.
Say "I don't know" when nothing relevant is retrieved rather than hallucinating.

Non-functional requirements

Low query latency — retrieval in tens of milliseconds; end-to-end dominated by the LLM call.
Freshness — new/edited documents searchable within minutes, not a nightly rebuild.
Scale — tens to hundreds of millions of chunks; thousands of queries/sec.
Correctness & safety — never leak one tenant's documents to another; ground answers to limit hallucination.

interview tip

State the framing early: RAG doesn't change the model, it changes the prompt. Every design decision is about getting the right chunks into a finite context window — so retrieval quality, not the LLM, is what you're really engineering.

Step 2 — The Two Pipelines

The cleanest mental model — and the structure interviewers want — is two independent pipelines that meet at the vector index. The indexing pipeline runs offline (and incrementally), turning documents into searchable vectors. The query pipeline runs online per request. Decoupling them means you can re-index, swap embedding models, or re-chunk without touching the serving path, and scale each on its own profile (indexing is throughput-bound and batchy; querying is latency-bound).

Step 3 — Ingestion and Chunking

The indexing pipeline starts by loading and parsing heterogeneous sources into clean text (plus structure — headings, tables, page numbers — which you keep as metadata). Then comes the single most consequential choice in RAG: chunking.

Why chunking decides quality

You embed and retrieve chunks, not whole documents, because an embedding compresses a span of text into one vector — and the longer the span, the blurrier that vector. Chunk too large and a query matches a vaguely-related page, dragging in noise; chunk too small and a chunk loses the context needed to be meaningful. The sweet spot is usually a few hundred tokens, split on natural structure (paragraphs, headings, sections) rather than blind fixed-size cuts, with a small overlap between chunks so a sentence spanning a boundary isn't orphaned.
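To make "split on structure, with overlap" concrete, here is a minimal sketch of a splitter. It is an illustration, not a production chunker: the name `split_with_overlap` is hypothetical, it treats blank-line-separated paragraphs as the only structure, and it counts whitespace-separated words as a rough stand-in for tokens.

```python
import re


def split_with_overlap(text: str, size: int = 400, overlap: int = 60) -> list[str]:
    """Pack paragraphs into chunks of at most `size` words.

    Chunks close at paragraph boundaries where possible, and each new chunk
    starts with the last `overlap` words of the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")

    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        if not words:
            continue
        # Close the current chunk at the paragraph boundary if this paragraph won't fit.
        if current and len(current) + len(words) > size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
        # A paragraph longer than one chunk gets hard-split, still with overlap.
        while len(current) > size:
            chunks.append(" ".join(current[:size]))
            current = current[size - overlap:]

    if current:
        chunks.append(" ".join(current))
    return chunks
```

A real pipeline would count tokens with the embedding model's tokenizer and split on headings as well as paragraphs; the packing-plus-overlap logic stays the same.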

chunk → embed → upsert (indexing)

for doc in source.stream():
    text   = parse(doc)
    chunks = split(text, size=400, overlap=60, on="headings")
    vecs   = embedder.embed(chunks)          # batched on GPU/accelerator
    index.upsert([{
        "id": f"{doc.id}:{i}",              # deterministic id: re-indexing overwrites in place
        "vector": v,
        "text": c, "doc_id": doc.id, "tenant": doc.tenant,
        "updated_at": doc.updated_at        # metadata = filters + citations
    } for i, (c, v) in enumerate(zip(chunks, vecs))])

Note what's stored alongside each vector: the original text (so you can put it in the prompt and cite it) and metadata (tenant, doc id, timestamp) that powers filtering and freshness. The embedding model choice matters too: pick one and use it for both documents and queries, because vectors are only comparable within a single model's space, so changing the model means re-embedding the entire corpus.

Step 4 — The Vector Index

Retrieval is a nearest-neighbor search in embedding space: find the chunk vectors closest (by cosine similarity) to the query vector. Exact search over hundreds of millions of vectors is too slow, so production systems use Approximate Nearest Neighbor (ANN) indexes that trade a sliver of recall for sub-linear query time.
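It helps to see the exact search that ANN approximates. The sketch below is a brute-force cosine top-k in pure Python; the function names and the dict-of-lists layout are illustrative assumptions. It scans every vector, which is exactly the linear cost that stops scaling.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```

An ANN index returns (almost) the same top-k while visiting only a small fraction of the vectors; recall is measured against this exact result.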

ANN index	Tradeoff
HNSW (graph)	Excellent recall/latency, fast queries; higher memory and slower builds. The common default.
IVF / IVF-PQ	Cluster then search a few cells; PQ compresses vectors to cut memory at some recall cost. Good for huge corpora.
Flat (exact)	Perfect recall, brute force — fine for small/medium sets, doesn't scale to 100M+.

Whether you use a dedicated vector DB (Pinecone, Weaviate, Qdrant, Milvus) or pgvector on Postgres, the same scaling concerns apply: vectors are large and the index often lives in memory, so you shard by document/tenant across nodes and replicate for availability and read throughput. A query fans out to shards and merges the top results. Sizing rule of thumb: a 768-dim float32 vector is ~3KB, so 100M chunks ≈ 300GB of raw vectors before index overhead — which is why quantization (PQ, or int8) matters at scale.
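The sizing arithmetic is worth being able to reproduce on a whiteboard. A small sketch follows; the PQ configuration (96 sub-vectors at 8 bits each) is an illustrative assumption, and codebook and graph overhead are ignored.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes for uncompressed vectors (float32 = 4 bytes per component, int8 = 1)."""
    return num_vectors * dim * bytes_per_component


def pq_code_bytes(num_vectors: int, num_subvectors: int, bits_per_code: int = 8) -> int:
    """Bytes for product-quantized codes, excluding the (small) codebooks."""
    return num_vectors * num_subvectors * bits_per_code // 8


GB = 10**9
N, DIM = 100_000_000, 768

float32_gb = raw_vector_bytes(N, DIM) / GB                      # 307.2 GB
int8_gb = raw_vector_bytes(N, DIM, bytes_per_component=1) / GB  # 76.8 GB
pq_gb = pq_code_bytes(N, num_subvectors=96) / GB                # 9.6 GB
```

Each vector is 768 × 4 = 3,072 bytes, which is where the "~3KB" and "≈ 300GB" figures come from.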

Step 5 — Retrieval: Go Hybrid

Pure vector search is great at semantic matching ("how do I reset my password" finds "account recovery steps") but weak at exact terms — error codes, product SKUs, rare names — where the embedding blurs the very token that matters. The fix is hybrid retrieval: run vector search and a classic keyword search (BM25) in parallel, then fuse the rankings (e.g. Reciprocal Rank Fusion). You also apply metadata filters here — tenant and access control (non-negotiable for correctness), plus date or document-type constraints.

Hybrid retrieval — vector (semantic) and BM25 (exact-term) results are fused into ~100 candidates, then a cross-encoder reranks them down to the ~5 chunks that go in the prompt
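Reciprocal Rank Fusion is simple enough to sketch in a few lines. It needs only the rank of each chunk in each list, so vector and BM25 scores never have to be put on the same scale. The function name and the `top_n` cutoff are illustrative; `k = 60` is the smoothing constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 100) -> list[str]:
    """Fuse several ranked lists of chunk ids into one.

    Each list contributes 1 / (k + rank) to a chunk's score (rank starts at 1),
    so chunks ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["c", "a", "d"]` puts `a` and `c` (found by both) ahead of `b` and `d` (found by one).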

Step 6 — Reranking

Retrieval optimizes for recall: cast a wide net and pull ~100 candidate chunks cheaply. But you can only afford to put a handful in the prompt, and the order/precision of those few drives answer quality. So add a reranking stage: a cross-encoder model scores each (query, chunk) pair jointly — far more accurate than the bi-encoder embeddings used for first-stage retrieval, but too expensive to run over the whole corpus, which is exactly why it runs only on the ~100 candidates. This two-stage "retrieve wide, rerank narrow" pattern is the cheapest large quality gain in RAG, and a strong thing to volunteer in an interview.
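The shape of the rerank stage can be sketched independently of any particular model. Here `score_pair` is an assumed callable that scores a (query, chunk) pair jointly; in production it would wrap a cross-encoder, while `toy_overlap_score` below is only a word-overlap stand-in so the example runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [chunk for _, chunk in scored[:top_n]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words present in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0
```

The cost is one model call per candidate, which is why this stage sees ~100 chunks and never the whole corpus.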

Step 7 — Grounded Generation

Now assemble the prompt: the user's question, the top reranked chunks (each tagged with a source id), and an instruction to answer only from the provided context and cite which chunks were used — or say it doesn't know if the context is insufficient. This grounding is what turns an LLM from a confident guesser into a sourced assistant, and the citations let users verify and let you measure faithfulness.

a grounded prompt

# system
Answer ONLY using the context below. Cite sources as [n].
If the context doesn't contain the answer, say you don't know.

# context  (top reranked chunks)
[1] (doc: billing-faq#refunds) "Refunds are issued within 5–7 days…"
[2] (doc: policy-v3#cancel)     "To cancel, go to Settings → Billing…"

# user
How long do refunds take, and how do I cancel?

Two budget decisions live here: how many chunks to include (more context can help but adds tokens, latency, and the risk of the model getting "lost in the middle" of a long context), and what to do on weak retrieval — if the top scores are low, it's better to return "no good answer found" than to force the model to confabulate.
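Both budget decisions fit in one small function. The sketch below assembles the prompt in the layout shown above and returns `None` when retrieval is too weak to answer from. `ScoredChunk`, `max_chunks`, and the `min_score` threshold of 0.3 are illustrative assumptions; the right threshold depends on your reranker's score scale and should be tuned on an eval set.

```python
from dataclasses import dataclass
from typing import Optional

SYSTEM_INSTRUCTION = (
    "Answer ONLY using the context below. Cite sources as [n].\n"
    "If the context doesn't contain the answer, say you don't know."
)


@dataclass
class ScoredChunk:
    source: str   # e.g. "billing-faq#refunds"
    text: str
    score: float  # reranker score, higher is better


def build_prompt(
    question: str,
    chunks: list[ScoredChunk],
    max_chunks: int = 5,
    min_score: float = 0.3,
) -> Optional[str]:
    """Return a grounded prompt, or None if no chunk clears the relevance threshold."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    kept = [c for c in ranked if c.score >= min_score][:max_chunks]
    if not kept:
        return None  # caller should answer "no good answer found" without calling the LLM
    context = "\n".join(
        f'[{n}] (doc: {c.source}) "{c.text}"' for n, c in enumerate(kept, start=1)
    )
    return f"# system\n{SYSTEM_INSTRUCTION}\n\n# context\n{context}\n\n# user\n{question}"
```

Returning `None` early also saves the cost of a generation call that could only confabulate.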

Step 8 — Freshness and Incremental Indexing

A corpus is never static, and a nightly full rebuild is slow and leaves the index up to a day stale. The serving path needs an index that updates continuously, so wire ingestion to change events: a document created/updated/deleted emits an event (via CDC from the source DB or a webhook), which flows through the same chunk → embed → upsert path. A delete removes that document's chunks, and an update must also remove any chunks the new version no longer produces, otherwise orphaned text keeps getting retrieved. The result is an index that's eventually consistent within minutes. Keep an updated_at on every chunk so you can filter to fresh content and reconcile/garbage-collect stale chunks.
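A minimal sketch of an event handler for this path, under stated assumptions: `InMemoryIndex` is a toy stand-in for a vector store, the event shape (`op`, `doc_id`, `text`, `updated_at`) is hypothetical, and `chunker` and `embed` are injected callables. It upserts the new version first, then garbage-collects older chunks of the same document, so an edit that shrinks a document leaves no orphans behind.

```python
from typing import Callable


class InMemoryIndex:
    """Toy stand-in for a vector store that supports upsert and delete-by-filter."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, records: list[dict]) -> None:
        for record in records:
            self.records[record["id"]] = record

    def delete_where(self, predicate: Callable[[dict], bool]) -> None:
        for chunk_id in [cid for cid, rec in self.records.items() if predicate(rec)]:
            del self.records[chunk_id]


def apply_change(index: InMemoryIndex, event: dict, chunker: Callable, embed: Callable) -> None:
    """Apply one create/update/delete event to the index."""
    doc_id, version = event["doc_id"], event["updated_at"]
    is_delete = event["op"] == "delete"
    if not is_delete:
        index.upsert([
            {"id": f"{doc_id}:{i}", "doc_id": doc_id, "text": chunk,
             "vector": embed(chunk), "updated_at": version}
            for i, chunk in enumerate(chunker(event["text"]))
        ])
    # Remove chunks left over from older versions (or every chunk, on delete).
    index.delete_where(
        lambda rec: rec["doc_id"] == doc_id and (is_delete or rec["updated_at"] < version)
    )
```

This sketch assumes events for a document arrive in order; a real consumer would also compare `updated_at` against what is already indexed and drop late, out-of-order events.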

key point

Treat the index as a materialized, replayable projection of your documents — exactly like a search index. The source of truth is the documents; the vector index is derived, so you can always re-chunk or re-embed and rebuild it. That mindset makes embedding-model upgrades and chunking changes a re-index job, not a migration crisis.

Step 9 — Evaluation: the RAG Triad

You can't improve what you don't measure, and RAG fails in two distinct places — retrieval and generation — so evaluate both. The common framing is the RAG triad:

Context relevance — did retrieval return chunks actually relevant to the query? (Measure recall@k / precision against a labeled set.)
Faithfulness (groundedness) — is the answer supported by the retrieved chunks, or did the model make something up?
Answer relevance — does the answer actually address the question?

Build a small golden set of (question → ideal sources/answer) pairs for offline regression, and use an LLM-as-judge to score faithfulness and relevance at scale (see evals for LLM apps). Capture production signals too — thumbs up/down, "was this cited source correct" — and feed them back. A retrieval bug (wrong chunks) and a generation bug (ignored the chunks) need different fixes, and only separate metrics tell them apart.
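The retrieval half of the evaluation is plain arithmetic over the golden set. A minimal sketch, assuming each golden entry pairs a question with the set of chunk ids that should be retrieved, and that `retrieve` is whatever function returns ranked chunk ids for a query (both names are illustrative):

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    golden: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k over a golden set of (question, relevant chunk ids) pairs."""
    if not golden:
        return 0.0
    return sum(recall_at_k(retrieve(q), relevant, k) for q, relevant in golden) / len(golden)
```

Run this in CI against a fixed golden set: a drop after a chunking or embedding change points at retrieval, before any generation metric is involved.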

Step 10 — Scaling and Cost

Three cost centers dominate, and each has a lever:

Cost center	Lever
Embedding (indexing)	Batch on accelerators; only re-embed changed chunks; cache embeddings keyed by content hash.
Index memory	Quantize vectors (PQ / int8); shard across nodes; tier cold data to disk-backed ANN.
Generation (per query)	Fewer, better chunks (rerank) → shorter prompts; cache answers for repeat queries; small model for easy ones.

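On latency, first-stage retrieval is typically tens of milliseconds; cross-encoder reranking of ~100 candidates adds more on top, with the exact cost depending on reranker size and hardware. The LLM call still dominates end-to-end time, so the same streaming and caching tricks from a chat assistant apply. Cache aggressively: query embeddings, retrieval results for hot queries, and full answers for exact repeats. Key retrieval and answer caches by tenant and filters as well as query text, so a cache hit can never cross an access boundary.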
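Caching embeddings by content hash is the simplest of these levers to sketch. The class below is illustrative: `EmbeddingCache`, `embed_fn`, and `model_id` are assumed names, and a dict stands in for whatever shared cache (Redis, a table) you would really use.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by a hash of (model id, text).

    Including the model id in the key means a model upgrade never serves
    vectors from the old embedding space.
    """

    def __init__(self, embed_fn: Callable[[str], list[float]], model_id: str) -> None:
        self._embed_fn = embed_fn
        self._model_id = model_id
        self._store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        payload = f"{self._model_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)  # only unseen content hits the model
        return self._store[key]
```

The same cache serves both pipelines: unchanged chunks skip re-embedding during re-indexing, and repeated queries skip the embedding call online.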

Step 11 — Failure Modes and Tradeoffs

Retrieved the wrong thing → wrong answer. Garbage chunks in, confident garbage out. This is why hybrid + rerank + eval matter more than the LLM choice.
Chunk size is a genuine tradeoff. No single size is optimal; tune it against your eval set, and consider storing small chunks but expanding to their surrounding context at generation time.
Stale index. Without incremental indexing, answers cite deleted or outdated docs — a correctness and trust problem; CDC-driven updates and TTL/GC are the fix.
Lost in the middle. Stuffing many chunks can hurt — models attend less to the middle of a long context; fewer, reranked chunks often beat more chunks.
Tenant leakage. A filter bug that returns another tenant's chunks is a security incident; enforce access filters in the query, not just in the UI.
It still hallucinates. Grounding reduces but doesn't eliminate it; citations + faithfulness evals + "say I don't know" are the guardrails.
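Of these failure modes, tenant leakage is the one worth hard-coding against. A minimal sketch of enforcing the filter server-side: `search_for_tenant` and the `index_search(query_vector, filters, k)` signature are assumptions for illustration, not any particular vector store's API. The tenant id comes from the authenticated session, never from the request body.

```python
from typing import Callable, Optional


def search_for_tenant(
    index_search: Callable[[list[float], dict, int], list[dict]],
    tenant_id: str,
    query_vector: list[float],
    filters: Optional[dict] = None,
    k: int = 100,
) -> list[dict]:
    """Run a search that is always scoped to tenant_id."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    merged = dict(filters or {})
    merged["tenant"] = tenant_id  # overrides any caller-supplied tenant filter
    results = index_search(query_vector, merged, k)
    # Defense in depth: drop anything the index returned for another tenant.
    return [rec for rec in results if rec.get("tenant") == tenant_id]
```

The post-filter looks redundant, but it turns a filter bug in the index layer into missing results rather than a cross-tenant leak.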

takeaway

RAG is a retrieval system with an LLM stapled to the end — so spend your design effort on retrieval. Two pipelines (offline index, online query), structure-aware chunking with overlap, an ANN index you shard and quantize, hybrid (vector + BM25) retrieval with metadata filters, a cross-encoder rerank, and grounded-with-citations generation. Then close the loop with the RAG triad evals, because the only way to know retrieval helped is to measure it.

🎯 interview hot-takes

What is RAG, in one line? Retrieve relevant text and put it in the prompt so the LLM answers from your data — changing the prompt, not the model.
Why chunk, and how? You embed/retrieve spans, and longer spans blur the vector; split on structure (~a few hundred tokens) with overlap, keeping source metadata.
Why hybrid retrieval? Vectors nail semantics but miss exact terms (codes, names); add BM25 and fuse — then rerank the candidates with a cross-encoder.
Why a reranker? First-stage retrieval maximizes recall cheaply; a cross-encoder precisely reorders the top ~100 → top ~5 — the biggest cheap quality win.
How do you keep it fresh? CDC/webhook-driven incremental indexing through the same chunk→embed→upsert path; the index is a replayable projection of the docs.
How do you evaluate it? The RAG triad — context relevance (recall@k), faithfulness, and answer relevance — on a golden set, with LLM-as-judge at scale.