ML & AI Engineering Interviews · 11 min read

ML System Design Interview: A RAG Pipeline, End to End, in 45 Minutes

How to design a retrieval-augmented generation system at L5/L6 depth without losing the room.


ML system design interviews look like classic system design interviews wearing a transformer hat. The framework is the same — requirements, capacity, API, data, components, deep dives, tradeoffs — but the components are unfamiliar to candidates who learned distributed systems before LLMs were a thing. This article walks through the canonical question — design a retrieval-augmented generation system — at L5/L6 depth, with the actual numbers, the actual component choices, and the actual evaluation strategy that interviewers at AI-first companies are probing for.

The prompt: 'Design a RAG system for product documentation Q&A'

The interviewer drops this prompt and watches the first thirty seconds. Weak candidates jump to vector databases. Strong candidates ask about the corpus size, the freshness requirement, the latency budget, and the failure mode for an answer the system is unsure of. The questions you ask in the first three minutes are most of the rubric for an ML system design loop.

For this walkthrough assume: 200,000 documents, average 20 KB each (≈4 GB raw text); updates daily; P95 user-visible latency under 2 seconds; English only; cited sources required; budget ≤ $0.02 per query at expected 50 QPS peak. These constraints drive every component choice that follows.

The high-level pipeline

A production RAG pipeline has two halves — an offline indexing pipeline and an online query pipeline — connected by a vector index. Drawing both halves on the board immediately is a senior signal. Most candidates draw only the online half and miss half the rubric.

  • Offline: ingest → clean → chunk → embed → upsert into vector index. Runs nightly or on document change.
  • Online: user query → embed → retrieve top-k → re-rank → assemble context → call LLM → cite sources → return.
  • Cross-cutting: feature store for query/response logs, eval harness on a held-out gold set, monitoring on retrieval quality and generation quality.

Walk both halves before drilling into any component. The interviewer needs to see that you understand the index is not a magic box — it is the contract between two pipelines that have very different latency and cost profiles.
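The two halves can be sketched end to end with toy stand-ins. The `embed` function below is a hypothetical character-frequency stub standing in for a real embedding model, and the "index" is a plain in-memory array rather than an ANN structure; the point is the shape of the contract, not the implementation.

```ts
type Chunk = { id: string; text: string; vec: number[] };

// Toy embedding: normalized character-frequency vector.
// Real systems call an embedding model API here.
function embed(text: string): number[] {
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) vec[i]++;
  }
  const norm = Math.hypot(...vec) || 1;
  return vec.map((x) => x / norm);
}

function cosine(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Offline half: ingest -> chunk -> embed -> upsert.
function buildIndex(docs: { id: string; text: string }[]): Chunk[] {
  return docs.map((d) => ({ id: d.id, text: d.text, vec: embed(d.text) }));
}

// Online half: embed query -> retrieve top-k by similarity.
function retrieve(index: Chunk[], query: string, k: number): Chunk[] {
  const q = embed(query);
  return [...index]
    .sort((a, b) => cosine(q, b.vec) - cosine(q, a.vec))
    .slice(0, k);
}
```

Note the asymmetry this makes concrete: `buildIndex` can run for hours offline, while `retrieve` sits on the user-facing latency budget.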

Chunking: the most under-discussed lever

How you split documents into chunks determines retrieval quality more than which embedding model you pick. Too small and chunks lose context (a sentence about pricing without the surrounding section header is useless). Too large and the embedding becomes a smear that does not match any specific query.

| Strategy | When to use | Tradeoff |
| --- | --- | --- |
| Fixed-size (e.g. 512 tokens) | Homogeneous corpus, fast to ship | Splits mid-thought, hurts coherence |
| Recursive by separators | Mixed prose, default choice | Variable chunk size, harder to budget context window |
| Section-aware (heading-based) | Structured docs (manuals, wikis) | Requires reliable structure parsing |
| Semantic (embedding-similarity splits) | High-value corpus, expensive offline | Needs an embedding pass per candidate split |
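The "recursive by separators" row is the default for a reason, and it is worth being able to sketch on the board. A minimal version, assuming a character budget rather than a token budget (production code counts tokens):

```ts
// Try coarse separators first (paragraphs), fall back to finer ones
// (lines, sentences, words) when a piece is still over budget.
function splitRecursive(
  text: string,
  maxLen: number,
  separators: string[] = ["\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= maxLen) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut as a last resort.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) {
      out.push(text.slice(i, i + maxLen));
    }
    return out;
  }
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxLen) {
      current = candidate; // greedily pack pieces into the budget
    } else {
      if (current) chunks.push(current);
      if (piece.length <= maxLen) {
        current = piece;
      } else {
        // Piece alone exceeds the budget: recurse with finer separators.
        chunks.push(...splitRecursive(piece, maxLen, rest));
        current = "";
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The interview-relevant property is the guarantee: every chunk respects the budget, and splits happen at the coarsest boundary that still fits.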

Embeddings and vector index choice

Pick the embedding model and the index together — they have to agree on dimensionality and distance metric. For 200,000 documents at roughly 4 chunks each (800k chunks total), a 1024-dimensional embedding gives you about 3.3 GB of float32 vectors, or about 800 MB at int8 quantization. Both fit comfortably in a single-node HNSW index.
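The sizing arithmetic is worth doing out loud on the board. A quick check of the numbers above:

```ts
// Back-of-envelope index sizing for the example corpus.
const docs = 200_000;
const chunksPerDoc = 4;
const dims = 1024;

const chunks = docs * chunksPerDoc;      // 800,000 vectors
const float32Bytes = chunks * dims * 4;  // 4 bytes per float32 component
const int8Bytes = chunks * dims * 1;     // 1 byte per int8 component

console.log((float32Bytes / 1e9).toFixed(2) + " GB"); // 3.28 GB
console.log((int8Bytes / 1e6).toFixed(0) + " MB");    // 819 MB
```

A single machine with 16 GB of RAM holds either comfortably, which is what makes the single-node HNSW claim credible.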

| Index | Recall vs latency | When to pick it |
| --- | --- | --- |
| FAISS Flat | Exact, slow at scale | ≤100k vectors, eval baseline |
| HNSW | High recall, ~5-50 ms at 1M scale | Most production RAG |
| IVF + PQ | Lower recall, smaller memory footprint | ≥10M vectors, memory-bound |
| pgvector (HNSW) | Co-located with relational data | When metadata filtering is critical |

Name the cost: a managed vector DB at this scale runs roughly $200-500/month. A self-hosted HNSW index on a single beefy box is $50-100/month plus your operational overhead. For a 50 QPS peak with a cited-sources requirement, the answer is almost always "managed for v1, evaluate self-hosting for v2." Saying that out loud is the senior tradeoff articulation interviewers are scoring.

Retrieval and re-ranking

First-stage retrieval pulls the top-k (typically k=20-50) by vector similarity. This is fast but coarse. The second stage — re-ranking — takes those candidates and scores them with a more expensive cross-encoder that sees both the query and the candidate together. Re-ranking is where retrieval quality jumps the most. Without it, you ship a system that finds 'somewhat related' passages and confuses the LLM.

```ts
// Online query path, simplified.
const queryVec = await embed(query);

// First stage: fast, coarse ANN search with metadata filters.
const candidates = await vectorIndex.search(queryVec, {
  k: 30,
  filter: { lang: "en", visibility: "public" },
});

// Second stage: cross-encoder scores each (query, passage) pair;
// pair scores back with their candidates and keep the five best.
const scores = await reranker.score(query, candidates.map((c) => c.text));
const topPassages = candidates
  .map((c, i) => ({ ...c, score: scores[i] }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 5);

const answer = await llm.generate({
  system: SYSTEM_PROMPT,
  context: topPassages.map((p) => p.text).join("\n---\n"),
  question: query,
  cite: topPassages.map((p) => p.id),
});
```

Generation, citations, and refusal behavior

The LLM call is the most expensive component per query and the only one with non-deterministic output. Pin the model, pin the temperature (0 for factual, 0.2 for explanatory), pin the max output tokens. Engineer the prompt to require citations as inline tags and a JSON tail listing the cited passage IDs — that lets the application layer verify every claim against retrieved context, and refuse if the model cited a passage you did not actually retrieve.

  1. If retrieval returned zero high-confidence passages (top reranker score below threshold), refuse and surface the closest match as a suggestion — do not let the model hallucinate.
  2. Strip and validate citations against the retrieval set before returning to the user.
  3. Cap context window usage at 60-70% of the model max — leave room for the generated answer and the system prompt.
  4. Log every (query, retrieved IDs, response, citations) tuple — this is the eval-harness ground truth.
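Steps 1 and 2 above can be sketched as a single guardrail function. The threshold value, type shapes, and field names here are illustrative assumptions, not a real API:

```ts
type Passage = { id: string; text: string; score: number };
type ModelOutput = { answer: string; citedIds: string[] };

// Assumed reranker-score threshold; in practice, tune it on the gold set.
const MIN_RERANK_SCORE = 0.5;

type Guarded =
  | { kind: "answer"; answer: string; citations: string[] }
  | { kind: "refuse"; suggestion?: string };

function guardAnswer(passages: Passage[], output: ModelOutput): Guarded {
  // 1. Refuse when nothing retrieved clears the confidence bar,
  //    surfacing the closest match as a suggestion.
  const confident = passages.filter((p) => p.score >= MIN_RERANK_SCORE);
  if (confident.length === 0) {
    return { kind: "refuse", suggestion: passages[0]?.text };
  }

  // 2. Keep only citations that point at passages we actually retrieved.
  const retrievedIds = new Set(confident.map((p) => p.id));
  const citations = output.citedIds.filter((id) => retrievedIds.has(id));

  // An answer with no surviving citations is unverifiable: refuse.
  if (citations.length === 0) return { kind: "refuse" };
  return { kind: "answer", answer: output.answer, citations };
}
```

The key property: the model can only cite what the application layer can verify, so a hallucinated citation is dropped before the user ever sees it.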

Evaluation: the part most candidates skip

An ML system design answer that does not include an eval harness is incomplete. The interviewer is waiting for it. RAG eval is two stages: retrieval eval (do we retrieve the gold passage in top-k?) measured by recall@k and MRR; and generation eval (is the answer faithful and complete?) measured by LLM-as-judge against a held-out gold set, with periodic human spot-checks.

| Metric | Stage | Target |
| --- | --- | --- |
| Recall@5 | Retrieval | ≥ 0.85 |
| MRR | Retrieval | ≥ 0.6 |
| Faithfulness | Generation | ≥ 0.95 |
| Answer relevance | Generation | ≥ 0.85 |
| Refusal rate on out-of-corpus questions | End-to-end | ≥ 0.9 |
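The two retrieval metrics are a few lines each, which is exactly why skipping them in an interview looks bad. A minimal sketch over a gold set of (ranked retrieved IDs, gold passage ID) pairs:

```ts
// One eval example: the ranked IDs the retriever returned for a query,
// plus the ID of the gold passage a human labeled as the right source.
type EvalExample = { retrieved: string[]; goldId: string };

// Recall@k: fraction of queries whose gold passage appears in the top k.
function recallAtK(examples: EvalExample[], k: number): number {
  const hits = examples.filter((e) =>
    e.retrieved.slice(0, k).includes(e.goldId)
  );
  return hits.length / examples.length;
}

// MRR: mean of 1 / (rank of the gold passage), 0 when it is missed entirely.
function mrr(examples: EvalExample[]): number {
  const sum = examples.reduce((s, e) => {
    const rank = e.retrieved.indexOf(e.goldId);
    return s + (rank >= 0 ? 1 / (rank + 1) : 0);
  }, 0);
  return sum / examples.length;
}
```

Run these on every index rebuild; a chunking or embedding change that drops Recall@5 below target should fail the offline pipeline before it ever reaches users.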
"I have rejected senior candidates who designed beautiful pipelines and could not tell me how they would know if the pipeline was getting better. The eval harness is the system. Everything else is plumbing."
— Engineering manager, AI infrastructure team

The operational story: cost, latency, and rollout

Most ML system design candidates underweight the operational story. The interviewer wants to hear how this system gets to production, how you would roll it out behind a feature flag, how you would measure quality after launch, and what would page someone at three in the morning. Without that conversation, the design reads as a research prototype.

  1. Rollout: launch behind a feature flag at one percent of traffic. Compare the retrieval-only baseline against the new pipeline in an A/B test. Promote at five, twenty-five, and finally one hundred percent over two weeks.
  2. Cost monitoring: dashboard cost-per-query in real time, alert on percentile spikes, kill switch on per-tenant cost overruns.
  3. Latency monitoring: P50, P95, P99 broken down by stage (embed, retrieve, rerank, generate). The hot path is almost always the LLM call; pinning a smaller fallback model is the relief valve.
  4. Quality monitoring: log every query with retrieved IDs, generated answer, and citation set. Sample 1% for human review weekly. LLM-as-judge runs nightly on the held-out gold set.

The cost story for the example pipeline (50 QPS peak, 800k chunks, 1024-dim embeddings, mid-tier reranker, frontier LLM): roughly $0.012 per query with the small-model fallback active, $0.018 with the frontier model on the critical path. At one million queries per month that is $12,000-18,000 in inference cost — comparable to the entire compute budget for a small product line. Naming the budget out loud is the kind of grounded business awareness that lifts a Staff+ rubric score.

The final operational call: have an opinion on what the team building this should look like. One ML engineer, one platform engineer, and a half-time PM is the minimum for a serious production deployment. Without naming the team shape, you are designing a system that has no humans behind it — interviewers notice.

Stop grinding. Start patterning.

Alpha Code is a patterns-first interview prep platform — coding, system design, behavioral, mocks, and ML/AI engineering all under one $19/mo subscription.