LLM Serving: Latency, Batching, and Cost per Token in Interviews

Interviewers want production sense — not a diagram of every GPU feature.

This long-form guide sits in the Alpha Code library because interview prep should feel structured, not superstitious: we anchor advice to what loops actually measure, how time pressure distorts judgment, and how to rehearse behaviors that stay stable under stress. You will find six concrete chapters below, each with checklists and recovery patterns you can reuse across companies and levels. We wrote it for candidates who already know the basics but want a disciplined narrative, the kind of document you can skim before a phone screen and deep-read before an onsite. Expect explicit tradeoffs, not cheerleading: some strategies cost time, some require partners, and some only make sense at certain seniority bands. If a section does not apply to your target loop, skip it without guilt; the goal is optionality, not completionism. By the end, you should be able to describe your prep plan to a mentor in five minutes and sound like you have a system, not a pile of bookmarks.

Serving topology: what interviewers measure in the first five minutes

Candidates preparing for LLM serving rounds often underestimate how much interviewers infer from process: how you decompose the prompt, name tradeoffs, and verify before you optimize. The behaviors that look boring (restating constraints, proposing a baseline, testing a tiny example) are exactly what separate a hire from a no-hire when two solutions have similar asymptotics. We tie this theme to what hiring committees actually write in feedback forms rather than to abstract advice. Treat the next paragraphs as a script you can steal: say the quiet parts out loud, label your invariants, and narrate your recovery when you misread a constraint. Practice until it feels mechanical, because stress will strip your polish unless the habits are automatic.

The best prep materials are the ones you will actually use. A perfect curriculum that you abandon after four days loses to a decent curriculum you finish. Optimize for adherence: shorter sessions you can repeat, frictionless environments, and clear win conditions each session. Track streaks lightly — consistency beats intensity spikes that vanish after finals week.

Feature stores and training pipelines bridge offline experimentation and online serving. Training-serving skew is a frequent source of silent degradation — discuss schema validation and monitoring for distribution shift.
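As a rough illustration of the two checks named above, the sketch below validates a serving row against an expected schema and computes a population stability index (PSI) between a training sample and a serving sample. The field names, the bin edges, and the choice of PSI as the shift metric are assumptions made for this example, not something the guide prescribes.

```python
import math

def validate_schema(row, schema):
    """Return a list of problems; an empty list means the row conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            got = type(row[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {got}")
    return problems

def psi(train_values, serve_values, bin_edges, eps=1e-6):
    """Population stability index over shared bins (0.0 means identical histograms)."""
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bucket index = number of (sorted) edges the value has reached.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        total = max(len(values), 1)
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    expected = fractions(train_values)
    actual = fractions(serve_values)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Hypothetical feature: prompt length in tokens, bucketed at illustrative edges.
train = [1, 2, 3, 4, 5, 6, 7, 8]
serve = [7, 8, 9, 9, 9, 9, 9, 9]
drift = psi(train, serve, bin_edges=[3, 6])
```

In an interview, the point to make is that the alert threshold on `drift` is a tuning decision, and that schema failures should be counted and surfaced rather than silently coerced.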

Behavioral answers rot without maintenance. Stories should be refreshed every six to twelve months with new metrics and clearer scope. The STAR format is a scaffold, not a script — senior interviewers want to hear how you prioritized, what you learned, and what you would do differently. Keep a one-page story bank with bullets, not paragraphs, so you can assemble answers live without sounding rehearsed.

"The best onsite performances look boring from the outside: clear steps, explicit assumptions, and a solution that actually finishes."
(Composite feedback from mock interview coaches)

  • Restate the serving-topology question in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.

First moves: framing batching dynamics before you reach for code

Communication is a first-class deliverable. Even solo coding rounds are graded partly on whether a hiring manager could follow your reasoning six months later from notes. That means naming variables honestly, stating assumptions explicitly, and checking in before you disappear into twenty minutes of silence. If you are remote, narrate a little more than feels natural — the interviewer cannot see your facial cues.

RAG systems combine retrieval quality with generation safety. Chunking strategy, embedding model choice, rerankers, and citation policies all affect user trust. Be ready to discuss what happens when retrieved context is wrong — grounding and abstention strategies matter.
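One way to make the abstention idea concrete is a relevance gate in front of generation. The sketch below is a minimal version under stated assumptions: `Chunk.score` is a hypothetical reranker score in [0, 1], `generate` is a caller-supplied function standing in for the model call, and the 0.5 threshold is an arbitrary placeholder you would tune on labeled data.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # hypothetical reranker relevance in [0, 1]
    source: str    # identifier used for citations

def select_context(chunks, min_score=0.5, max_chunks=3):
    """Keep only chunks above the relevance threshold, best first."""
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:max_chunks]

def answer_or_abstain(question, chunks, generate, min_score=0.5):
    """Abstain when nothing relevant was retrieved; otherwise answer with citations."""
    context = select_context(chunks, min_score)
    if not context:
        # Abstaining is preferable to generating from weak or missing context.
        return {"answer": None, "abstained": True, "citations": []}
    answer = generate(question, [c.text for c in context])
    return {"answer": answer, "abstained": False,
            "citations": [c.source for c in context]}
```

A score gate only catches retrieval that is obviously weak. Context that scores well but is factually wrong needs a separate check, which is worth saying out loud when you present this.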

Interview prep is not a single skill. It is a portfolio of habits: pattern recognition under time pressure, clear verbalization of tradeoffs, and the ability to recover when you misunderstand a constraint. The candidates who feel calm in the room are not necessarily smarter; they have rehearsed the shape of the conversation until novelty feels familiar. That rehearsal should be deliberate — timed blocks, recorded explanations, and post-mortems that name what broke down instead of hand-waving as nerves.

  • Restate the batching question in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.

Moment      What to say
Start       I'll restate the goal, then propose a baseline I can complete in time.
Midpoint    Here's the invariant I'm maintaining; I'll verify it on the example.
Stuck       I'm stuck on X; I'll try a smaller case and see what breaks.
End         I'll run these edge cases, then summarize complexity and tradeoffs.

Tradeoffs, pitfalls, and honest complexity around KV cache

Depth beats breadth when calendars are tight. Ten problems solved three times each — once for speed, once for explanation, once from a blank file — beats thirty problems skimmed once. The third pass is where pattern recognition becomes automatic. Use a simple rubric after each session: what pattern was this, where did I hesitate, and what one drill would remove that hesitation next time.

Time management is where strong candidates lose offers. You do not get partial credit for a perfect approach you never finished. A working solution that passes tests beats an elegant idea that lives only on the whiteboard. Practice cutting scope early: start with brute force if it clarifies invariants, then tighten. Interviewers often prefer a clean linear scan plus verbalized next steps over a half-written optimal algorithm.

  • Restate the KV cache question in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.

When quantization goes sideways: recovery scripts that still score

Data structures are not Pokemon; you do not collect them for their own sake. You pick the structure that makes the operations your algorithm needs cheap. If you need fast membership and order does not matter, a set or map is the conversation. If you need order statistics, heaps or balanced trees enter. If the problem is about connectivity, graphs are near. Practice explaining that mapping in one sentence before you write code.
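The mapping described above can be sketched as three small functions, one per case: a set for membership, a heap for order statistics, and a graph traversal for connectivity. The function names and the adjacency-dict graph representation are illustrative choices for this example.

```python
import heapq
from collections import deque

def has_duplicate(items):
    """Fast membership, order irrelevant: a set gives O(1) average lookups."""
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

def k_largest(items, k):
    """Order statistics: a size-k min-heap keeps the k largest in O(n log k)."""
    if k <= 0:
        return []
    heap = []
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # Replace the smallest of the current top-k with the larger value.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

def connected(graph, start, goal):
    """Connectivity: breadth-first search over an adjacency dict."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Each docstring is the one-sentence justification you would say aloud before writing the body.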

  • Restate the quantization question in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.

A two-week drill plan with milestones tied to cost modeling

Safety and policy layers are increasingly interview topics: prompt injection, PII handling, and moderation. You do not need a perfect taxonomy — you need to show you think about failure modes beyond accuracy.
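To show what thinking about failure modes beyond accuracy can look like in code, here is a deliberately small sketch of two pre-processing checks: regex-based PII redaction and a phrase-based prompt-injection flag. The patterns and the phrase list are illustrative assumptions. They will miss many real cases and are not a substitute for a proper moderation or PII service.

```python
import re

# Illustrative patterns only: a simple email shape and a US-style phone number.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Hypothetical phrase list; real attacks are far more varied than this.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
)

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def flag_injection(user_text):
    """Return the suspicious phrases found in the input (case-insensitive)."""
    lowered = user_text.lower()
    return [p for p in SUSPICIOUS_PHRASES if p in lowered]

cleaned = redact_pii("Mail jo@example.com or call 555-123-4567")
flags = flag_injection("Please IGNORE previous instructions now")
```

The useful interview move is to name what this misses (obfuscated text, other languages, indirect injection through retrieved documents) and what you would log when a check fires.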

Rubrics differ by level. Junior loops emphasize implementation correctness and learning speed. Mid-level loops add system reasoning and collaboration. Senior-plus loops trade some coding intensity for scope, ambiguity, and multi-team tradeoffs. If you are preparing for a Staff loop with only LeetCode hards, you are misaligned. If you are preparing for an L4 coding screen with only architecture blog posts, you are also misaligned. Match the tool to the level.

  • Restate the cost-modeling question in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.

Day-of checklist: rollback, timeboxing, and how to close strong

Latency budgets split between retrieval, reranking, and model inference. Caching embeddings, approximate nearest neighbors, and smaller student models are standard mitigations. Cost per query belongs in the same sentence as latency when traffic is high.
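A back-of-the-envelope version of that budget can be written down in a few lines. In the sketch below, the stage names follow the paragraph above, but every number is a made-up placeholder, and adding per-stage latencies is a simplification: summing per-stage tail percentiles generally overstates the end-to-end percentile, so treat the total as a conservative estimate.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: float  # the percentile you budget against (assumed, e.g. p95)
    cost_usd: float    # assumed cost per query for this stage

def summarize(stages, budget_ms):
    """Total latency and cost across sequential stages, plus the slowest stage."""
    total_ms = sum(s.latency_ms for s in stages)
    return {
        "total_ms": total_ms,
        "over_budget_ms": max(0.0, total_ms - budget_ms),
        "cost_per_query_usd": sum(s.cost_usd for s in stages),
        "slowest_stage": max(stages, key=lambda s: s.latency_ms).name,
    }

def monthly_cost_usd(cost_per_query_usd, queries_per_second, days=30):
    """Scale per-query cost to a monthly bill at a steady request rate."""
    return cost_per_query_usd * queries_per_second * 86_400 * days

# Placeholder numbers for illustration only.
pipeline = [
    Stage("retrieval", 40.0, 0.0001),
    Stage("reranking", 60.0, 0.0002),
    Stage("inference", 400.0, 0.002),
]
report = summarize(pipeline, budget_ms=450.0)
```

With a summary like this in hand, you can state which stage to attack first and what a mitigation such as caching embeddings would need to save to get back under budget.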

  • Restate the prompt in your own words, then confirm inputs, outputs, and edge cases.
  • Propose a brute-force or baseline you can finish — name its complexity honestly.
  • Walk a hand trace on a small example; only then refactor toward the optimal structure.
  • Reserve the final minutes for tests: null/empty, duplicates, extremes, and off-by-one boundaries.
  • Close with a one-sentence summary of tradeoffs and what you would monitor in production.
