"Design ChatGPT" has quietly become one of the most common system design interview questions, and it's a great one: it looks like a chat app, but the interesting parts are nothing like one. The response isn't a row you read from a database — it's generated token by token by a model running on a GPU, costs real money per request, takes seconds (not milliseconds), and can fail or misbehave in ways a CRUD service never does. This piece walks the whole design end to end: gathering requirements, sizing the load, the request path, streaming, conversation storage, the inference layer, an LLM gateway, memory and retrieval, safety, cost, and the tradeoffs an interviewer will push on.

⚡ Quick Takeaways
  • The defining constraint is the response: slow, streamed, and expensive. A reply takes seconds and is produced token-by-token, so the whole architecture optimizes for time-to-first-token and streaming, not request/response latency.
  • Streaming is the core UX — use Server-Sent Events (SSE) to push tokens as they're generated; the connection stays open for the duration of a single answer.
  • Separate the stateless API tier from the GPU inference tier. They scale on completely different axes (cheap CPU boxes vs scarce, expensive GPUs) and have different failure modes.
  • The model is stateless; you rebuild context every turn. "Memory" is the harness re-sending conversation history (plus retrieved facts) on each request, within a finite context window.
  • Put a gateway in front of inference for routing, per-user rate limits and token quotas, prompt caching, and provider/model fallback.
  • Cost and capacity are dominated by GPUs. Continuous batching, KV-cache reuse, quantization, and routing cheap queries to small models are the levers that keep it affordable.
  • It can produce harmful or wrong output, so moderation, abuse limits, and evals are first-class components, not afterthoughts.
tldr

A chat assistant is a streaming front door (SSE) over a stateless API tier, backed by a GPU inference tier fronted by an LLM gateway. Conversations live in a regular database; "memory" is just resending history each turn within the context window, optionally augmented by retrieval (RAG). The hard parts are all consequences of one fact — answers are generated slowly and expensively on GPUs — which forces streaming, continuous batching, caching, quotas, and careful cost control. Safety and evals run alongside the request path.

STATELESS APP TIER DATA STORES LLM SERVING · GPU TIER OBSERVABILITY · EVALS · FEEDBACK SSE history top-k prompt token stream Client API Gateway · Load Balancer Chat Orchestrator ×N stateless · holds SSE · builds prompt Input Moderation Embedder Conversation DB Vector DB · RAG Object Store (files / blobs) LLM Gateway route · token quota · fallback Semantic / prompt cache Inference Queue GPU pools continuous batching · KV cache small-model pool large-model pool autoscaled · ~1,000 GPUs @ peak Metrics Evals Feedback
End-to-end architecture — a stateless app tier (auth, orchestration, moderation, embedding) over data stores and a GPU serving tier (gateway → queue → batched GPU pools); tokens stream back over SSE, and the path emits to observability & evals

Step 1 — Clarify the Requirements

As always, start by scoping. The interviewer wants to see you separate the chat product from the LLM plumbing, and decide what's in scope. A focused set:

Functional requirements

Non-functional requirements

interview tip

Say out loud that you're treating the model as a black box that streams tokens, slowly and for a per-token cost. That single framing drives every later decision — streaming transport, batching, quotas, caching — and signals you understand what makes this different from designing a messaging app.

Step 2 — Back-of-the-Envelope Estimates

Rough numbers keep the design honest, and here they expose why GPUs dominate. Assume 10M daily active users, ~10 messages each → 100M messages/day~1,200 messages/sec average, call it ~5,000/sec peak. Each answer averages, say, 500 output tokens.

The takeaway to state explicitly: this is a compute-bound system, not a storage- or QPS-bound one. Most of the architecture exists to use those ~1,000 GPUs efficiently.

Step 3 — High-Level Architecture

Split the system into a stateless application tier (cheap to scale, handles auth, conversations, and the streaming connection) and a GPU inference tier (scarce, expensive, fronted by a gateway and a queue). Keeping them separate is the single most important structural decision.

request path — clients to GPUs
client ──SSE──▶ API gateway / LB
                    │
                    ▼
            chat service (stateless)
             ├─ auth, rate-limit check
             ├─ load conversation history  ◀── Conversation DB
             ├─ (optional) retrieve context ◀── Vector DB (RAG)
             ├─ moderate input
             └─ build prompt ──▶ LLM gateway
                                   ├─ route by model/tier
                                   ├─ token quota check
                                   ├─ prompt / semantic cache  ◀── Cache
                                   └─ enqueue ──▶ inference queue
                                                     │
                                                     ▼
                                          GPU inference workers (batched)
                                                     │  tokens stream back
                                   ◀────────────── token stream ──────────┘
            chat service relays tokens ──SSE──▶ client
                    │ on completion
                    └─ persist assistant message ──▶ Conversation DB

Key components: an API/chat service (stateless, holds the SSE connection and orchestrates a turn), a Conversation DB (durable message history), a Vector DB (optional, for retrieval), an LLM gateway (routing, quotas, caching, fallback), an inference queue, and the GPU inference workers that actually run the model. A separate moderation path and an offline eval/analytics pipeline sit alongside.

Step 4 — Streaming the Response

The defining UX choice. Because an answer takes seconds, you must not make users stare at a spinner — you stream tokens as they're generated. The standard transport is Server-Sent Events (SSE): a single long-lived HTTP response where the server pushes data: events. SSE fits perfectly because the stream is one-directional (server → client) for the duration of an answer, it's plain HTTP (works through proxies and load balancers), and it auto-reconnects.

TransportFit for token streaming
SSEIdeal: one-way server push over plain HTTP, simple, proxy-friendly. The default choice.
WebSocketWorks, but bidirectional and heavier than needed; useful if you want rich duplex (live voice, interrupts) on the same channel.
Long pollingFallback only — re-establishing a connection per chunk is wasteful for token-level streaming.
the chat service relays a token stream over SSE
async def stream_reply(conversation_id, user_msg):
    history = db.load_history(conversation_id)
    prompt  = build_prompt(history, user_msg)     # rebuild context every turn
    full = []
    async for token in gateway.generate(prompt):  # tokens arrive from a GPU worker
        full.append(token)
        yield f"data: {token}\n\n"             # SSE frame, flushed immediately
    db.save(conversation_id, user_msg, "".join(full))  # persist after stream ends
    yield "data: [DONE]\n\n"

Two consequences worth raising in an interview. First, the connection is stateful for the duration of one answer — the chat box holding it must stay alive ~5s per request, so a single box handles far fewer concurrent requests than a typical stateless API; size the fleet on concurrency, not QPS. Second, "stop generation" is a real feature: the client closes the SSE stream, and the chat service must propagate cancellation down to the GPU worker so it stops decoding and frees the slot — otherwise you keep paying for tokens nobody will read.

Step 5 — Conversations and State

This part is refreshingly normal. Conversations and messages are classic relational/document data with a simple schema; the volume (~100GB/day of text) is small. The interesting design choice is how history feeds back into the model.

conversation schema
conversations(id, user_id, title, created_at, updated_at)
messages(id, conversation_id, role, content, token_count, created_at)
              role ∈ {"user", "assistant", "system", "tool"}

# read pattern: latest N messages for a conversation, ordered by created_at
# partition/shard by conversation_id (or user_id) — reads are per-thread

Partition by conversation_id (or user_id): every read is "give me this thread's messages," so co-locating a conversation avoids cross-shard reads. The store can be Postgres or a wide-column/document DB — the access pattern is simple key-ordered reads, so almost anything works. Because the model has a finite context window, you don't blindly send the entire thread: you send the most recent messages that fit, and for very long conversations you summarize older turns into a compact running summary that's prepended — trading some fidelity for staying within the window (and within token cost).

Step 6 — The Inference Layer

This is the heart of the system and what makes it different from any CRUD design. GPU workers run the model and turn a prompt into a stream of tokens. Three ideas dominate how you make this efficient.

Continuous batching

GPUs are massively parallel, so running one request at a time wastes them. Inference servers (vLLM, TGI, TensorRT-LLM) use continuous (in-flight) batching: they merge many users' requests into one batch on the GPU and, crucially, swap finished sequences out and new ones in each decoding step rather than waiting for the whole batch to finish. This keeps the GPU saturated and is the single biggest throughput lever.

The KV cache

Generating each new token requires attention over all previous tokens. Recomputing that every step would be quadratic, so workers keep a KV cache — the attention keys/values for the prompt and tokens so far — in GPU memory. This makes decoding fast but means GPU memory per in-flight request grows with context length; the KV cache, not compute, is often what caps how many concurrent streams a GPU can hold. Prefix caching reuses the KV cache for a shared prompt prefix (e.g. the same system prompt across users), cutting the cost of the prompt-processing ("prefill") phase.

queued req E req F GPU worker — one batch, decoded in lockstep seq A seq B seq C seq D done → swap out KV cache (grows with context) next token tokens out
Continuous batching & the KV cache — finished sequences swap out and queued ones swap in each decoding step; each sequence's KV cache (GPU memory) grows with its context length

Prefill vs decode

A request has two phases with different cost profiles: prefill (process the whole prompt in parallel — compute-heavy, fast, determines TTFT) and decode (generate tokens one at a time — memory-bandwidth-bound, sequential, determines tokens/sec). Some systems even disaggregate them onto different GPU pools. You don't need that depth in an interview, but naming the two phases and that TTFT comes from prefill shows real understanding.

key point

Treat GPUs as a scarce, batched, memory-constrained pool, not as ordinary stateless workers. Throughput comes from continuous batching; concurrency is capped by KV-cache memory; latency (TTFT) comes from prefill. These three facts explain most of the rest of the design.

Step 7 — The LLM Gateway

Don't let the chat service call GPU workers directly. Put a gateway in between — the same pattern as an API gateway, specialized for models. It centralizes the cross-cutting concerns that every request needs.

gateway: route, quota, cache, then dispatch
def handle(req):
    if not quota.allow(req.user, est_tokens=req.size()):
        return error(429)                  # token budget exceeded
    if hit := cache.lookup(req.prompt):       # exact or semantic match
        return hit                          # 0 GPU cost
    model = router.pick(req)               # small vs large, by difficulty/tier
    try:
        return pool[model].enqueue(req)      # into the batched inference queue
    except Saturated:
        return pool[fallback].enqueue(req)  # degrade, don't fail

Step 8 — Memory, Context, and RAG

Users perceive the assistant as having memory, but the model is stateless — "memory" is something the harness rebuilds every turn by choosing what to put in the context window. There are three tiers worth distinguishing:

All three are the same move: fetch the right text and put it in the window before calling the model. The design implication is a retrieval step (embedding + vector search) on the request path, and a budget decision — how many tokens of history vs retrieved context vs room for the answer. (Retrieval at scale is its own design; this is the hook for a RAG system deep dive.)

Step 9 — Safety and Abuse

Unlike a CRUD app, this system can emit harmful content and is a magnet for abuse, so safety is a real subsystem. On the way in, a moderation check (a classifier, often a smaller model) screens for disallowed input and prompt-injection attempts. On the way out, generated tokens can be screened too — though streaming makes this tricky, since you've already sent earlier tokens; a common compromise is to moderate in small windows and cut the stream if it crosses a line. Around all of it sit per-user rate limits and token budgets (to stop scraping and cost-bombing), authentication, and audit logging. Treat the gateway and moderation as the policy-enforcement layer the model itself can't provide.

Step 10 — Cost and Latency Optimization

Because ~1,000 GPUs dominate the bill, cost optimization is a design feature, not an afterthought. The main levers:

LeverWhat it buys
Continuous batchingHighest GPU utilization → most tokens per GPU. The biggest single win.
Model tiering / routingCheap small model for easy/short queries; reserve the large model for hard ones.
Prompt & semantic cachingRepeat or near-repeat questions cost ~0 GPU.
Prefix caching (KV reuse)Shared system prompts / long docs aren't re-processed per request.
Quantization (e.g. 8-/4-bit)More requests per GPU and lower memory, at a small quality cost.
Token caps & summarizationBounding context and output length directly bounds cost per turn.

On latency, the metric users feel is TTFT (driven by queue wait + prefill), and after that inter-token latency (decode speed). You improve TTFT by keeping queues short (autoscale GPU pools, shed load to smaller models under pressure) and by prefix-caching long shared prompts so prefill is cheaper. Streaming hides total latency: as long as words start flowing in ~1s and keep coming faster than the user reads, a 6-second answer feels fine.

Step 11 — Bottlenecks and Tradeoffs

Close by naming where this design strains and what you'd trade — interviewers reward this more than another box on the diagram.

takeaway

Strip away the LLM mystique and a chat assistant is a streaming front door over a scarce GPU pool. Get four things right and the rest follows: stream tokens (SSE), separate the stateless API tier from the GPU tier, rebuild context every turn within the window, and front inference with a gateway that does routing, token quotas, caching, and fallback. Everything else — batching, KV cache, moderation, cost levers — exists to serve those expensive tokens efficiently and safely.

🎯 interview hot-takes

Why is this different from a chat app? The response is generated token-by-token on a GPU — slow and costly — so you optimize for time-to-first-token and streaming, and the GPU tier dominates capacity and cost.
What transport for streaming? SSE — one-way server push over plain HTTP, simple and proxy-friendly. WebSocket only if you need rich bidirectional (voice/interrupts).
How does "memory" work? The model is stateless; you resend recent history (summarizing old turns) plus any retrieved facts each turn, within the context window.
What's the biggest throughput lever? Continuous batching on the GPU workers; concurrency is then capped by KV-cache memory, and TTFT by the prefill phase.
How do you control cost? Model tiering, prompt/semantic caching, prefix (KV) caching, quantization, token quotas, and summarizing long contexts.
How do you handle a capacity spike? Queue with backpressure and degrade to smaller models — turn a GPU shortage into slower service, not an outage; admission-control free users first.

← prev
Design a Chat App