Design ChatGPT — an AI Chat Assistant at Scale

"Design ChatGPT" has quietly become one of the most common system design interview questions, and it's a great one: it looks like a chat app, but the interesting parts are nothing like one. The response isn't a row you read from a database — it's generated token by token by a model running on a GPU, costs real money per request, takes seconds (not milliseconds), and can fail or misbehave in ways a CRUD service never does. This piece walks the whole design end to end: gathering requirements, sizing the load, the request path, streaming, conversation storage, the inference layer, an LLM gateway, memory and retrieval, safety, cost, and the tradeoffs an interviewer will push on.

⚡ Quick Takeaways

The defining constraint is the response: slow, streamed, and expensive. A reply takes seconds and is produced token-by-token, so the whole architecture optimizes for time-to-first-token and streaming, not request/response latency.
Streaming is the core UX — use Server-Sent Events (SSE) to push tokens as they're generated; the connection stays open for the duration of a single answer.
Separate the stateless API tier from the GPU inference tier. They scale on completely different axes (cheap CPU boxes vs scarce, expensive GPUs) and have different failure modes.
The model is stateless; you rebuild context every turn. "Memory" is the harness re-sending conversation history (plus retrieved facts) on each request, within a finite context window.
Put a gateway in front of inference for routing, per-user rate limits and token quotas, prompt caching, and provider/model fallback.
Cost and capacity are dominated by GPUs. Continuous batching, KV-cache reuse, quantization, and routing cheap queries to small models are the levers that keep it affordable.
It can produce harmful or wrong output, so moderation, abuse limits, and evals are first-class components, not afterthoughts.

tldr

A chat assistant is a streaming front door (SSE) over a stateless API tier, backed by a GPU inference tier fronted by an LLM gateway. Conversations live in a regular database; "memory" is just resending history each turn within the context window, optionally augmented by retrieval (RAG). The hard parts are all consequences of one fact — answers are generated slowly and expensively on GPUs — which forces streaming, continuous batching, caching, quotas, and careful cost control. Safety and evals run alongside the request path.

End-to-end architecture — a stateless app tier (auth, orchestration, moderation, embedding) over data stores and a GPU serving tier (gateway → queue → batched GPU pools); tokens stream back over SSE, and the path emits to observability & evals

Step 1 — Clarify the Requirements

As always, start by scoping. The interviewer wants to see you separate the chat product from the LLM plumbing, and decide what's in scope. A focused set:

Functional requirements

A user sends a message in a conversation and gets a model-generated reply, streamed back as it's produced.
Conversations are persistent and multi-turn — the assistant remembers earlier messages in the same thread.
Users can list past conversations, resume them, rename, and delete.
Users can stop a generation mid-stream and regenerate a response.
(Stretch) The model can call tools (web search, code execution) and ground answers in retrieved documents.

Non-functional requirements

Low time-to-first-token (TTFT) — the user should see words within ~1 second; total answers take many seconds, which is fine if streaming.
High availability for the API/chat tier; graceful degradation when GPU capacity is saturated (queue, or fall back to a smaller model).
Scale to millions of daily users and tens of thousands of concurrent in-flight generations.
Cost control — GPU inference is the dominant cost; the design must actively manage it.
Safety — filter abusive input and harmful output; enforce per-user limits.

interview tip

Say out loud that you're treating the model as a black box that streams tokens, slowly and for a per-token cost. That single framing drives every later decision — streaming transport, batching, quotas, caching — and signals you understand what makes this different from designing a messaging app.

Step 2 — Back-of-the-Envelope Estimates

Rough numbers keep the design honest, and here they expose why GPUs dominate. Assume 10M daily active users, ~10 messages each → 100M messages/day ≈ ~1,200 messages/sec average, call it ~5,000/sec peak. Each answer averages, say, 500 output tokens.

Token throughput: 5,000 req/s × 500 tokens ≈ 2.5M output tokens/sec at peak. This, not QPS, is the real load number — it's what GPUs must produce.
Concurrency: if an answer takes ~5s to generate, peak concurrent generations ≈ 5,000 × 5 = ~25,000 in-flight streams. Each holds an open connection plus GPU memory for its KV cache.
GPU sizing: if one GPU sustains ~2,500 output tokens/sec (with batching), you need on the order of 2.5M / 2,500 ≈ 1,000 GPUs for peak — the headline cost. (Numbers are illustrative; real throughput depends on model size, batching, and hardware.)
Storage: 100M messages/day × ~1KB ≈ 100GB/day of conversation text — trivial compared to the compute. Storage is cheap here; compute is the scarce resource.

The takeaway to state explicitly: this is a compute-bound system, not a storage- or QPS-bound one. Most of the architecture exists to use those ~1,000 GPUs efficiently.

Step 3 — High-Level Architecture

Split the system into a stateless application tier (cheap to scale, handles auth, conversations, and the streaming connection) and a GPU inference tier (scarce, expensive, fronted by a gateway and a queue). Keeping them separate is the single most important structural decision.

request path — clients to GPUs

client ──SSE──▶ API gateway / LB
                    │
                    ▼
            chat service (stateless)
             ├─ auth, rate-limit check
             ├─ load conversation history  ◀── Conversation DB
             ├─ (optional) retrieve context ◀── Vector DB (RAG)
             ├─ moderate input
             └─ build prompt ──▶ LLM gateway
                                   ├─ route by model/tier
                                   ├─ token quota check
                                   ├─ prompt / semantic cache  ◀── Cache
                                   └─ enqueue ──▶ inference queue
                                                     │
                                                     ▼
                                          GPU inference workers (batched)
                                                     │  tokens stream back
                                   ◀────────────── token stream ──────────┘
            chat service relays tokens ──SSE──▶ client
                    │ on completion
                    └─ persist assistant message ──▶ Conversation DB

Key components: an API/chat service (stateless, holds the SSE connection and orchestrates a turn), a Conversation DB (durable message history), a Vector DB (optional, for retrieval), an LLM gateway (routing, quotas, caching, fallback), an inference queue, and the GPU inference workers that actually run the model. A separate moderation path and an offline eval/analytics pipeline sit alongside.

Step 4 — Streaming the Response

The defining UX choice. Because an answer takes seconds, you must not make users stare at a spinner — you stream tokens as they're generated. The standard transport is Server-Sent Events (SSE): a single long-lived HTTP response where the server pushes data: events. SSE fits perfectly because the stream is one-directional (server → client) for the duration of an answer, it's plain HTTP (works through proxies and load balancers), and it auto-reconnects.

Transport	Fit for token streaming
SSE	Ideal: one-way server push over plain HTTP, simple, proxy-friendly. The default choice.
WebSocket	Works, but bidirectional and heavier than needed; useful if you want rich duplex (live voice, interrupts) on the same channel.
Long polling	Fallback only — re-establishing a connection per chunk is wasteful for token-level streaming.

the chat service relays a token stream over SSE

async def stream_reply(conversation_id, user_msg):
    history = db.load_history(conversation_id)
    prompt  = build_prompt(history, user_msg)     # rebuild context every turn
    full = []
    async for token in gateway.generate(prompt):  # tokens arrive from a GPU worker
        full.append(token)
        yield f"data: {token}\n\n"             # SSE frame, flushed immediately
    db.save(conversation_id, user_msg, "".join(full))  # persist after stream ends
    yield "data: [DONE]\n\n"

Two consequences worth raising in an interview. First, the connection is stateful for the duration of one answer — the chat box holding it must stay alive ~5s per request, so a single box handles far fewer concurrent requests than a typical stateless API; size the fleet on concurrency, not QPS. Second, "stop generation" is a real feature: the client closes the SSE stream, and the chat service must propagate cancellation down to the GPU worker so it stops decoding and frees the slot — otherwise you keep paying for tokens nobody will read.

Step 5 — Conversations and State

This part is refreshingly normal. Conversations and messages are classic relational/document data with a simple schema; the volume (~100GB/day of text) is small. The interesting design choice is how history feeds back into the model.

conversation schema

conversations(id, user_id, title, created_at, updated_at)
messages(id, conversation_id, role, content, token_count, created_at)
              role ∈ {"user", "assistant", "system", "tool"}

# read pattern: latest N messages for a conversation, ordered by created_at
# partition/shard by conversation_id (or user_id) — reads are per-thread

Partition by conversation_id (or user_id): every read is "give me this thread's messages," so co-locating a conversation avoids cross-shard reads. The store can be Postgres or a wide-column/document DB — the access pattern is simple key-ordered reads, so almost anything works. Because the model has a finite context window, you don't blindly send the entire thread: you send the most recent messages that fit, and for very long conversations you summarize older turns into a compact running summary that's prepended — trading some fidelity for staying within the window (and within token cost).

Step 6 — The Inference Layer

This is the heart of the system and what makes it different from any CRUD design. GPU workers run the model and turn a prompt into a stream of tokens. Three ideas dominate how you make this efficient.

Continuous batching

GPUs are massively parallel, so running one request at a time wastes them. Inference servers (vLLM, TGI, TensorRT-LLM) use continuous (in-flight) batching: they merge many users' requests into one batch on the GPU and, crucially, swap finished sequences out and new ones in each decoding step rather than waiting for the whole batch to finish. This keeps the GPU saturated and is the single biggest throughput lever.

The KV cache

Generating each new token requires attention over all previous tokens. Recomputing that every step would be quadratic, so workers keep a KV cache — the attention keys/values for the prompt and tokens so far — in GPU memory. This makes decoding fast but means GPU memory per in-flight request grows with context length; the KV cache, not compute, is often what caps how many concurrent streams a GPU can hold. Prefix caching reuses the KV cache for a shared prompt prefix (e.g. the same system prompt across users), cutting the cost of the prompt-processing ("prefill") phase.

Continuous batching & the KV cache — finished sequences swap out and queued ones swap in each decoding step; each sequence's KV cache (GPU memory) grows with its context length

Prefill vs decode

A request has two phases with different cost profiles: prefill (process the whole prompt in parallel — compute-heavy, fast, determines TTFT) and decode (generate tokens one at a time — memory-bandwidth-bound, sequential, determines tokens/sec). Some systems even disaggregate them onto different GPU pools. You don't need that depth in an interview, but naming the two phases and that TTFT comes from prefill shows real understanding.

key point

Treat GPUs as a scarce, batched, memory-constrained pool, not as ordinary stateless workers. Throughput comes from continuous batching; concurrency is capped by KV-cache memory; latency (TTFT) comes from prefill. These three facts explain most of the rest of the design.

Step 7 — The LLM Gateway

Don't let the chat service call GPU workers directly. Put a gateway in between — the same pattern as an API gateway, specialized for models. It centralizes the cross-cutting concerns that every request needs.

Routing & model tiering: send simple/short queries to a small cheap model and hard ones to a large model; route by user tier (free vs paid), region, or A/B experiment.
Token quotas & rate limits: limits are best expressed in tokens per minute, not requests per minute, since cost scales with tokens. Enforce per-user and per-org budgets here.
Caching: an exact-match cache for identical prompts, and a semantic cache (embed the prompt, return a cached answer if a near-identical question was asked) for common questions.
Fallback & resilience: if a model/provider is down or saturated, retry, degrade to a smaller model, or queue — so a GPU shortage becomes slower service, not an outage.
Observability & accounting: log tokens, latency, cost, and model per request for billing and capacity planning.

gateway: route, quota, cache, then dispatch

def handle(req):
    if not quota.allow(req.user, est_tokens=req.size()):
        return error(429)                  # token budget exceeded
    if hit := cache.lookup(req.prompt):       # exact or semantic match
        return hit                          # 0 GPU cost
    model = router.pick(req)               # small vs large, by difficulty/tier
    try:
        return pool[model].enqueue(req)      # into the batched inference queue
    except Saturated:
        return pool[fallback].enqueue(req)  # degrade, don't fail

Step 8 — Memory, Context, and RAG

Users perceive the assistant as having memory, but the model is stateless — "memory" is something the harness rebuilds every turn by choosing what to put in the context window. There are three tiers worth distinguishing:

Working memory: the recent messages of the current thread, sent verbatim each turn (with older turns summarized when the window fills).
Long-term memory: durable facts about the user ("prefers Python", "is allergic to peanuts") stored separately and injected when relevant — often retrieved by embedding similarity.
Retrieved knowledge (RAG): for questions over private/up-to-date documents, embed the query, fetch the top-k relevant chunks from a vector DB, and prepend them to the prompt so the answer is grounded in real sources.

All three are the same move: fetch the right text and put it in the window before calling the model. The design implication is a retrieval step (embedding + vector search) on the request path, and a budget decision — how many tokens of history vs retrieved context vs room for the answer. (Retrieval at scale is its own design; this is the hook for a RAG system deep dive.)

Step 9 — Safety and Abuse

Unlike a CRUD app, this system can emit harmful content and is a magnet for abuse, so safety is a real subsystem. On the way in, a moderation check (a classifier, often a smaller model) screens for disallowed input and prompt-injection attempts. On the way out, generated tokens can be screened too — though streaming makes this tricky, since you've already sent earlier tokens; a common compromise is to moderate in small windows and cut the stream if it crosses a line. Around all of it sit per-user rate limits and token budgets (to stop scraping and cost-bombing), authentication, and audit logging. Treat the gateway and moderation as the policy-enforcement layer the model itself can't provide.

Step 10 — Cost and Latency Optimization

Because ~1,000 GPUs dominate the bill, cost optimization is a design feature, not an afterthought. The main levers:

Lever	What it buys
Continuous batching	Highest GPU utilization → most tokens per GPU. The biggest single win.
Model tiering / routing	Cheap small model for easy/short queries; reserve the large model for hard ones.
Prompt & semantic caching	Repeat or near-repeat questions cost ~0 GPU.
Prefix caching (KV reuse)	Shared system prompts / long docs aren't re-processed per request.
Quantization (e.g. 8-/4-bit)	More requests per GPU and lower memory, at a small quality cost.
Token caps & summarization	Bounding context and output length directly bounds cost per turn.

On latency, the metric users feel is TTFT (driven by queue wait + prefill), and after that inter-token latency (decode speed). You improve TTFT by keeping queues short (autoscale GPU pools, shed load to smaller models under pressure) and by prefix-caching long shared prompts so prefill is cheaper. Streaming hides total latency: as long as words start flowing in ~1s and keep coming faster than the user reads, a 6-second answer feels fine.

Step 11 — Bottlenecks and Tradeoffs

Close by naming where this design strains and what you'd trade — interviewers reward this more than another box on the diagram.

GPU capacity is the bottleneck and the budget. A traffic spike can't be absorbed by spinning up cheap boxes; GPUs are scarce and slow to provision. Mitigation: queue with backpressure, degrade to smaller models, and admission-control free users before paid ones.
Concurrency, not QPS, sizes the fleet. Long-lived streams mean each box/GPU holds requests for seconds; KV-cache memory caps concurrent streams per GPU.
Context window vs cost vs quality. More history/retrieved context can improve answers but costs more tokens and latency; summarization saves money but loses detail.
Quality is probabilistic. The same prompt can yield different or wrong answers, so you need evals, A/B testing of prompts/models, and feedback capture (thumbs up/down) as part of the system — not just uptime dashboards.
Multi-region is hard for GPUs. You want low TTFT globally, but GPU availability is regional and uneven; you often route to where capacity exists, trading a little latency for not failing.

takeaway

Strip away the LLM mystique and a chat assistant is a streaming front door over a scarce GPU pool. Get four things right and the rest follows: stream tokens (SSE), separate the stateless API tier from the GPU tier, rebuild context every turn within the window, and front inference with a gateway that does routing, token quotas, caching, and fallback. Everything else — batching, KV cache, moderation, cost levers — exists to serve those expensive tokens efficiently and safely.

🎯 interview hot-takes

Why is this different from a chat app? The response is generated token-by-token on a GPU — slow and costly — so you optimize for time-to-first-token and streaming, and the GPU tier dominates capacity and cost.
What transport for streaming? SSE — one-way server push over plain HTTP, simple and proxy-friendly. WebSocket only if you need rich bidirectional (voice/interrupts).
How does "memory" work? The model is stateless; you resend recent history (summarizing old turns) plus any retrieved facts each turn, within the context window.
What's the biggest throughput lever? Continuous batching on the GPU workers; concurrency is then capped by KV-cache memory, and TTFT by the prefill phase.
How do you control cost? Model tiering, prompt/semantic caching, prefix (KV) caching, quantization, token quotas, and summarizing long contexts.
How do you handle a capacity spike? Queue with backpressure and degrade to smaller models — turn a GPU shortage into slower service, not an outage; admission-control free users first.