A chat application sits at one of the most demanding intersections of system design: it must handle billions of users, hundreds of billions of messages per day, sub-second delivery latency, reliable offline queuing, and complex group semantics — all while maintaining end-to-end encryption and accurate presence information. The central challenge is that every message must be delivered to a specific device, not just a database record — which means the system must track live connection state across thousands of servers and route messages to exactly the right server in real time.

⚡ Quick Takeaways
  • WebSockets — persistent full-duplex connection eliminates polling lag; single connection per client enables true server-push.
  • WebSocket Manager (Redis KV) — maps user_id → websocket_server:port; enables any service to route a message to the right server across 65K+ ports.
  • Snowflake IDs — 64-bit: 41-bit ms timestamp + 10-bit machine ID + 12-bit sequence; distributed unique IDs with built-in time ordering for message display.
  • Online vs. offline delivery — online: direct WebSocket delivery; offline: Pending status + push notification (APNs/FCM) until reconnect.
  • Write fanout vs. read fanout — write fanout copies messages to each recipient's mailbox at send time; read fanout queries a shared message store at read time. Write fanout wins for small groups; read fanout wins for large ones.
  • Group size cap — bounded fan-out cost (WeChat 500-user ceiling); each group message requires one delivery per member.
  • Last seen sampling — poll presence ~1 min intervals; minor accuracy loss accepted to avoid write storm at 4B users.
tldr

Use WebSockets for full-duplex, persistent client-server connections. Route messages through a WebSocket Manager (key-value store) that maps user IDs to servers. Generate message IDs with Snowflake (time-ordered, distributed, 64-bit). Queue messages as "pending" for offline users and deliver via push notification. Use write fanout for 1:1 and small groups; switch to read fanout (shared message log) for large groups. Cap group size to bound worst-case fan-out cost.

High-level chat app architecture
High-level chat app architecture

Step 1 — Clarify Requirements

Chat applications range from simple 1:1 SMS-style messaging to real-time collaborative workspaces. Scope the problem before drawing any diagrams.

Functional

Non-Functional

Step 2 — Capacity Estimation

WhatsApp / WeChat scale: 4 billion users, with roughly 1 billion daily active (DAU). Each DAU sends ~100 messages/day.

Message Volume

WebSocket Connections

Storage

Step 3 — API Design

HTTP + WS
# Auth
POST /api/auth/login
     { phone_number, otp }
     → { access_token, refresh_token }

# WebSocket connection (upgrade)
GET  /ws/connect
     Authorization: Bearer {access_token}
     → 101 Switching Protocols
        Registers: user_id → ws_server:port in WebSocket Manager

# Send message (over WS frame, not REST)
-- Client sends WS frame:
{ type: "message", idempotency_key: "uuid",
  receiver_id: 456, content: "hello", media_url: null }
-- Server replies:
{ type: "ack", message_id: 7234567890123, status: "sent" }

# REST: message history (pagination by cursor)
GET  /api/conversations/{conv_id}/messages?before_id=7234567890123&limit=50

# REST: create group
POST /api/groups
     { name, member_ids: [1, 2, 3 ...] }

# REST: upload media (returns CDN URL)
POST /api/media/upload
     Content-Type: multipart/form-data
     → { media_url, media_id }

Message sending happens over the WebSocket connection, not via REST, because the connection is already open and adding REST round-trips would waste the persistent connection. The idempotency_key is client-generated (UUID) and prevents duplicate message creation on retry — if the client sends the same key twice (e.g., because the ACK was lost), the server returns the already-created message ID without creating a second message.

Step 4 — Protocol Selection: Why WebSockets

The central challenge of chat is server-to-client message push. Three approaches exist:

ProtocolProblem
Short pollingLatency determined by poll interval; frequent polls waste resources, infrequent polls feel laggy
Long pollingRepeated connection setup/teardown overhead; multiple simultaneous requests can cause message ordering issues
WebSocketsPersistent full-duplex connection; eliminates polling overhead; supports true real-time bidirectional messaging
Server-Sent Events (SSE)Server-push only; client cannot send over the same connection; still requires a separate HTTP channel for sends

WebSockets win because they maintain a single persistent connection per client, enabling the server to push messages the instant they arrive with no polling lag and no connection re-establishment overhead. The connection stays open for the lifetime of the user's session; a heartbeat ping/pong every 30 seconds detects dead connections without requiring the client to poll.

Step 5 — Core Services

WebSocket connection management and message routing
WebSocket connection management and message routing

Step 6 — Data Model

CQL / schema
-- Messages table: partitioned by conversation for efficient history reads
CREATE TABLE messages (
  conversation_id BIGINT,
  message_id      BIGINT,    -- Snowflake ID (time-ordered)
  sender_id       BIGINT,
  content         TEXT,     -- encrypted at rest
  media_url       TEXT,     -- null for text-only
  status          ENUM,     -- Sent | Delivered | Read
  idempotency_key UUID,    -- dedup on retry
  created_at      TIMESTAMP,
  PRIMARY KEY     ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- User inbox: each user's list of conversations with unread counts
CREATE TABLE user_inbox (
  user_id         BIGINT,
  conversation_id BIGINT,
  last_message_id BIGINT,
  unread_count    INT,
  updated_at      TIMESTAMP,
  PRIMARY KEY     (user_id, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- Group membership
CREATE TABLE group_members (
  group_id        BIGINT,
  user_id         BIGINT,
  role            TEXT,     -- member | admin
  joined_at       TIMESTAMP,
  PRIMARY KEY     (group_id, user_id)
);

-- Delivery receipts: per-message per-recipient status
CREATE TABLE receipts (
  message_id      BIGINT,
  recipient_id    BIGINT,
  status          TEXT,     -- Delivered | Read
  updated_at      TIMESTAMP,
  PRIMARY KEY     (message_id, recipient_id)
);

The messages table is partitioned by conversation_id so the most common query — "load the last 50 messages in this conversation" — is a single-partition scan. The clustering key message_id (a Snowflake ID) keeps messages ordered by time within the partition. Cassandra is the right storage here because the write pattern (high throughput, append-only) and read pattern (range scan within a partition, by time) map perfectly to Cassandra's strengths.

Message object schema
Message object schema

Step 7 — Message ID Generation: Snowflake

Message IDs must be globally unique, time-ordered (so messages sort correctly in the UI without a separate timestamp sort), and generated without coordination (so there is no single-point-of-failure ID generator). Snowflake IDs satisfy all three requirements:

bit layout
Snowflake 64-bit ID layout:
┌─────────────────────────────┬──────────────┬──────────────────┐
│   41 bits: ms timestamp     │ 10-bit nodeID│  12-bit sequence │
│   (epoch offset, ~69 years) │  (1024 nodes)│  (4096/ms/node)  │
└─────────────────────────────┴──────────────┴──────────────────┘

Max throughput: 1024 nodes × 4096 IDs/ms = 4.19 million IDs/ms
Time ordering: IDs generated later are always numerically greater
No coordination: each node generates independently

The 41-bit timestamp field (milliseconds since a custom epoch) means IDs are sortable without a separate created_at field — comparing two Snowflake IDs directly tells you which message came first. The 10-bit node ID (configured per Message Service instance at startup) guarantees uniqueness across the fleet without any global lock or centralized ID service.

Step 8 — Message Delivery: Online vs. Offline

When a message is sent, the Message Service checks whether the recipient has an active WebSocket connection by querying the WebSocket Manager. Two paths diverge:

flow
Sender sends message over WebSocket
  → WebSocket Gateway receives WS frame
  → Message Service:
      1. Dedup check: idempotency_key in Redis (TTL 24h)
      2. Generate Snowflake message_id
      3. Persist to Cassandra (status: Sent)
      4. Send ACK to sender: { message_id, status: "sent" }
      5. Look up recipient in WebSocket Manager

  [Recipient is ONLINE]
    → Forward to recipient's WebSocket server (via internal RPC)
    → WebSocket server pushes to client WS connection
    → Recipient client ACKs receipt
    → Message Service updates status: Delivered
    → Push Delivered receipt to sender via WS

  [Recipient is OFFLINE]
    → Message stays in Cassandra (status: Pending)
    → Notification Service sends push notification (APNs/FCM)
    → On recipient reconnect:
        Client sends { last_seen_message_id: 7234567890000 }
        Server returns all messages with id > last_seen
        Server updates status: Delivered
        Sender receives Delivered receipt

At-Least-Once Delivery and Deduplication

The system guarantees at-least-once delivery at the network level — a message may be attempted multiple times on retries. The application layer must deduplicate to achieve the user-visible guarantee of exactly-once. The deduplication key is the client-supplied idempotency_key (UUID), stored in Redis with a 24-hour TTL. On retry, the Message Service checks Redis before writing to Cassandra; if the key exists, it returns the existing message ID without creating a duplicate.

Step 9 — Fan-out: Write Fanout vs. Read Fanout

Fan-out is the architectural decision that determines how messages get from a sender to all recipients. There are two fundamentally different approaches, each with different tradeoffs:

ApproachHow it worksWrite costRead costBest for
Write fanout (push model)At send time, copy the message into each recipient's mailbox / inboxO(N) writes where N = recipientsO(1) per-user inbox scanSmall groups, DMs; N is bounded
Read fanout (pull model)Store message once in a shared log; each recipient fetches on readO(1) writeO(members) per read (each reader scans the shared log)Large groups, public channels; write cost would explode

Hybrid Strategy

Production systems (WhatsApp, Slack, Telegram) use a hybrid:

the fanout tradeoff in one sentence

Write fanout trades write amplification for fast reads; read fanout trades write simplicity for read complexity. The crossover point is roughly 100–200 members: below that, write fanout wins; above that, read fanout avoids an O(N) write storm.

Step 10 — Group Chat and Ordering Guarantees

Group messages replace receiver_id with a conversation_id and must be delivered to all group members simultaneously. This fan-out is the main scalability challenge in group chat: a message to a 500-person group requires 500 individual delivery attempts. The design caps group size (WeChat's 500-user ceiling is the reference) to bound the maximum fan-out cost and delivery lag.

Message Ordering Within a Conversation

Two users sending messages simultaneously can both "win" and produce ordering conflicts. The solution is to use the Snowflake ID as the canonical ordering key — messages are always displayed sorted by message_id ascending. Since Snowflake IDs incorporate a millisecond timestamp, two messages sent within the same millisecond from different nodes are differentiated by the node ID bits, ensuring a deterministic total order even under concurrent sends.

For conversations where strict causal ordering matters (e.g., "reply to message X"), a reply_to_message_id field is included in the message schema. The client UI renders the thread view by following these links, regardless of the arrival order of the messages.

Step 11 — Read Receipts

The iconic double-tick / blue-tick indicators (sent → delivered → read) are a read receipt system. Implementation details:

For group messages, the sender sees "delivered to all" only when every member has a Delivered receipt in the receipts table. This is O(members) reads per "delivered to all" check — the reason WhatsApp shows a per-member drill-down list rather than a single aggregate indicator for groups.

Batching Receipt Writes

At 100B messages/day, individual receipt writes would add another 200–300B writes/day to the database (two receipts per message). The Notification Service batches receipt updates: delivery ACKs are accumulated in Redis for 500 ms, then flushed to Cassandra in a single batch write. This reduces write amplification by ~50–100×.

Step 12 — Presence and Last Seen

Tracking "is this user online?" and "when did they last use the app?" sounds simple but is one of the hardest scalability problems in a chat system at 4 billion users.

Naive Approach and Why It Fails

The simplest implementation would have each user write their current timestamp to a database every second. At 1B DAU × 1 write/sec = 1 billion writes/sec — this is obviously impossible.

Presence via Heartbeat Sampling

The Presence Service uses a sampled approach:

Presence Subscription (Contacts' Online Status)

Users want to know which of their contacts are online. The naive approach (query Presence Service for each contact on every page load) generates O(contacts) requests per user open. The production approach uses a pub-sub model:

Step 13 — Media Storage

Images, audio, and video are not stored inline in message objects. Instead, the message record holds a CDN URL, and the actual binary content lives in object storage. The upload flow:

  1. Client calls POST /api/media/upload.
  2. Media Service generates a pre-signed S3 upload URL and returns it along with a media_id.
  3. Client uploads directly to S3 via the pre-signed URL (bypasses application servers entirely).
  4. S3 triggers a Lambda/event that notifies the Media Service of upload completion.
  5. Media Service records the media_id → CDN URL mapping and returns the CDN URL to the client.
  6. Client includes the CDN URL in the message payload.

Media deduplication: before uploading, the client computes a SHA-256 hash of the file and sends it to the Media Service. If the hash already exists in the database (another user already uploaded the same file), the server returns the existing CDN URL without a new upload — zero bytes transferred for duplicates. This is especially effective for viral media (memes, voice notes) where the same file is shared by thousands of users.

Step 14 — Scaling and Fault Tolerance

WebSocket Server Scaling

Cassandra Sharding for Messages

Message Queue and Replay

The Message Service publishes every incoming message to Kafka before persisting to Cassandra. This serves two purposes: it decouples the write path from downstream consumers (notification service, analytics, search indexer), and it enables replay — if a consumer falls behind or crashes, it can replay from its last committed Kafka offset without losing messages.

Key Tradeoffs

DecisionChoiceTrade-off accepted
Real-time deliveryWebSocket (persistent)Stateful servers; harder horizontal scaling than stateless HTTP
Message ID generationSnowflake (distributed)Node IDs must be assigned carefully; clock skew can cause rare ordering issues
Fan-out strategyHybrid (write <100, read >100)Complexity of two code paths; threshold tuning required per platform
Message storageCassandra (partitioned by conv)No cross-conversation queries; analytics must use a separate OLAP store
Presence trackingSampled heartbeat (60s)Online status is slightly stale; "last seen 5 minutes ago" may be off by up to 60s
Delivery guaranteeAt-least-once + client dedupClient must handle and discard duplicates (idempotency_key check)
takeaway

Chat design is mostly about three hard problems: real-time delivery (WebSockets + WebSocket Manager for routing), offline queuing (Pending status + push notifications + catch-up on reconnect), and fan-out strategy (write fanout for small groups, read fanout for large ones). Snowflake IDs solve distributed unique ID generation with built-in time ordering. Group size caps are an explicit scalability lever. Presence sampling trades minor accuracy for write throughput — the classic availability-consistency tradeoff applied to a non-obvious dimension.

🎯 interview hot-takes

How do you route a message to the correct server when millions of WebSocket connections are spread across many servers? A WebSocket Manager (backed by Redis) stores user_id → server:port; any service can look up the target server in O(1) and forward the message via internal RPC.
Why cap group chat size? A group message fans out to every member — a 1,000-person group means 1,000 individual deliveries. Capping size bounds worst-case fan-out latency and prevents a single message from saturating the delivery pipeline.
What happens if a message is sent to an offline user? The message is stored with status Pending; a push notification (APNs/FCM) wakes the recipient's device; on reconnect the client sends its last-seen ID, the server delivers all missed messages, and status is updated to Delivered.
Write fanout vs. read fanout — when do you use each? Write fanout copies to each inbox at send time (fast reads, O(N) writes) — use for DMs and small groups. Read fanout stores once in a shared log (O(1) write, O(members) reads) — use for large groups where write amplification would be prohibitive.
How do you prevent duplicate messages on retry? Client-generated idempotency keys (UUID) stored in Redis with a 24-hour TTL. The Message Service checks the key before writing; if it already exists, returns the existing message ID without creating a duplicate.

← previous
Design a Video Streaming Platform