Design a Distributed Key-Value Store — System Design

A distributed key-value store offers the simplest possible interface — get(key) and put(key, value) — yet building one that stays available and fast across thousands of nodes is a canonical hard problem. This is the design behind Amazon Dynamo, Cassandra, and Riak. The interesting decisions are all about distribution: how to spread keys across nodes without reshuffling everything when the cluster changes, how to replicate for durability and availability, how to keep replicas consistent (or decide not to), and how to keep serving through node failures with no single leader. It's the most direct application of the distributed-data ideas in our DDIA notes on replication and partitioning.

⚡ Quick Takeaways

Consistent hashing spreads keys over a ring so adding/removing a node moves only a small slice of keys — never the whole keyspace (no hash mod N).
Virtual nodes give each physical node many points on the ring, smoothing load and making rebalancing even.
Replicate to N nodes (the next N on the ring = preference list); a coordinator handles each request.
Tunable consistency via quorums — if W + R > N the read and write sets overlap, so reads see the latest write. Tune W/R for read- vs write-heavy loads.
Leaderless = always writable — concurrent writes create conflicting versions reconciled with vector clocks (siblings) or last-write-wins.
Failures are routine — hinted handoff covers temporary outages, Merkle-tree anti-entropy repairs permanent ones, and gossip spreads membership.
It's an AP system — under a partition it favors availability over strong consistency (CAP).

tldr

Partition keys with consistent hashing + virtual nodes. Replicate each key to N nodes (the preference list). Make consistency tunable with quorums (W + R > N). Because there's no leader, concurrent writes are reconciled with vector clocks (or LWW). Survive failures with hinted handoff (temporary) and Merkle-tree anti-entropy (permanent), and track membership with a gossip protocol. The result is a highly available, horizontally scalable AP store.

the hash ring (N = 3 replication)

          0 / 2^160
              ▲
       NodeD  │  NodeA
          ╲   │   ╱
           ╲  │  ╱      key "user:42" hashes here ─┐
            ╲ │ ╱                                  ▼
   ◀─────────●─────────▶   walk clockwise → store on
            ╱ │ ╲          NodeA, NodeB, NodeC  (N=3)
           ╱  │  ╲         = the "preference list"
          ╱   │   ╲
       NodeC  │  NodeB
              ▼

Step 1 — Clarify Requirements

Functional: get(key), put(key, value), delete(key); values are opaque blobs up to a modest size (say ≤ 1 MB); no complex queries or joins. Non-functional — and these are the whole point: high availability ("always writable", even during failures), low latency (single-digit ms), horizontal scalability (add nodes to grow), tunable consistency (let the caller trade consistency for latency), and partition tolerance. We explicitly accept eventual consistency as the default — that's the deliberate trade Dynamo makes for availability.

Step 2 — Capacity Estimation

Suppose 100 TB of data, average value 10 KB → ~10 billion keys. At 1M operations/sec with a 9:1 read:write ratio, that's ~900K reads and ~100K writes per second. With commodity nodes holding ~2 TB each and N=3 replication, you need on the order of (100 TB × 3) / 2 TB ≈ 150 nodes, plus headroom. These numbers justify the need for automatic partitioning, replication, and decentralized failure handling — no single coordinator could keep up, and node failures at this fleet size are a daily event.

Step 3 — API Design

API with consistency knobs

get(key, consistency = QUORUM)   → (value(s), version)
put(key, value, version, consistency = QUORUM)   → ok
delete(key, version)

# consistency ∈ {ONE, QUORUM, ALL}  — maps to how many
# replicas must respond (R for reads, W for writes)

The version passed to put is the context the client got from a prior get — it lets the store detect concurrent modifications. A read may return multiple versions when there's an unresolved conflict (more on that below).

Step 4 — Partitioning with Consistent Hashing

Naively assigning a key to hash(key) mod N nodes is a trap: changing N remaps almost every key (see our partitioning notes). Consistent hashing fixes this. Hash the key onto a fixed ring (e.g. 0…2¹⁶⁰); hash each node onto the same ring; a key belongs to the first node encountered walking clockwise. Now adding or removing a node only moves the keys between it and its neighbor — a small slice. To avoid uneven distribution (and uneven load when a node leaves), each physical node is mapped to many virtual nodes scattered around the ring, which smooths both data and load and makes a departing node's share spread across many survivors instead of dumping on one.

Step 5 — Replication

For durability and availability, each key is stored on N nodes: the node that owns it plus the next N−1 distinct physical nodes clockwise on the ring. This ordered list is the key's preference list. Any node can act as the coordinator for a request (typically the first in the preference list); it forwards the operation to the replicas and collects responses. Because every replica can serve reads and accept writes, there is no leader and thus no failover — the system stays writable as long as enough replicas are reachable.

Step 6 — Tunable Consistency: Quorums

Consistency is a dial, not a fixed setting. With N replicas, a write waits for W acknowledgements and a read queries R replicas. The key inequality:

quorum tuning (N = 3)

W + R > N   ⇒   read and write quorums overlap ⇒ reads see latest

  W=2, R=2, N=3 :  2+2 > 3  ✓  balanced (common default)
  W=1, R=3      :  fast writes, slower/consistent reads
  W=3, R=1      :  fast reads, durable but slow writes
  W=1, R=1      :  fastest, but NO overlap → may read stale

Setting	Behaviour	Use when
W=1, R=1	Lowest latency, eventually consistent	Caches, metrics, tolerate staleness
W=2, R=2 (N=3)	Strong-ish (quorum overlap)	General-purpose default
W=N or R=N	One side fully consistent, low availability	Read-mostly or write-rarely critical data

See our DDIA replication notes for the full quorum derivation. A subtlety: under a network partition, a sloppy quorum lets W/R be satisfied by other reachable nodes (not strictly the top-N), keeping the system available at the cost of weaker overlap guarantees — paired with hinted handoff below.

Step 7 — Data Versioning and Conflict Resolution

With no leader and concurrent writes allowed, two clients can update the same key on different replicas simultaneously, producing divergent versions. The store needs a way to tell whether one version descends from another (safe to overwrite) or whether they're truly concurrent (a genuine conflict). Vector clocks capture this: each value carries a list of (node, counter) pairs. If clock A is entirely ≤ clock B, B supersedes A; otherwise they're concurrent and both are kept as siblings.

vector clocks detect concurrency

write on A:  v = [A:1]
write on A:  v = [A:2]              # A:2 descends from A:1 → overwrite

now two clients branch:
  client X (saw A:2):  [A:2, B:1]   # via node B
  client Y (saw A:2):  [A:2, C:1]   # via node C
  → neither ≤ the other → CONCURRENT → keep both as siblings
  → next read returns both; application (or LWW) merges them

Reconciliation is either last-write-wins (simple, but silently drops data — risky with skewed clocks, see unreliable clocks) or application-level merge (e.g. Dynamo's shopping cart unions the carts so no add is lost). The store surfaces siblings; the app decides.

Step 8 — Handling Temporary Failures: Hinted Handoff

If a replica in the preference list is briefly down, the coordinator doesn't block. It writes to the next healthy node instead, tagged with a hint recording the intended recipient. When the downed node recovers, the holder hands the data back and deletes the hint. This hinted handoff keeps writes succeeding through transient outages — a core reason the system is "always writable."

Step 9 — Handling Permanent Failures: Anti-Entropy with Merkle Trees

When a node is gone for good (or was unreachable long enough that hints expired), replicas drift apart. To resync efficiently without comparing every key, each node keeps a Merkle tree (a tree of hashes) over its key ranges. Two replicas compare trees top-down: if the root hashes match, they're identical and no data moves; if they differ, they descend only into the differing branches, transferring just the keys that actually diverge. This anti-entropy repair minimizes the data exchanged.

Step 10 — Membership and Failure Detection: Gossip

A decentralized system can't rely on a central registry of who's alive. Instead nodes use a gossip protocol: each node periodically exchanges its view of cluster membership and ring assignments with a few random peers, so knowledge of joins, departures, and the partitioning scheme spreads epidemically across the cluster within seconds. Failure detection is similarly decentralized (e.g. a node is suspected dead if peers stop hearing from it), avoiding any single point of failure or coordination bottleneck.

Step 11 — The Read/Write Path

Putting it together for a put: a client sends the request to any node, which becomes the coordinator; it computes the preference list, sends the write (with an updated vector clock) to the N replicas, and returns success once W acknowledge. A get works symmetrically: the coordinator queries the preference list, waits for R responses, and if their versions disagree it returns all siblings (and triggers read repair — pushing the newest version back to stale replicas). The local storage on each node is typically an LSM-tree engine (see our storage & retrieval notes) optimized for high write throughput.

Step 12 — Scaling and Hot Keys

Throughput and storage grow by adding nodes; consistent hashing means each new node lifts a fair slice of load off its neighbors with minimal data movement, and virtual nodes keep the distribution even. The remaining danger is a hot key — a single key receiving disproportionate traffic — which no amount of partitioning fixes because it lives on one set of replicas. Mitigations: cache hot keys at the client/edge, or split a logical hot key into several physical keys with a suffix and merge on read.

Step 13 — Key Tradeoffs

Availability vs consistency (CAP). This is an AP design: under a partition it stays available and serves possibly-stale data rather than rejecting requests. A CP store (e.g. one built on consensus) would do the opposite — see consistency & consensus.
Quorum tuning. W/R let each workload pick its point on the latency-vs-consistency curve; there's no globally right answer.
Conflict resolution. LWW is simple but loses writes; vector clocks + app merge preserve data at the cost of client complexity and sibling handling.
Leaderless simplicity vs reasoning difficulty. No leader means no failover and great availability, but eventual consistency pushes real complexity onto application developers.

takeaway

A Dynamo-style store is the textbook AP system: consistent hashing for partitioning, N-way replication with quorum-tunable consistency, and decentralized failure handling (hinted handoff, Merkle anti-entropy, gossip). Every choice trades strong consistency for availability and scale — so the right interview move is to state that trade explicitly and tie each mechanism back to it.

🎯 interview hot-takes

Why consistent hashing over hash mod N? Adding/removing a node remaps only a small slice of keys instead of nearly all of them; virtual nodes keep the spread even.
What does W + R > N guarantee? The read and write quorums overlap by at least one node, so a read sees the most recent write — the tunable-consistency knob.
How are concurrent writes handled without a leader? Vector clocks detect whether versions are causally ordered or concurrent; concurrent siblings are kept and merged (app logic or LWW).
Temporary vs permanent failure recovery? Hinted handoff covers transient outages (writes go to a stand-in with a hint); Merkle-tree anti-entropy repairs long-term divergence by transferring only differing keys.
Is it CP or AP? AP — it stays available under partitions and is eventually consistent.