Chapter 5 opens Part II of DDIA, "Distributed Data." Replication means keeping a copy of the same data on multiple machines connected via a network. We do it for three reasons: to keep data geographically close to users (lower latency), to keep the system working even when some parts fail (availability), and to scale out the number of machines serving read queries (throughput). If the data never changed, replication would be trivial — copy it once and you're done. The entire difficulty, and the whole chapter, is about handling changes to replicated data. Kleppmann frames everything around three algorithms: single-leader, multi-leader, and leaderless replication.

⚡ Quick Takeaways
  • Three approaches — single-leader (one node accepts writes), multi-leader (several do), and leaderless (clients write to many nodes directly). Almost every distributed datastore is a variation on these.
  • Sync vs async is a durability/availability trade — synchronous replication guarantees an up-to-date copy but blocks if a follower is down; asynchronous is fast but can lose recently-acknowledged writes on failover.
  • Failover is dangerous — promoting a new leader risks lost writes, split brain (two leaders), and bad timeout tuning. Automatic failover is genuinely hard to get right.
  • Replication lag breaks intuitive guarantees — you need read-after-write, monotonic reads, and consistent-prefix reads to paper over an asynchronous follower that has fallen behind.
  • Multi-leader's core problem is write conflicts — the same record edited on two leaders must be reconciled (LWW, CRDTs, custom merge).
  • Leaderless uses quorums — if w + r > n, read and write sets overlap, so a read sees the latest write. Read repair and anti-entropy heal stale replicas.
tldr

Single-leader replication is simple and the default (most relational DBs): writes go to a leader, which streams a change log to followers. Multi-leader adds write availability across datacenters at the cost of conflict resolution. Leaderless (Dynamo-style) drops the leader entirely and relies on quorums plus repair. The recurring theme is the tension between consistency and availability when the network is unreliable and replication is asynchronous.

Single-Leader Replication

The most common approach, used by PostgreSQL, MySQL, SQL Server, MongoDB, and many others, is leader-based replication (also called active/passive or master/slave). One replica is designated the leader. All writes must go to the leader, which first writes the new data to its own local storage. The other replicas are followers: whenever the leader writes, it also sends the change to every follower as part of a replication log or change stream. Each follower applies the changes in the same order the leader processed them. Reads can be served by either the leader or any follower, but writes are leader-only.
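The write path above can be sketched in a few lines. This is a minimal in-memory model, not how any particular database implements it; the class names `Leader` and `Follower` and the `(key, value)` log format are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Read-only replica that applies the leader's changes in log order."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, index: int, key: str, value: str) -> None:
        # Refuse gaps or reordering: entry N may only follow entry N-1.
        if index != self.applied:
            raise ValueError(f"expected entry {self.applied}, got {index}")
        self.data[key] = value
        self.applied += 1


@dataclass
class Leader:
    """The only node that accepts writes (hypothetical sketch)."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered (key, value) changes
    followers: list = field(default_factory=list)

    def write(self, key: str, value: str) -> None:
        self.data[key] = value            # 1. write to local storage
        self.log.append((key, value))     # 2. append to the replication log
        index = len(self.log) - 1
        for follower in self.followers:   # 3. send the change to every follower
            follower.apply(index, key, value)


f1, f2 = Follower(), Follower()
leader = Leader(followers=[f1, f2])
leader.write("user:1", "alice")
leader.write("user:1", "alicia")
```

Because every follower applies entries in the leader's order, all three nodes end up with the same value for `user:1`.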

Synchronous vs Asynchronous Replication

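A crucial configuration knob is whether replication happens synchronously or asynchronously. With synchronous replication, the leader waits for the follower to confirm the write before reporting success, so the follower is guaranteed an up-to-date copy, but writes block if that follower is down. With asynchronous replication, the leader does not wait, which is fast but can lose recently acknowledged writes on failover.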

Making every follower synchronous is impractical — a single slow node would halt the whole system. A common compromise is semi-synchronous: one follower is synchronous, and if it gets too slow, another follower is made synchronous in its place. This guarantees at least two nodes have the latest data. In practice, many leader-based systems run fully asynchronous, accepting the risk of losing recent writes in exchange for performance and availability.
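One way to picture the semi-synchronous compromise is as two small decisions: which follower is currently the synchronous one, and whether a write may be acknowledged yet. The sketch below is illustrative only; the function names, the latency-based rule, and the timeout value are assumptions, and real systems use their own criteria for swapping the synchronous follower.

```python
def pick_sync_follower(latencies_ms, current, timeout_ms):
    """Keep `current` as the synchronous follower unless it is too slow.

    latencies_ms maps follower name -> most recent acknowledgement latency.
    """
    if latencies_ms[current] <= timeout_ms:
        return current
    # Demote the slow follower to asynchronous and promote the fastest one.
    return min(latencies_ms, key=latencies_ms.get)


def write_is_acknowledged(acks, sync_follower):
    """A write is confirmed once the synchronous follower has it.

    acks is the set of follower names that have confirmed the write.
    The leader already holds the write locally; asynchronous followers
    are not waited for.
    """
    return sync_follower in acks


sync = pick_sync_follower({"a": 20, "b": 900, "c": 35}, current="b", timeout_ms=200)
```

Here follower `b` has become too slow, so `a` takes over as the synchronous follower and writes are acknowledged as soon as `a` confirms them.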

Setting Up New Followers

To add a follower without downtime, you can't just copy files while writes are happening. The standard procedure: take a consistent snapshot of the leader's database (most DBs support this without locking), copy it to the new follower, then have the follower connect to the leader and request all changes that happened since the snapshot — identified by an exact position in the replication log (PostgreSQL calls it a log sequence number, MySQL a binlog coordinate). Once the follower has caught up, it processes changes live.
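The snapshot-then-catch-up procedure can be sketched as follows. This is a toy model under stated assumptions: the log position is simply a list index (standing in for a log sequence number or binlog coordinate), and the class and method names are hypothetical.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) changes; list index = log position

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a consistent copy of the data plus the log position it reflects."""
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data
        self.position = position  # next log position to apply

    def catch_up(self, leader):
        # Apply, in order, every change made after the snapshot was taken.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
data, pos = leader.snapshot()   # snapshot taken at log position 1
leader.write("b", 2)            # writes continue while the copy is in flight
leader.write("a", 3)
follower = Follower(data, pos)
follower.catch_up(leader)
```

A follower that restarts after a crash could reuse the same `catch_up` call, starting from the last position it had applied.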

Handling Node Outages

Any node can go down, and a goal of replication is to keep the system running through individual failures. How you recover depends on which node failed.

Follower Failure: Catch-Up Recovery

This is the easy case. Each follower keeps a log of the changes it has received. After a crash or a network blip, it knows the last transaction it processed; it reconnects to the leader and requests every change since then, applies them, and is back in sync. No human intervention needed.

Leader Failure: Failover

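This is the hard case. If the leader fails, one of the followers must be promoted to be the new leader, clients must be reconfigured to send writes to it, and the other followers must start consuming changes from it. This process, called failover, can be manual or automatic, and it is riddled with pitfalls: writes the old leader had not yet replicated can be lost, two nodes can both believe they are the leader (split brain), a badly tuned timeout causes either slow recovery or unnecessary failovers, and discarded writes can interact dangerously with external systems.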

why this matters

There are no easy answers to any of these failover problems, which is why many operations teams prefer to perform failover manually, even though that means a longer outage. The deeper reason these issues are so thorny is that networks and clocks are unreliable and that getting nodes to agree on a single truth is difficult; those are the subjects of Chapters 8 and 9.

Implementation of Replication Logs

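How does the leader actually communicate changes to followers? There are several methods, each with trade-offs. The book covers statement-based replication, write-ahead log (WAL) shipping, logical (row-based) log replication, and trigger-based replication.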

Problems with Replication Lag

Reading from asynchronous followers lets you scale reads cheaply, but a follower may lag behind the leader. If you read from a lagging follower, you may see stale data — the system is eventually consistent, but "eventually" is deliberately vague and could be seconds or minutes under load. Three specific anomalies, and the guarantees that fix them, come up constantly in interviews:

Read-After-Write Consistency

A user submits a change, then reloads — and if their read hits a follower that hasn't received the write yet, their own update appears to have vanished. Read-after-write (or read-your-writes) consistency guarantees a user always sees their own writes. Techniques: read things the user may have modified from the leader; track the timestamp of the user's last write and only read from a follower that's caught up to it.
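The second technique can be sketched as a routing decision. The example below assumes the "timestamp" is a logical one, namely the replication log position of the user's last write, and that each node reports how far it has applied the log; the function name and the tuple layout are hypothetical.

```python
def choose_replica(user_last_write_pos, leader, followers):
    """Pick a node that has applied at least the user's last write.

    `leader` and each entry in `followers` are (name, applied_position)
    tuples. Followers are preferred so the leader is not overloaded;
    the leader is the fallback because it always has the write.
    """
    for name, applied in followers:
        if applied >= user_last_write_pos:
            return name
    return leader[0]


leader = ("leader", 9)
followers = [("f1", 5), ("f2", 8)]
target = choose_replica(7, leader, followers)
```

With the user's last write at position 7, follower `f1` is too far behind, so the read is routed to `f2`; if no follower had caught up, the read would fall back to the leader.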

Monotonic Reads

If a user makes several reads from different replicas, they might see a value, then a moment later see an older value — time appearing to move backwards. Monotonic reads guarantees that once you've read newer data, you won't later read older data. A simple implementation: each user always reads from the same replica (e.g. chosen by a hash of their user ID).
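A minimal sketch of the hash-based routing mentioned above, assuming a fixed list of replica names. The function name is hypothetical, and SHA-256 is used only because Python's built-in `hash()` for strings is randomized per process and would not give stable routing across restarts.

```python
import hashlib


def replica_for_user(user_id, replicas):
    """Map a user ID to the same replica every time (stable across processes)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-0", "replica-1", "replica-2"]
first = replica_for_user("user-42", replicas)
second = replica_for_user("user-42", replicas)
```

Both calls return the same replica, so the user never bounces to a replica that is further behind. If that replica fails, the user's reads have to be rerouted, and the guarantee can be briefly violated at that moment.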

Consistent Prefix Reads

If a sequence of writes happens in a certain order, anyone reading them should see them in that same order. Violations are jarring — you might see an answer to a question before the question itself. Consistent prefix reads guarantees causally-related writes are read in order. This is especially a problem in partitioned (sharded) databases where different partitions replicate at different speeds.

key takeaway

"Eventual consistency" is a weak guarantee that pushes a lot of complexity onto application developers. Rather than hoping the lag stays small and being surprised when it doesn't, it's better to think in terms of these specific guarantees — and to recognize that providing stronger guarantees is exactly what transactions and consensus (later chapters) are for.

Multi-Leader Replication

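Single-leader has one obvious downside: all writes funnel through one node. Multi-leader replication allows more than one node to accept writes; each leader simultaneously acts as a follower to the other leaders. The main use cases in the book are multi-datacenter operation (a leader in each datacenter), clients that must keep working offline, and collaborative editing.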

Write Conflicts

The big problem multi-leader introduces is write conflicts: the same data can be modified on two different leaders concurrently, and the conflict isn't detected until the changes are asynchronously merged. Approaches to reconciling them include last write wins (LWW), CRDTs, and custom merge logic in the application.

Replication Topologies

With more than two leaders, you must decide the path writes take between them. All-to-all (every leader sends to every other) is the most robust, but writes can arrive out of causal order if some links are faster than others. Circular and star topologies reduce traffic, but a single failed node can interrupt replication between the remaining nodes until the topology is reconfigured. To prevent infinite replication loops, each write is tagged with the identifiers of the nodes it has already passed through, and a node ignores a write that carries its own identifier.
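The loop-prevention tagging can be illustrated with a three-node ring. This is a toy sketch: the `Node` class, the direct method calls standing in for network messages, and the `applied_count` counter are all assumptions added to make the behavior observable.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.data = {}
        self.next = None          # next leader in the ring
        self.applied_count = 0    # how many times this node applied a write

    def local_write(self, key, value):
        self._apply_and_forward(key, value, seen=frozenset())

    def receive(self, key, value, seen):
        if self.node_id in seen:
            return                # write already passed through here: stop
        self._apply_and_forward(key, value, seen)

    def _apply_and_forward(self, key, value, seen):
        self.data[key] = value
        self.applied_count += 1
        # Tag the write with this node's ID before passing it on.
        self.next.receive(key, value, seen | {self.node_id})


a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next, c.next = b, c, a   # circular topology: a -> b -> c -> a
a.local_write("title", "B")
```

The write travels a, b, c and is dropped when it arrives back at `a`, because `a` finds its own identifier in the tag set. Each node applies it exactly once.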

Leaderless Replication

The third approach abandons the concept of a leader entirely: any replica can accept writes directly from clients. This style was popularized by Amazon's Dynamo and is used by Cassandra, Riak, and Voldemort (often called Dynamo-style). The client (or a coordinator node acting on its behalf) sends each write to several replicas and reads from several replicas in parallel.
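A rough sketch of the client-side write and read paths, under several assumptions: each value carries a version number supplied by the caller, an unavailable replica is modeled by raising `ConnectionError`, and requests are issued in a loop rather than truly in parallel. The class and function names are hypothetical.

```python
class Replica:
    def __init__(self):
        self.store = {}   # key -> (version, value)
        self.up = True

    def put(self, key, version, value):
        if not self.up:
            raise ConnectionError("replica unavailable")
        current = self.store.get(key, (0, None))
        if version > current[0]:          # keep only the newest version
            self.store[key] = (version, value)

    def get(self, key):
        if not self.up:
            raise ConnectionError("replica unavailable")
        return self.store.get(key, (0, None))


def write(replicas, key, version, value, w):
    """Send the write to every replica; succeed if at least w confirm."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, version, value)
            acks += 1
        except ConnectionError:
            pass
    return acks >= w


def read(replicas, key, r):
    """Query every replica; return the value with the highest version."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica.get(key))
        except ConnectionError:
            continue
    if len(replies) < r:
        raise RuntimeError("not enough replicas responded")
    return max(replies, key=lambda reply: reply[0])[1]


replicas = [Replica(), Replica(), Replica()]
write(replicas, "k", 1, "old", w=2)
replicas[2].up = False                     # one replica misses the next write
ok = write(replicas, "k", 2, "new", w=2)
replicas[2].up = True                      # it comes back, still stale
value = read(replicas, "k", r=2)
```

The third replica still holds the old version after it recovers, but the read compares versions across replies and returns the newer value.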

Quorum Reads and Writes

To know that a read sees the latest write despite some nodes being down, leaderless systems use quorums. With n replicas, a write must be confirmed by w nodes and a read must query r nodes. As long as w + r > n, the set of nodes written and the set read are guaranteed to overlap by at least one node — so a read will see at least one up-to-date copy.

quorum condition
n = 3 replicas

  w + r > n   guarantees the write set and read set overlap

  example:  w = 2, r = 2,  n = 3
            2 + 2 = 4 > 3   ✓  every read sees the latest write

  tune for reads:   w = 3, r = 1   (fast reads, slow/fragile writes)
  tune for writes:  w = 1, r = 3   (fast writes, must read widely)
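The overlap claim can be checked by brute force for small clusters. The sketch below enumerates every possible write set and read set and confirms they always share a node exactly when w + r > n; the function names are illustrative.

```python
from itertools import combinations


def is_strict_quorum(n, w, r):
    """The quorum condition from the text."""
    return w + r > n


def always_overlaps(n, w, r):
    """True if every possible w-node write set shares a node with every r-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


ok_2_2 = always_overlaps(3, 2, 2)   # the w = 2, r = 2, n = 3 example
ok_1_2 = always_overlaps(3, 1, 2)   # w + r = n: a read can miss the write
```

With w = 1 and r = 2 on three nodes, a write that landed on one node can be missed by a read of the other two, which is why the inequality must be strict.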

Keeping Replicas in Sync

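When a down node comes back, it has missed writes. Two mechanisms heal this: read repair, where a client that reads from several replicas in parallel notices a stale value and writes the newer one back, and anti-entropy, a background process that looks for differences between replicas and copies over missing data.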

Limitations of Quorums and Detecting Concurrent Writes

Quorums are not airtight. Sloppy quorums (with hinted handoff) let writes go to other reachable nodes when the "home" nodes are unavailable, increasing availability but weakening the overlap guarantee. And because multiple clients can write concurrently to a leaderless store, conflicts arise just like in multi-leader. Detecting them requires reasoning about the "happens-before" relationship: two operations are concurrent if neither knew about the other. Systems use version numbers per key, or version vectors across replicas, to tell whether one write supersedes another or whether they're truly concurrent (in which case the conflicting values, called siblings, must be merged by the application).
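The happens-before test on version vectors can be sketched as a pairwise comparison. This assumes a version vector is a mapping from replica ID to a counter, with missing entries treated as zero; the function name and return labels are hypothetical.

```python
def compare(a, b):
    """Compare two version vectors (dicts of replica id -> counter).

    Returns "equal", "a_before_b", "b_before_a", or "concurrent".
    """
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"    # b has seen everything a has: b supersedes a
    if b_le_a:
        return "b_before_a"
    return "concurrent"        # neither saw the other: keep both as siblings


superseded = compare({"r1": 1}, {"r1": 2})
siblings = compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1})
```

In the second call each vector is ahead on a different replica, so neither write knew about the other and both values would be kept as siblings for the application to merge.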

Dimension           Single-leader                Multi-leader            Leaderless
Who accepts writes  One node                     Several nodes           Any replica
Write conflicts     None (serialized)            Yes (must resolve)      Yes (must resolve)
Handling node loss  Failover (hard)              Other leaders continue  Quorum tolerates loss
Consistency         Strong-ish (if read leader)  Eventual                Eventual (tunable quorum)
Examples            PostgreSQL, MySQL            Multi-DC, CouchDB       Cassandra, Riak, Dynamo

takeaway

Replication is the foundation of every distributed datastore, and the three strategies are a spectrum of trade-offs between write availability and the cost of resolving conflicts. Single-leader avoids conflicts but bottlenecks and fails awkwardly; multi-leader and leaderless buy availability by forcing you to reconcile concurrent writes. Knowing where a given database sits — and what it does about lag and conflicts — is the heart of a strong distributed-systems interview answer.

🎯 interview hot-takes

Sync vs async replication? Sync guarantees a durable up-to-date copy but blocks on a slow follower; async is fast but can lose acknowledged writes on failover. Semi-sync (one sync follower) is the common compromise.
Why is failover hard? Lost async writes, split brain (two leaders), bad timeout tuning, and dangerous interactions with external systems — which is why many teams do it manually.
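What does w + r > n give you? The read and write quorums overlap by at least one node, so a read reaches at least one replica holding the most recent acknowledged write. This is the core leaderless consistency knob, though sloppy quorums and concurrent writes weaken the guarantee in practice.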
How do you detect concurrent writes? Track the happens-before relation with version numbers or version vectors; writes where neither saw the other are concurrent siblings that need merging.
