Chapter 4 closes Part I of DDIA by tackling a problem that every long-lived system faces: applications change, and when code changes, the shape of the data it produces changes too. New features add fields, refactors rename them, and old features get removed. Meanwhile the data already written to disk and in flight over the network doesn't magically update itself. This chapter is about how data is encoded (turned into bytes), how those encodings can evolve without breaking running systems, and the three ways data flows between processes — through databases, through services, and through message brokers.

⚡ Quick Takeaways
  • Two directions of compatibilitybackward: new code can read old data (usually easy); forward: old code can read new data (harder — it must ignore fields it doesn't understand).
  • Rolling upgrades force both at once — during a staged deploy, old and new versions run side by side, so the format must be readable in both directions simultaneously.
  • Language-specific serialization is a trap — Java Serializable, Python pickle, and friends bring lock-in, security holes, poor versioning, and bloat. Avoid for anything persisted or shared.
  • JSON/XML/CSV are ubiquitous but sloppy — ambiguous numbers, no binary strings, optional schemas. Great for openness, weak for precision and size.
  • Schema-based binary formats win at scale — Protobuf and Thrift use numbered field tags; Avro matches a writer's schema against a reader's schema. Both encode evolution rules explicitly.
  • "Data outlives code" — your encoding choice decides how painful it is to evolve a system that's already running in production.
tldr

Encoding turns in-memory objects into bytes; decoding reverses it. Because old and new code coexist during rolling upgrades, formats must be both backward- and forward-compatible. Schema-based binary formats (Protobuf, Thrift, Avro) make evolution safe and compact. Data flows three ways — databases (writer and reader separated by time), services (REST/RPC, separated by the network), and message brokers (asynchronous, decoupled) — and every one of them is an encoding boundary.

Why Encoding Matters

Programs work with data in (at least) two different representations. In memory, data lives in objects, structs, lists, arrays, hash tables, and trees — structures optimized for efficient access and manipulation by the CPU, full of pointers. To write data to a file or send it over the network, you must translate it into a self-contained sequence of bytes — there are no pointers in a byte stream that another process can follow. The translation from the in-memory representation to a byte sequence is called encoding (also serialization or marshalling), and the reverse is decoding (parsing, deserialization, unmarshalling).

Because this happens constantly — every database write, every API call, every message — the choice of encoding has outsized effects on efficiency, and, crucially, on how easily the system can change over time.

Rolling Upgrades and Coexisting Versions

Large applications are rarely deployed all at once. Server-side systems use a rolling upgrade (staged rollout): a few nodes get the new version, you check nothing is broken, then you continue until all nodes are updated. Client-side apps are at the mercy of users who may not update for weeks. The consequence is unavoidable: old and new versions of the code, and old and new data formats, all coexist in the system at the same time.

For the system to keep running smoothly, compatibility has to hold in both directions:

key distinction

Backward compatibility looks back in time: new code reading old data. Forward compatibility looks forward: old code reading data from the future. The tricky one is forward compatibility — it requires the format and the code to gracefully skip unknown fields. Schema-based formats build this in; ad-hoc parsing usually doesn't.

Language-Specific Formats

Many languages ship built-in serialization: Java has java.io.Serializable, Python has pickle, Ruby has Marshal. They're tempting because they let you save and restore in-memory objects with minimal code. But Kleppmann is blunt about why they're a bad choice for anything beyond throwaway use:

The verdict: language-specific formats are fine for transient, same-process, same-version use, and a liability for anything you persist or send across a boundary.

Textual Formats: JSON, XML, CSV

The standardized, language-independent encodings most developers reach for are JSON, XML, and CSV. They're human-readable, supported everywhere, and great for openness. They also have real flaws that bite at scale:

Despite all this, textual formats remain an excellent default for many purposes — for public APIs and human-facing data, ubiquity and tooling usually outweigh the inefficiency. The problems matter most when you're encoding huge volumes or need precise, compact, evolvable data internally.

Binary Encoding

For data used only inside your organization, you can choose a format that's far more compact and faster to parse. A first step is a binary encoding of JSON — formats like MessagePack, BSON, and others. These shrink the data somewhat, but because they don't have a schema, they must still include all the object's field names in the encoded bytes. That's the key inefficiency the next family of formats removes.

The insight: if both the writer and the reader agree on a schema ahead of time, the field names never have to travel with the data. You can replace each field name with a compact numeric tag, and you gain type information for free. This is the foundation of Thrift, Protocol Buffers, and Avro.

Thrift and Protocol Buffers

Apache Thrift (originally from Facebook) and Protocol Buffers (Protobuf, from Google) are closely related binary encoding libraries. Both require the data to be described by a schema written in an interface definition language (IDL), and both ship a code-generation tool that produces classes in many languages from that schema.

person.proto — Protocol Buffers schema
message Person {
  required string user_name       = 1;   // tag 1
  optional int64  favorite_number = 2;   // tag 2 — safe to add later
  repeated string interests       = 3;   // 0..n values
}

Field Tags and the Encoding

The crucial detail is that each field has a numeric tag (the = 1, = 2, = 3). In the encoded bytes, the tag — not the field name — identifies the field, along with the field's type and value. Field names exist only in the schema, so they cost nothing at runtime. This is what makes the encoding compact.

Schema Evolution Rules

Because tags carry the meaning, the rules for changing a schema safely fall out naturally:

Thrift and Protobuf differ in details — Thrift offers several encoding flavors (BinaryProtocol, CompactProtocol) and a richer set of container types — but the field-tag mechanism and the evolution rules are essentially the same.

Avro

Apache Avro (born from the Hadoop ecosystem) takes a different approach. It also uses a schema, but the encoded bytes contain nothing but values — no tag numbers, no field names, no type annotations. That makes Avro encodings the most compact of the three, but it raises an obvious question: how does the reader know what the bytes mean?

person.avsc — Avro schema (JSON form)
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests",      "type": {"type": "array", "items": "string"}}
  ]
}

Writer's Schema vs Reader's Schema

Avro's answer is the key idea of the chapter. When data is encoded, it's encoded with the writer's schema — whatever version the producing code had. When data is decoded, the reader expects a reader's schema — whatever version the consuming code has. These two schemas need not be identical; they only need to be compatible. The Avro library resolves the differences by looking at both schemas side by side:

This is why every field you add or remove must have a default — that default is exactly what lets schema resolution paper over the version gap, giving both backward and forward compatibility.

How the Reader Learns the Writer's Schema

The reader needs the writer's schema to decode. Avro handles this differently per context:

why avro shines

Because Avro has no tag numbers and matches fields by name, it's ideal when schemas are generated dynamically — for example, auto-derived from a relational database's columns. If the DB schema changes, you just generate a new Avro schema; there are no tag numbers to assign by hand and no risk of accidentally reusing one. That property is a big reason Avro is popular in Hadoop, Kafka, and data-pipeline tooling.

The Merits of Schemas

Stepping back, the schema-based binary formats share a set of advantages that explain why they dominate at scale, even though they're less convenient than dumping JSON:

AspectTextual (JSON/XML)Schema binary (Protobuf/Thrift/Avro)
ReadabilityHuman-readableOpaque bytes (need schema)
SizeVerbose; field names repeatCompact; names dropped
SchemaOptional, often skippedRequired and enforced
Field identityBy name in the dataBy tag (PB/Thrift) or schema (Avro)
EvolutionAd-hoc, error-proneExplicit rules, checkable
Best forOpen/public APIs, debuggingHigh-volume internal data

Modes of Dataflow

The second half of the chapter zooms out: whenever you send data to another process that doesn't share your memory, you encode it. There are three main modes of dataflow, and each is a place where compatibility matters.

Dataflow Through Databases

With a database, the process that writes encodes the data and the process that reads decodes it. Those two processes may be the same application at different times — so in a sense you are sending a message to your future self. Backward compatibility is clearly needed (future code must read what past code wrote). But forward compatibility matters too, and in a sticky way: during a rolling upgrade, new code may write a record with a new field, and then old code may read that record, modify it, and write it back. If the old code doesn't understand the new field, the danger is that it drops the field it didn't recognize — silently losing data. The fix is for the format and the code to preserve unknown fields on a round trip.

Kleppmann's slogan for this section is "data outlives code." You might deploy a new version of your code in minutes, but the data in your database can be years old. Rewriting (migrating) every old record to a new schema is expensive, so most databases instead allow simple schema changes — adding a column with a null default, for example — and decode old rows on the fly. LinkedIn's document store Espresso uses Avro precisely to get these evolution properties.

Dataflow Through Services: REST and RPC

When processes communicate over a network, the common arrangement is clients and servers: servers expose an API, clients call it. The web works this way (browsers and web servers), and server-side applications are increasingly decomposed into smaller services that call each other — service-oriented or microservices architecture. A key goal is that services can be deployed and evolved independently, which means old and new versions of clients and servers must interoperate — the same compatibility problem again.

Two broad philosophies for web services:

The Problems with RPC

Remote Procedure Call (RPC) frameworks try to make a network request look like calling a local function in your own process (this is called location transparency). Kleppmann argues this abstraction is fundamentally flawed, because a network request differs from a local call in ways you can't paper over:

interview-grade point

This is the chapter's connection to the "fallacies of distributed computing." Pretending the network is reliable, fast, and homogeneous — the lie that location-transparent RPC tells — is precisely what causes systems to behave badly in production. A good answer names the difference: a remote call can time out with an unknown outcome, which a local call never does, and that uncertainty drives the need for idempotency, retries, and timeouts.

Modern RPC frameworks are more honest about this: gRPC (built on Protobuf), Thrift, Finagle, and Avro RPC expose the asynchronous nature with futures/promises and streams, and add service discovery. RPC is still a fine fit for requests between services owned by the same organization, typically within one datacenter. For API evolution, services have it easier than databases: you can often update all the servers before the clients, so it's reasonable to maintain compatibility only across a few versions, signalling the version in the URL or an HTTP header.

Message-Passing Dataflow

The third mode sits between RPC and databases: asynchronous message passing via a message broker (RabbitMQ, ActiveMQ, Kafka, NATS, and others). A sender (producer) posts a message to a named queue or topic; the broker stores it and delivers it to one or more consumers. Like RPC, the message goes to another process with low latency; like a database, it goes through an intermediary that holds the data temporarily. The advantages over direct RPC:

Because messages are just byte sequences with some metadata, all the same encoding and compatibility concerns apply — and the asynchronous, decoupled nature actually makes forward/backward compatibility more important, since producers and consumers are deployed and upgraded entirely independently.

Distributed Actor Frameworks

A related model is the actor model: concurrency is expressed as actors — independent entities that hold local state and communicate only by sending each other asynchronous messages, sidestepping shared-memory threading problems. In a distributed actor framework (Akka, Microsoft Orleans, Erlang OTP), this message-passing model is extended transparently across nodes; because messages are already the unit of communication, the same framework scales an application from one machine to many. The catch returns one last time: when you do a rolling upgrade of an actor-based application, you must still ensure messages encoded by one version can be decoded by another.

takeaway

Encoding is where the abstract idea of "evolving a system" meets concrete bytes. Pick a format that makes compatibility explicit (schema-based binary for high-volume internal data; JSON for open APIs), and remember that databases, services, and message brokers are all encoding boundaries where old and new code meet. Get this right and you can change a running system fearlessly; get it wrong and every deploy risks silently corrupting or dropping data.

🎯 interview hot-takes

Backward vs forward compatibility? Backward = new code reads old data (easy). Forward = old code reads new data (hard; it must ignore unknown fields). Rolling upgrades need both at once.
Why are field tags so important in Protobuf/Thrift? The numeric tag, not the name, identifies a field in the bytes — that's what keeps it compact and what makes the evolution rules (add optional fields with new tags, never reuse a tag) safe.
What's special about Avro? It matches a writer's schema against a reader's schema by field name, with defaults filling gaps — no tag numbers — which is perfect for dynamically generated schemas like Kafka + Schema Registry.
Why is location-transparent RPC criticized? A network call can time out with an unknown result, unlike a local call; pretending otherwise ignores partial failure and forces you to design for idempotency, retries, and timeouts.

← previous
Storage & Retrieval