DDIA Ch.11 — Stream Processing

Chapter 10 assumed the input was a known, finite dataset. But in reality data arrives continuously and never ends: user clicks, sensor readings, payments, log lines. Waiting to accumulate a day's worth before processing introduces a day of latency. Stream processing takes the immutable-events ideas of batch processing and applies them to unbounded, incrementally-arriving data, processing each event soon after it happens. This chapter covers how event streams are transmitted (message brokers and logs), how databases and streams relate (CDC and event sourcing), and how to actually compute over streams (time, windows, joins, and fault tolerance).

⚡ Quick Takeaways

An event is a small, immutable record of something that happened, with a timestamp. Producers append events; consumers process them; related events form a topic/stream.
Two broker styles — traditional queues (AMQP/JMS) delete a message once acked and load-balance work; log-based brokers (Kafka) keep an append-only, replayable log and preserve order per partition.
Change data capture (CDC) turns a database's write log into a stream so derived systems (indexes, caches, warehouses) stay in sync — the DB is the leader, derivatives are followers.
Event sourcing stores all changes as an immutable event log that is the source of truth; current state is derived by replaying events.
Event time vs processing time is the central headache — events arrive late and out of order, so "is this window complete?" has no certain answer.
Stream joins and fault tolerance — stream-stream, stream-table, and table-table joins; exactly-once via checkpointing, idempotence, or atomic commit.

tldr

Streams are unbounded sequences of immutable events. Log-based brokers like Kafka (append-only, retained, replayable) generalize both messaging and database replication: change data capture and event sourcing both treat the log of changes as primary and derive state from it. Computing over streams forces you to confront time (event vs processing time, windows), to join streams against tables kept current via CDC, and to achieve exactly-once results despite an input that never ends.

Transmitting Event Streams

In stream processing, the equivalent of a "record" is an event: a small, self-contained, immutable object recording something that happened, usually with a timestamp. An event is generated once by a producer (publisher) and potentially processed by multiple consumers (subscribers); related events are grouped into a topic or stream. You could poll a datastore for new events, but polling frequently is expensive, so it's better for consumers to be notified when new events appear — which is what messaging systems provide.

Message Brokers: Queues vs Logs

You can send events directly from producer to consumer (e.g. UDP multicast, or brokerless libraries like ZeroMQ), but most systems use a message broker as a durable intermediary that buffers events. There are two fundamentally different broker designs:

Aspect	Traditional queue (AMQP/JMS)	Log-based (Kafka)
Storage model	Message deleted once acked	Append-only log, retained
Consumption	Assigned to one consumer, load-balanced	Consumers track an offset
Ordering	Lost on redelivery	Preserved within a partition
Replay	No — gone after processing	Yes — re-read from any offset
Best for	Task queues, slow/variable jobs	High-throughput, ordered, replayable

Traditional message queues assign each message to a consumer and delete it once acknowledged; multiple consumers share the load, but order isn't preserved across redeliveries. They suit job queues where each message triggers some (possibly slow) work. Log-based brokers like Kafka store messages in a partitioned, append-only log on disk; a consumer simply records its offset (position in the log), and messages are retained even after being read. This lets you replay old events, gives high throughput and ordering within a partition, and makes the broker behave a lot like a database's replication log — a connection the rest of the chapter exploits.

Databases and Streams

The deep insight of the chapter: a replication log is a stream of events, and that equivalence lets us keep many systems in sync. Applications often need to write the same data to several places (a database, a search index, a cache, a warehouse). Doing this with dual writes from the application is error-prone — concurrent writes race and a partial failure leaves systems inconsistent. Streams offer a cleaner way.

Change Data Capture (CDC)

Change data capture observes all the writes to a database — typically by tailing its replication log — and makes them available as a stream of change events. Downstream systems (search index, cache, data warehouse) consume that stream and apply the same changes in order. The original database is effectively the leader and the derived systems are followers, so they end up consistent (eventually). Tools like Debezium implement CDC against common databases. Because consumers may need to bootstrap, the stream can be combined with an initial snapshot, and log compaction (Kafka keeps only the latest event per key) lets a consumer rebuild full state from the log alone.

a change stream — one event per write

# CDC: each DB write becomes an immutable, ordered event
offset 1001  {"op":"insert", "table":"cart", "key":42, "qty":1}
offset 1002  {"op":"update", "table":"cart", "key":42, "qty":3}
offset 1003  {"op":"delete", "table":"cart", "key":42}

  index, cache, and warehouse each consume this log in order
  → all stay in sync with the database, no fragile dual writes.

Event Sourcing

A related idea from the domain-driven-design community: event sourcing stores all changes to application state as an immutable, append-only log of events — and those events, not a mutable "current state" table, are the source of truth. Current state is derived by replaying the events. The distinction from CDC: event-sourcing events are modeled at the application level (a meaningful action like "student cancelled enrollment"), capturing intent, whereas CDC events are low-level row changes. The benefits of treating the log as primary are large: you can rebuild any derived view, debug by replaying history, and evolve how you interpret events over time. A useful discipline is separating commands (requests that may be rejected) from events (facts that already happened and are immutable).

the unifying idea

Immutable events plus derived state is the same principle as batch processing's immutable inputs — just continuous. The truth is an ordered log of things that happened; databases, indexes, and caches are all materialized views of that log. This reframing (the "turning the database inside out" idea) is the bridge to Chapter 12.

Processing Streams

Once you have streams, what do you compute? Common uses: complex event processing (search for patterns of events, e.g. fraud detection), stream analytics (rolling aggregates and rates over time windows), and maintaining materialized views (keeping a derived dataset continuously up to date).

Reasoning About Time

The thorniest issue in stream processing is time. There are two clocks: the event time (when the event actually occurred) and the processing time (when the stream processor handled it). They diverge because of network delays, buffering, broker backlogs, and consumer restarts. If you aggregate by processing time, a consumer that falls behind and catches up will produce a misleading spike. Aggregating by event time is more correct but raises a hard question: "is the window complete?" — you can never be fully sure that no more straggler events for a past time window will arrive. Systems handle stragglers by declaring a window closed after a timeout (and either ignoring or correcting for late events).

Windows and Joins

Aggregations over an unbounded stream operate on windows: tumbling (fixed, non-overlapping), hopping (fixed, overlapping), sliding (events within an interval of each other), and session windows (grouped by activity bursts). Joining streams is subtler than batch joins because both sides keep moving:

Stream-stream (window) join — match events from two streams that occur close in time (e.g. a search query and the click that follows). The processor buffers recent events in a window.
Stream-table (enrichment) join — augment each event with data from a table (e.g. add user profile to an activity event). The processor keeps a local copy of the table, kept current via CDC from the database.
Table-table join — maintain a materialized view that is itself the join of two changing tables, updated as either side changes.

A recurring complication: because input order and timing aren't deterministic, the result of a stream join can depend on when each side's data arrived.

Fault Tolerance and Exactly-Once

Batch jobs tolerate faults by rerunning a task on immutable input — but a stream is unbounded, so you can't just "rerun from the start." Instead, stream processors use microbatching (Spark Streaming splits the stream into tiny batches) or periodic checkpointing of state (Flink) so a failed job resumes from the last checkpoint. The goal is exactly-once semantics (more precisely, effectively once): each event affects the output as if processed exactly one time, even across failures and retries. This is achieved with some combination of idempotent operations, atomic commit of output-plus-offset, and discarding work since the last checkpoint.

takeaway

Stream processing is batch processing's continuous twin, unified by the idea of an immutable, ordered event log as the source of truth. Log-based brokers (Kafka) made this practical by being replayable and durable. The genuinely hard parts are uniquely streaming: reconciling event time with processing time, deciding when a window is "done," and getting exactly-once results over an input that never ends.

🎯 interview hot-takes

Log-based broker vs traditional queue? A queue deletes messages once acked and load-balances; a log (Kafka) retains an append-only, ordered, replayable record that consumers read by offset.
CDC vs event sourcing? Both treat a change log as primary; CDC captures low-level row changes from the DB log, event sourcing records application-level events as the source of truth and derives state by replay.
Event time vs processing time? Event time = when it happened; processing time = when it was handled. They diverge with delays, so windowed aggregates must handle late "straggler" events and the "is the window complete?" problem.
How do streams achieve exactly-once? Checkpointing/microbatching plus idempotence and atomic commit of output-and-offset, so retries don't double-count.