In a monolith, when something breaks you read the logs and attach a debugger. In a distributed system of dozens of services — the kind every design on this site describes — a single user request fans out across many machines, and "why is this slow?" has no single place to look. Observability is the practice of instrumenting systems so you can answer questions about their behavior from the outside, including questions you didn't anticipate. Its foundation is the three pillars: metrics, logs, and traces. This is the operational flip side of the failures in our DDIA notes on distributed-systems trouble.
- Monitoring vs observability — monitoring watches known failure modes (dashboards/alerts you set up); observability lets you explore unknown ones after the fact.
- Three pillars: metrics, logs, traces. Metrics = cheap aggregatable numbers; logs = rich discrete events; traces = a request's path across services.
- Metrics drive dashboards and alerts (Prometheus pull model) — but watch cardinality: high-cardinality labels explode storage.
- Distributed tracing follows one request via a propagated trace ID across services, exposing where latency goes.
- Correlate the pillars with a shared trace ID — that's the real power, jumping metric → trace → logs.
- SLI/SLO/error budgets define "reliable enough"; alert on user-facing symptoms, not every internal cause, to avoid alert fatigue.
- OpenTelemetry is the vendor-neutral standard for emitting all three.
Observability instruments a system so you can understand its internal state from its outputs — crucial once requests span many services. Metrics (numeric time series) power dashboards and alerts; logs (discrete events) give detail; traces stitch a single request's journey across services via a propagated trace ID. Correlate the three to debug fast. Define reliability with SLIs/SLOs and error budgets, alert on symptoms not causes, and emit everything through OpenTelemetry. Mind cardinality and cost.
Monitoring vs Observability
The two are related but distinct. Monitoring is watching predefined signals for predefined problems — "alert me when CPU > 90% or error rate spikes." It answers known questions (known-unknowns). Observability is a property of the system: it has enough high-quality, queryable telemetry that you can ask new questions you never anticipated — "why are requests from this one customer, on this API version, slow only in the EU?" (unknown-unknowns). Monitoring tells you that something is wrong; observability helps you discover why, even for novel failures.
The Three Pillars
| Pillar | What it is | Good at / cost |
|---|---|---|
| Metrics | Numeric measurements over time (counters, gauges, histograms) | Cheap, aggregatable, great for alerts & trends; low detail |
| Logs | Timestamped discrete event records | Rich detail per event; expensive at high volume |
| Traces | The end-to-end path of one request across services | Shows where latency/errors occur; often sampled |
Metrics
Metrics are numbers measured over time: request rate, error count, latency percentiles, queue depth, CPU. They're cheap to store and fast to query because each data point is an aggregated number rather than a record per event, which makes them ideal for dashboards and alerting. A common setup is Prometheus, which pulls (scrapes) metrics from each service's /metrics endpoint at a fixed interval and stores them as time series labeled with dimensions; Grafana is often used to visualize them.
http_requests_total{service="api", route="/checkout", status="500"} 4271
http_request_duration_seconds{service="api", quantile="0.99"} 0.842
alert: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
Each unique combination of label values is a separate time series. Adding a high-cardinality label like user_id or request_id to a metric can create millions of series and blow up storage and query cost. Keep metric labels low-cardinality (service, route, status) — put high-cardinality identifiers in logs/traces, not metrics.
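The multiplication behind the cardinality warning can be sketched with a toy in-memory counter. This is not the Prometheus client library; `LabeledCounter` and the route, status, and user values are made up for illustration. It only shows that every distinct label combination becomes its own stored series.

```python
from collections import Counter
from itertools import product


class LabeledCounter:
    """Toy counter: one stored value per unique combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.series = Counter()  # key: tuple of label values -> count

    def inc(self, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self.series[key] += 1

    def series_count(self):
        return len(self.series)

    def expose(self):
        """Render the series in a Prometheus-like text form."""
        lines = []
        for key, value in sorted(self.series.items()):
            pairs = ", ".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


routes = ["/checkout", "/cart", "/search"]   # hypothetical routes
statuses = ["200", "404", "500"]

# Low-cardinality labels: series count is bounded by len(routes) * len(statuses).
low = LabeledCounter("http_requests_total", ["route", "status"])
for route, status in product(routes, statuses):
    low.inc(route=route, status=status)

# Adding user_id multiplies the series count by the number of distinct users.
high = LabeledCounter("http_requests_total", ["route", "status", "user_id"])
for user_id in range(10_000):
    for route, status in product(routes, statuses):
        high.inc(route=route, status=status, user_id=str(user_id))

print(low.series_count())   # 9
print(high.series_count())  # 90000
```

With three routes and three statuses the metric stays at 9 series no matter how much traffic arrives; adding a label with 10,000 distinct values turns the same metric into 90,000 series.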
Logs
Logs are discrete, timestamped records of events — the richest, most detailed signal. The key practice is structured logging: emit logs as machine-parseable key-value/JSON (with fields like trace_id, user_id, latency_ms) rather than free-text strings, so they can be searched and aggregated. Logs are typically shipped to a search backend like Elasticsearch (the ELK/Elastic stack) or a log service. Their weakness is cost at scale — high-volume services generate enormous log data — so teams sample, set retention, and log judiciously on hot paths.
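As a rough illustration of structured logging, the sketch below uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `fields` key, and the sample IDs are assumptions made for this example rather than any particular library's convention.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra={"fields": {...}}` land on the record as `record.fields`.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def make_logger(name, stream=None):
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stand-in for stdout or a log shipper
log = make_logger("order-service", stream=buffer)
log.info(
    "payment failed",
    # Hypothetical identifiers, for illustration only.
    extra={"fields": {"trace_id": "trace-abc", "user_id": "u-123", "latency_ms": 210}},
)

line = buffer.getvalue().strip()
parsed = json.loads(line)  # every line is machine-parseable
print(parsed["trace_id"], parsed["latency_ms"])
```

Because each line is a JSON object, a search backend can index `trace_id` and `latency_ms` as fields instead of pattern-matching free text.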
Traces
In a distributed system, a single request hops through many services, and neither metrics nor logs alone show that journey. Distributed tracing does: each request gets a unique trace ID that is propagated through every service it touches, and each unit of work records a span (with timing, parent span, and metadata). Assembled, the spans form a tree showing exactly where the request spent time and where it failed.
trace 7f3c… │ POST /checkout 320ms
├─ api-gateway ▇▇ 20ms
├─ order-service ▇▇▇▇▇▇ 90ms
│  └─ db: INSERT ▇▇▇ 55ms ◀ slow!
└─ payment-service ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 210ms ◀ the bottleneck
   └─ external PSP ▇▇▇▇▇▇▇▇▇▇▇▇▇ 190ms
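The trace immediately localizes the latency to the payment service's external call, something an aggregate dashboard alone would be unlikely to pinpoint for a single request. Because tracing every request is expensive, systems usually sample: either keep a fixed fraction chosen when the request starts (head-based sampling), or use tail-based sampling, which decides after the request completes so that the interesting (slow or errored) traces are kept.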
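To make "where the request spent time" concrete, here is a minimal sketch that rebuilds the checkout trace above from flat span records and computes each span's self time (its duration minus its direct children). The `Span` type and the span IDs are invented for the example, and it assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


def self_times(spans):
    """Return {span name: time not accounted for by its direct children}."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# Durations come from the checkout trace; span IDs are made up.
spans = [
    Span("a", None, "POST /checkout", 320),
    Span("b", "a", "api-gateway", 20),
    Span("c", "a", "order-service", 90),
    Span("d", "c", "db: INSERT", 55),
    Span("e", "a", "payment-service", 210),
    Span("f", "e", "external PSP", 190),
]

times = self_times(spans)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # external PSP 190
```

The payment service itself accounts for only 20ms of its 210ms; the remaining 190ms sits in the external call, which is the conclusion the diagram leads to.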
Correlation: the Real Power
The three pillars are far more useful together than apart. The connective tissue is a shared trace ID stamped onto metric exemplars, logs, and spans alike. The debugging loop becomes: a metric alert shows error rate up → drill into a trace of a failing request to see which service broke → jump to that service's logs for that exact trace ID to read the error detail. Metric (what & when) → trace (where) → log (why), all linked. Wiring this correlation is the difference between observability that helps and three disconnected data silos.
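The metric → trace → log loop can be sketched over plain in-memory records. Everything here is invented for illustration (the record shapes, the `exemplar_trace_id` field, the trace IDs); real backends expose this through their own query interfaces.

```python
# All data below is invented for illustration.
error_metric = {
    "name": "http_requests_total",
    "labels": {"service": "api", "status": "500"},
    "value": 4271,
    "exemplar_trace_id": "trace-42",  # one sample request behind the number
}

spans = [
    {"trace_id": "trace-42", "service": "api-gateway", "error": False},
    {"trace_id": "trace-42", "service": "payment-service", "error": True},
    {"trace_id": "trace-07", "service": "order-service", "error": False},
]

logs = [
    {"trace_id": "trace-07", "service": "order-service", "message": "order created"},
    {"trace_id": "trace-42", "service": "payment-service", "message": "PSP timeout"},
    {"trace_id": "trace-42", "service": "api-gateway", "message": "upstream returned 500"},
]


def investigate(metric, spans, logs):
    """Follow metric -> trace -> logs using the shared trace ID."""
    # What & when: the alerting metric points at one example request.
    trace_id = metric["exemplar_trace_id"]
    # Where: which services in that trace recorded an error.
    failing = [s["service"] for s in spans
               if s["trace_id"] == trace_id and s["error"]]
    # Why: log lines from the failing services for that exact trace.
    details = [entry["message"] for entry in logs
               if entry["trace_id"] == trace_id and entry["service"] in failing]
    return trace_id, failing, details


result = investigate(error_metric, spans, logs)
print(result)  # ('trace-42', ['payment-service'], ['PSP timeout'])
```

Without the shared ID, the second and third steps turn into guessing by timestamp and service name.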
SLIs, SLOs, SLAs, and Error Budgets
Observability needs a definition of "healthy," which the SRE vocabulary provides:
| Term | Meaning |
|---|---|
| SLI (indicator) | A measured signal of quality, e.g. % of requests < 200ms, or success rate |
| SLO (objective) | The target for an SLI, e.g. "99.9% of requests succeed over 30 days" |
| SLA (agreement) | A contractual promise to customers with penalties (usually looser than the SLO) |
| Error budget | 1 − SLO: the allowed unreliability (0.1% for a 99.9% SLO); spend it on releases, halt when exhausted |
The error budget is the clever bit: it turns reliability into a number you can spend. If you're within budget, ship fast; if you've burned it (too many recent failures), freeze risky changes and stabilize. It aligns dev velocity and reliability instead of pitting them against each other.
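The arithmetic is simple enough to sketch directly. The function names and traffic numbers below are illustrative assumptions; the 99.9% target over 30 days is the example from the table.

```python
def error_budget(slo):
    """Fraction of events allowed to fail, i.e. 1 - SLO."""
    return 1.0 - slo


def allowed_failures(slo, total_requests):
    """How many failed requests the budget permits in the window."""
    return error_budget(slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the budget still unspent (negative means overspent)."""
    allowed = allowed_failures(slo, total_requests)
    return (allowed - failed_requests) / allowed


slo = 0.999  # "99.9% of requests succeed over 30 days"

# Request-based view: 10 million requests in the window (hypothetical volume).
print(round(allowed_failures(slo, 10_000_000)))            # 10000
print(round(budget_remaining(slo, 10_000_000, 2_500), 3))  # 0.75

# Time-based view: the same 0.1% expressed as minutes of a 30-day window.
minutes_in_30_days = 30 * 24 * 60
print(round(error_budget(slo) * minutes_in_30_days, 1))    # 43.2
```

In this example 2,500 failures have used a quarter of the budget, so there is still room to ship; at 10,000 failures the remaining budget reaches zero and the freeze described above applies.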
Alerting
The golden rule: alert on symptoms, not causes. Page a human when users are actually affected (error rate up, latency SLO burning), not on every internal blip (one node's CPU, a transient retry) that may self-heal. Cause-based alerts produce alert fatigue — so many pages that real ones get ignored. Tie alerts to SLO burn rate so urgency matches user impact. Google's "four golden signals" (latency, traffic, errors, saturation) are a good default set of what to watch.
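One way to tie paging to SLO burn rate is sketched below. Burn rate here means the observed error rate divided by the error budget, so 1.0 spends the budget exactly over the SLO window. The two-window check and the 14.4 threshold are illustrative assumptions (at that rate, a 30-day budget loses about 2% in one hour), not values taken from this article.

```python
def burn_rate(error_rate, slo):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo)


def should_page(long_window_error_rate, short_window_error_rate, slo, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)


slo = 0.999

# 2% of requests failing against a 0.1% budget: burning about 20x too fast.
print(round(burn_rate(0.02, slo), 1))    # 20.0
print(should_page(0.02, 0.02, slo))      # True

# Same hour-long average, but the last few minutes have recovered.
print(should_page(0.02, 0.0005, slo))    # False

# A small blip: 0.5% errors is 5x burn, below the paging threshold.
print(should_page(0.005, 0.005, slo))    # False
```

The alert fires on what users experience (failed requests relative to the target), regardless of which internal cause produced the failures.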
OpenTelemetry
Historically each signal and vendor had its own agent and format. OpenTelemetry (OTel) is the now-standard, vendor-neutral framework for instrumenting code and emitting metrics, logs, and traces in a common format, exported to whatever backend you choose (Prometheus, Jaeger, a SaaS). It decouples instrumentation from the backend, so you instrument once and can switch vendors without re-instrumenting, which is a large part of why it has become a common default.
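Part of that common format is how the trace ID travels between services. OpenTelemetry SDKs typically default to the W3C Trace Context `traceparent` HTTP header, which has the shape `version-traceid-parentid-flags`. The sketch below hand-rolls that header to show the mechanics; it is not the OpenTelemetry API (in real code the SDK's propagators do this for you), and the function names are made up.

```python
import re
import secrets

# 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Keep the trace ID and flags, mint a new span ID for the next hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed or missing: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# A gateway receives a request with no header, then calls a downstream service.
at_gateway = new_traceparent()
to_downstream = child_traceparent(at_gateway)
print(at_gateway)
print(to_downstream)  # same trace ID, different span ID
```

Every hop keeps the same trace ID and replaces only the span ID, which is what lets a backend reassemble the spans into one tree.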
Pitfalls
- Cardinality explosion — high-cardinality metric labels create millions of series; keep IDs out of metrics.
- Log cost — verbose logging at scale is expensive; sample, structure, and set retention.
- Dashboard sprawl — hundreds of unread dashboards; curate a few that map to SLOs.
- Alert fatigue — too many cause-based pages train people to ignore alerts; alert on symptoms.
- Uncorrelated signals — metrics, logs, and traces in separate silos with no shared trace ID make debugging slow.
Observability is how you debug systems too distributed to reason about by hand. Metrics tell you something's wrong and trend it cheaply; traces show where across services; logs explain why — and a shared trace ID links them. Define reliability with SLOs and error budgets, alert on user-facing symptoms to avoid fatigue, and standardize collection on OpenTelemetry while guarding against cardinality and cost.
Monitoring vs observability? Monitoring watches known failure modes (preset dashboards/alerts); observability lets you explore unknown ones after the fact from rich telemetry.
The three pillars? Metrics (cheap aggregatable numbers → alerts/trends), logs (detailed discrete events), traces (one request's path across services). Correlate via a shared trace ID.
What's the cardinality trap? High-cardinality metric labels (user_id, request_id) create huge numbers of time series and blow up cost — keep them in logs/traces instead.
SLI vs SLO vs SLA vs error budget? SLI = measured signal, SLO = internal target, SLA = customer contract, error budget = 1−SLO, the unreliability you're allowed to spend.
What should you alert on? User-facing symptoms (SLO burn, error rate), not every internal cause — to avoid alert fatigue.