In a monolith, when something breaks you read the logs and attach a debugger. In a distributed system of dozens of services — the kind every design on this site describes — a single user request fans out across many machines, and "why is this slow?" has no single place to look. Observability is the practice of instrumenting systems so you can answer questions about their behavior from the outside, including questions you didn't anticipate. Its foundation is the three pillars: metrics, logs, and traces. This is the operational flip side of the failures in our DDIA notes on distributed-systems trouble.

⚡ Quick Takeaways
  • Monitoring vs observability — monitoring watches known failure modes (dashboards/alerts you set up); observability lets you explore unknown ones after the fact.
  • Three pillars: metrics, logs, traces. Metrics = cheap aggregatable numbers; logs = rich discrete events; traces = a request's path across services.
  • Metrics drive dashboards and alerts (Prometheus pull model) — but watch cardinality: high-cardinality labels explode storage.
  • Distributed tracing follows one request via a propagated trace ID across services, exposing where latency goes.
  • Correlate the pillars with a shared trace ID — that's the real power, jumping metric → trace → logs.
  • SLI/SLO/error budgets define "reliable enough"; alert on user-facing symptoms, not every internal cause, to avoid alert fatigue.
  • OpenTelemetry is the vendor-neutral standard for emitting all three.
tldr

Observability instruments a system so you can understand its internal state from its outputs — crucial once requests span many services. Metrics (numeric time series) power dashboards and alerts; logs (discrete events) give detail; traces stitch a single request's journey across services via a propagated trace ID. Correlate the three to debug fast. Define reliability with SLIs/SLOs and error budgets, alert on symptoms not causes, and emit everything through OpenTelemetry. Mind cardinality and cost.

Monitoring vs Observability

The two are related but distinct. Monitoring is watching predefined signals for predefined problems — "alert me when CPU > 90% or error rate spikes." It answers known questions (known-unknowns). Observability is a property of the system: it has enough high-quality, queryable telemetry that you can ask new questions you never anticipated — "why are requests from this one customer, on this API version, slow only in the EU?" (unknown-unknowns). Monitoring tells you that something is wrong; observability helps you discover why, even for novel failures.

The Three Pillars

Pillar | What it is | Good at / cost
Metrics | Numeric measurements over time (counters, gauges, histograms) | Cheap, aggregatable, great for alerts & trends; low detail
Logs | Timestamped discrete event records | Rich detail per event; expensive at high volume
Traces | The end-to-end path of one request across services | Shows where latency/errors occur; often sampled

Metrics

Metrics are numbers measured over time: request rate, error count, latency percentiles, queue depth, CPU. They're cheap to store and fast to query because they're aggregated, which makes them ideal for dashboards and alerting. The common model is Prometheus, which pulls (scrapes) metrics from each service's /metrics endpoint at regular intervals and stores them as time series labeled with dimensions; Grafana visualizes them.
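
To make "time series labeled with dimensions" concrete, here is a minimal sketch of a labeled counter that renders one text line per label combination, roughly the shape a scrape would read. `LabeledCounter` is a hypothetical helper written for illustration, not the API of any real client library, and the rendering is simplified.

```python
from collections import Counter


class LabeledCounter:
    """Hypothetical in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        # One key per unique combination of label values = one time series.
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # Emit one 'name{label="value",...} count' line per series.
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


requests_total = LabeledCounter("http_requests_total", ["service", "route", "status"])
requests_total.inc(service="api", route="/checkout", status="200")
requests_total.inc(service="api", route="/checkout", status="200")
requests_total.inc(service="api", route="/checkout", status="500")

exposition = requests_total.render()
for line in exposition:
    print(line)
```

Two label combinations were observed, so two series are rendered; the counter only ever goes up, and rates are derived at query time.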

a metric with labels (Prometheus style)
http_requests_total{service="api", route="/checkout", status="500"}  4271
http_request_duration_seconds{service="api", quantile="0.99"}       0.842

  alert: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01

the cardinality trap

Each unique combination of label values is a separate time series. Adding a high-cardinality label like user_id or request_id to a metric can create millions of series and blow up storage and query cost. Keep metric labels low-cardinality (service, route, status) — put high-cardinality identifiers in logs/traces, not metrics.
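
The arithmetic behind the trap is just a product. The sketch below computes a worst-case series count from per-label cardinalities; the label counts are made-up numbers for illustration, and real systems usually see fewer series because not every combination occurs.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, chosen only to show the scale of the effect.
low = {"service": 20, "route": 50, "status": 5}
high = {**low, "user_id": 1_000_000}

low_series = worst_case_series(low)
high_series = worst_case_series(high)
print(low_series)   # 5000
print(high_series)  # 5000000000
```

One extra label with a million values turns thousands of potential series into billions, which is why identifiers like user_id belong in logs and traces.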

Logs

Logs are discrete, timestamped records of events — the richest, most detailed signal. The key practice is structured logging: emit logs as machine-parseable key-value/JSON (with fields like trace_id, user_id, latency_ms) rather than free-text strings, so they can be searched and aggregated. Logs are typically shipped to a search backend like Elasticsearch (the ELK/Elastic stack) or a log service. Their weakness is cost at scale — high-volume services generate enormous log data — so teams sample, set retention, and log judiciously on hot paths.
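
As a rough sketch of structured logging using only Python's standard library, the formatter below turns each record into one JSON object. The `fields` key passed through `extra` is a convention invented for this example, and the logger name and field values are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # 'fields' is this example's own convention for structured context.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "payment failed",
    extra={"fields": {"trace_id": "abc123", "user_id": "u-42", "latency_ms": 210}},
)

event = json.loads(stream.getvalue())
print(event["trace_id"], event["latency_ms"])
```

Because every field is a key rather than part of a sentence, a backend can filter on trace_id or aggregate latency_ms without regex parsing.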

Traces

In a distributed system, a single request hops through many services, and neither metrics nor logs alone show that journey. Distributed tracing does: each request gets a unique trace ID that is propagated through every service it touches, and each unit of work records a span (with timing, parent span, and metadata). Assembled, the spans form a tree showing exactly where the request spent time and where it failed.
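
Propagation can be sketched in a few lines. The article does not name a wire format, so this example assumes the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`); the `Span` class, function names, and span names are hypothetical, and validation is omitted.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span that caused this one; None at the root
    name: str


def start_span(name, incoming_traceparent=None):
    """Continue the caller's trace if a header arrived, else start a new one."""
    if incoming_traceparent:
        _version, trace_id, parent_id, _flags = incoming_traceparent.split("-")
    else:
        trace_id, parent_id = secrets.token_hex(16), None
    return Span(trace_id, secrets.token_hex(8), parent_id, name)


def outgoing_traceparent(span):
    """Header value to attach to downstream calls ('01' marks it as sampled)."""
    return f"00-{span.trace_id}-{span.span_id}-01"


# Service A starts the trace; service B receives the header and continues it.
root = start_span("api-gateway: POST /checkout")
child = start_span("order-service: create order", outgoing_traceparent(root))

print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```

The shared trace_id groups the spans, and the parent_id links are what let a backend reassemble them into a tree.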

a trace = spans across services (trace_id propagated)
trace 7f3c… │ POST /checkout                              320ms
            ├─ api-gateway        ▇▇                        20ms
            ├─ order-service      ▇▇▇▇▇▇                    90ms
            │   └─ db: INSERT     ▇▇▇                       55ms  ◀ slow!
            └─ payment-service    ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇          210ms  ◀ the bottleneck
                └─ external PSP   ▇▇▇▇▇▇▇▇▇▇▇▇▇            190ms

The trace immediately localizes the latency to the payment service's external call, something an aggregate dashboard alone would not pinpoint. Because tracing every request is expensive, systems usually sample: either keep a fixed fraction chosen up front (head-based sampling), or use tail-based sampling, which decides after the trace completes and keeps the interesting ones (slow or errored).
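
A tail-based sampling decision might look like the sketch below: always keep traces that errored or ran slow, and keep a small random share of the rest. The thresholds, the span dictionary shape, and the function name are assumptions made for illustration.

```python
import random


def keep_trace(spans, slow_ms=250, baseline_rate=0.01, rng=random.random):
    """Decide, after a trace has finished, whether to store it."""
    has_error = any(s.get("error") for s in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or duration >= slow_ms:
        return True                  # always keep the interesting traces
    return rng() < baseline_rate     # keep a small sample of normal traffic


# Span timings loosely modeled on the checkout trace above (offsets assumed).
checkout_trace = [
    {"name": "api-gateway", "start_ms": 0, "end_ms": 20},
    {"name": "order-service", "start_ms": 20, "end_ms": 110},
    {"name": "payment-service", "start_ms": 110, "end_ms": 320},
]

decision = keep_trace(checkout_trace)
print(decision)  # True: 320 ms exceeds the assumed 250 ms threshold
```

The trade-off is that the collector must buffer every span until the trace completes, which costs memory that head-based sampling avoids.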

Correlation: the Real Power

The three pillars are far more useful together than apart. The connective tissue is a shared trace ID stamped onto metric exemplars, logs, and spans alike. The debugging loop becomes: a metric alert shows error rate up → drill into a trace of a failing request to see which service broke → jump to that service's logs for that exact trace ID to read the error detail. Metric (what & when) → trace (where) → log (why), all linked. Wiring this correlation is the difference between observability that helps and three disconnected data silos.
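
The drill-down loop can be sketched as a join on trace ID. Everything here is hypothetical in-memory data standing in for a metrics store, a trace backend, and a log index; a real setup would query those systems instead.

```python
# An exemplar attached to a metric sample points at one concrete trace.
exemplar = {"metric": "http_requests_total", "status": "500", "trace_id": "t-1"}

spans = [
    {"trace_id": "t-1", "service": "api-gateway", "error": False},
    {"trace_id": "t-1", "service": "payment-service", "error": True},
    {"trace_id": "t-2", "service": "order-service", "error": False},
]

logs = [
    {"trace_id": "t-1", "service": "payment-service", "message": "PSP timeout"},
    {"trace_id": "t-2", "service": "order-service", "message": "order created"},
]


def drill_down(trace_id, spans, logs):
    """Trace answers 'where'; logs for the same trace ID answer 'why'."""
    failing = [s["service"] for s in spans
               if s["trace_id"] == trace_id and s["error"]]
    detail = [entry["message"] for entry in logs
              if entry["trace_id"] == trace_id and entry["service"] in failing]
    return failing, detail


where, why = drill_down(exemplar["trace_id"], spans, logs)
print(where, why)
```

Without the shared ID, the same investigation means guessing at timestamps across three tools.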

SLIs, SLOs, SLAs, and Error Budgets

Observability needs a definition of "healthy," which the SRE vocabulary provides:

Term | Meaning
SLI (indicator) | A measured signal of quality, e.g. % of requests < 200ms, or success rate
SLO (objective) | The target for an SLI, e.g. "99.9% of requests succeed over 30 days"
SLA (agreement) | A contractual promise to customers with penalties (usually looser than the SLO)
Error budget | 1 − SLO: the allowed unreliability (0.1%); spend it on releases, halt when exhausted

The error budget is the clever bit: it turns reliability into a number you can spend. If you're within budget, ship fast; if you've burned it (too many recent failures), freeze risky changes and stabilize. It aligns dev velocity and reliability instead of pitting them against each other.
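
A worked example makes the budget tangible. This sketch uses the 99.9% over 30 days SLO from the table; the request and failure counts are invented for illustration.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for a request-based SLO."""
    allowed = round((1 - slo) * total_requests)
    return allowed, allowed - failed_requests


def allowed_downtime_minutes(slo, window_days=30):
    """The same budget expressed as minutes of full outage in the window."""
    return round((1 - slo) * window_days * 24 * 60, 1)


# Hypothetical month: 10M requests, 4,000 of them failed.
allowed, remaining = error_budget(slo=0.999, total_requests=10_000_000,
                                  failed_requests=4_000)
downtime = allowed_downtime_minutes(0.999)

print(allowed, remaining)  # 10000 6000
print(downtime)            # 43.2
```

With 6,000 failures still unspent the team can keep shipping; a negative remainder would be the signal to freeze risky changes.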

Alerting

The golden rule: alert on symptoms, not causes. Page a human when users are actually affected (error rate up, latency SLO burning), not on every internal blip (one node's CPU, a transient retry) that may self-heal. Cause-based alerts produce alert fatigue — so many pages that real ones get ignored. Tie alerts to SLO burn rate so urgency matches user impact. Google's "four golden signals" (latency, traffic, errors, saturation) are a good default set of what to watch.
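
Burn rate is the observed error ratio divided by the budgeted one (1 − SLO): a value of 1 spends the budget exactly over the SLO window. The sketch below pages only when both a long and a short window burn fast, so an incident that has already recovered does not wake anyone. The 14.4 threshold is one commonly cited value (about 2% of a 30-day budget in one hour) and is used here as an assumption, not a recommendation.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if the burn is both sustained (long) and ongoing (short)."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


SLO = 0.999

ongoing = should_page(long_window_ratio=0.02, short_window_ratio=0.03, slo=SLO)
recovered = should_page(long_window_ratio=0.02, short_window_ratio=0.0005, slo=SLO)
minor = should_page(long_window_ratio=0.002, short_window_ratio=0.002, slo=SLO)

print(ongoing, recovered, minor)  # True False False
```

Slower burns can go to a ticket queue with a lower threshold and longer windows instead of paging.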

OpenTelemetry

Historically each signal and vendor had its own agent and format. OpenTelemetry (OTel) is the now-standard, vendor-neutral framework for instrumenting code and emitting metrics, logs, and traces in a common format, exported to whatever backend you choose (Prometheus, Jaeger, a SaaS). It decouples instrumentation from the backend, so you instrument once and can switch vendors without re-instrumenting — which is why it's become the default.

Pitfalls

takeaway

Observability is how you debug systems too distributed to reason about by hand. Metrics tell you something's wrong and trend it cheaply; traces show where across services; logs explain why — and a shared trace ID links them. Define reliability with SLOs and error budgets, alert on user-facing symptoms to avoid fatigue, and standardize collection on OpenTelemetry while guarding against cardinality and cost.

🎯 interview hot-takes

Monitoring vs observability? Monitoring watches known failure modes (preset dashboards/alerts); observability lets you explore unknown ones after the fact from rich telemetry.
The three pillars? Metrics (cheap aggregatable numbers → alerts/trends), logs (detailed discrete events), traces (one request's path across services). Correlate via a shared trace ID.
What's the cardinality trap? High-cardinality metric labels (user_id, request_id) create huge numbers of time series and blow up cost — keep them in logs/traces instead.
SLI vs SLO vs SLA vs error budget? SLI = measured signal, SLO = internal target, SLA = customer contract, error budget = 1−SLO, the unreliability you're allowed to spend.
What should you alert on? User-facing symptoms (SLO burn, error rate), not every internal cause — to avoid alert fatigue.
