Design a Payment System — System Design

A payment system moves money, and that single fact changes everything: correctness is non-negotiable. A bug in a feed can show a stale post; a bug here can double-charge a customer or lose a transaction, and there's no "eventually it'll be fine." The design is dominated by three ideas — idempotency (a retried request must never charge twice), a double-entry ledger (an immutable, always-balanced record of every movement of money), and reconciliation (continuously verifying your books against the payment provider). It's the most direct application of the correctness ideas in our DDIA notes on transactions and end-to-end correctness.

⚡ Quick Takeaways

Correctness over availability — it's a CP-leaning system; better to reject a payment than to process it twice or lose it.
Idempotency keys are mandatory — the client generates a unique key per payment intent; retries with the same key return the original result, never a second charge.
Double-entry ledger — every transaction writes balanced debit + credit entries to an append-only, immutable ledger; balances are derived, never edited in place.
Integrate a PSP, don't touch cards — route to Stripe/Adyen/etc.; tokenize cards so raw PANs never enter your system (PCI scope).
Async with webhooks + a state machine — payments move pending → authorized → captured → settled; the PSP confirms asynchronously.
Reconcile relentlessly — compare your ledger against PSP/bank statements daily to catch and fix any discrepancy. Trust, but verify.
Use a saga, not 2PC — you can't two-phase-commit across external providers; orchestrate with compensating actions.

tldr

Make every payment idempotent via a client-supplied key + a dedup store, so retries are safe. Record money movements in an immutable double-entry ledger and derive balances from it. Don't store card data — delegate to a PSP and tokenize. Model the payment as a state machine driven by asynchronous webhooks, orchestrate multi-step flows with a saga (compensating actions, not 2PC), and run daily reconciliation against the provider to guarantee your books match reality.

payment flow

 ┌────────┐  pay(idempotency_key)  ┌──────────────┐
 │ Client │──────────────────────▶│   Payment    │
 └────────┘                       │   Service    │
                                  └───┬───┬───────┘
              dedup + ledger write     │   │  charge
            ┌───────────────┐◀─────────┘   ▼
            │  Ledger DB    │          ┌──────────┐  ┌──────────┐
            │ (double-entry)│          │   PSP    │─▶│   Bank   │
            └───────────────┘          │ (Stripe) │  │  network │
            ┌───────────────┐  webhook └──────────┘  └──────────┘
            │ Idempotency   │◀── "captured/failed" (async)
            │   store       │
            └───────────────┘   nightly reconciliation vs PSP statements

Step 1 — Clarify Requirements

Functional: accept a payment (pay-in) from a customer for an order; integrate with a payment service provider (PSP); track each payment's status; support refunds; (optionally) payouts to sellers.

Non-functional (these dominate): correctness/consistency above all (no double charges, no lost or phantom payments), durability, idempotency, auditability (a complete immutable history for compliance), and reasonable availability. We explicitly favor consistency over availability: if in doubt, fail safe and reconcile, rather than risk moving money twice.

Step 2 — Capacity Estimation

Payment volume is modest compared to a feed: say 10M payments/day (10,000,000 / 86,400 s ≈ 116/sec on average, higher at peak sale events). The numbers aren't the challenge; correctness under failure is. Every component must assume the network can drop a request or response at the worst moment (see the trouble with distributed systems), which is precisely why idempotency and reconciliation are first-class, not afterthoughts.
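The estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 10M/day figure comes from the text, while the peak multiplier, the entry size, and the two-entries-per-payment count are illustrative assumptions, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

payments_per_day = 10_000_000             # figure from the text
avg_tps = payments_per_day / SECONDS_PER_DAY

# Assumption for illustration only: peak traffic is 10x the daily average.
PEAK_MULTIPLIER = 10
peak_tps = avg_tps * PEAK_MULTIPLIER

# Assumption: each payment writes 2 ledger entries of roughly 200 bytes each.
ENTRIES_PER_PAYMENT = 2
ENTRY_BYTES = 200
ledger_gb_per_year = (
    payments_per_day * ENTRIES_PER_PAYMENT * ENTRY_BYTES * 365 / 1e9
)

print(f"average: {avg_tps:.0f}/s, assumed peak: {peak_tps:.0f}/s")
print(f"ledger growth: ~{ledger_gb_per_year:.0f} GB/year")
```

Even with generous assumptions, the load fits comfortably on a conventional relational database, which is why the rest of the design spends its effort on failure handling rather than scale.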

Step 3 — API Design

core API (idempotency key required)

POST /payments
   Idempotency-Key: 7f3c-...-a91     # client-generated once per payment intent; reused on every retry
   {order_id, amount, currency, payment_method_token}
      → {payment_id, status}
GET  /payments/{payment_id}          → {status, amount, ...}
POST /payments/{payment_id}/refund   {amount}  Idempotency-Key: ...

The Idempotency-Key header is the linchpin (Step 5), and the card is passed as a token, never a raw number (Step 11).
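On the client side, the important detail is that the key is generated once and reused for every retry of the same payment. The sketch below illustrates that with an in-memory stand-in for the server; `FlakyServer`, `pay_with_retries`, and the `tok_demo` token are hypothetical names made up for this example, and no real HTTP call is made.

```python
import uuid


class FlakyServer:
    """Hypothetical in-memory stand-in for POST /payments."""

    def __init__(self, drop_first_n_responses):
        self.results = {}      # idempotency key -> stored response
        self.charges = 0       # how many times money actually moved
        self.drops_left = drop_first_n_responses

    def post_payment(self, idempotency_key, body):
        if idempotency_key not in self.results:
            self.charges += 1
            self.results[idempotency_key] = {
                "payment_id": f"pay_{self.charges}",
                "status": "pending",
            }
        if self.drops_left > 0:
            # The request was processed, but the response never arrives.
            self.drops_left -= 1
            raise TimeoutError("response lost")
        return self.results[idempotency_key]


def pay_with_retries(server, body, max_attempts=3):
    key = str(uuid.uuid4())    # generated ONCE per payment intent
    for _ in range(max_attempts):
        try:
            return server.post_payment(key, body)   # same key on every retry
        except TimeoutError:
            continue
    raise RuntimeError("outcome unknown; poll GET /payments/{payment_id}")


server = FlakyServer(drop_first_n_responses=2)
response = pay_with_retries(server, {
    "order_id": "o-1", "amount": "50.00",
    "currency": "USD", "payment_method_token": "tok_demo",
})
print(server.charges, response)
```

Two responses are lost and the client sends the request three times, yet the server records a single charge. Generating a fresh key inside the retry loop would defeat the mechanism entirely.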

Step 4 — The Payment Flow and State Machine

A payment is not instantaneous; it moves through states as it travels to the bank and back, mostly asynchronously. The payment service calls the PSP to authorize and capture funds, but final confirmation arrives later via a webhook callback. Modeling the payment as an explicit state machine keeps this manageable:

State	Meaning	Next
pending	Created, not yet sent/confirmed	authorized / failed
authorized	Funds held on the card	captured / voided
captured	Funds taken (charge confirmed)	settled / refunded
settled	Money actually moved to your account	refunded
failed / voided	Declined or cancelled	terminal
refunded	Funds returned to the customer	terminal

Each transition is durably recorded; the webhook from the PSP drives the async transitions (authorized→captured→settled), and the system must handle webhooks arriving late, out of order, or more than once (so webhook handling is itself idempotent).
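One way to sketch this is a transition table plus a handler that ignores duplicates and refuses transitions the table does not allow. The `Payment` class and the event IDs are hypothetical, treating `refunded` as a terminal state is an assumption of this example, and a real system would persist the state and the seen-event set durably instead of holding them in memory.

```python
# Allowed transitions, taken from the state table above.
ALLOWED = {
    "pending":    {"authorized", "failed"},
    "authorized": {"captured", "voided"},
    "captured":   {"settled", "refunded"},
    "settled":    {"refunded"},
    "failed":     set(),
    "voided":     set(),
    "refunded":   set(),   # assumption: treated as terminal here
}


class Payment:
    def __init__(self, payment_id):
        self.payment_id = payment_id
        self.status = "pending"
        self.history = ["pending"]
        self.seen_events = set()   # webhook event IDs already applied

    def apply_webhook(self, event_id, new_status):
        """Return True if the state changed, False if the event was ignored."""
        if event_id in self.seen_events:
            return False           # duplicate delivery: no-op
        if new_status not in ALLOWED[self.status]:
            # Out of order or stale. The event is NOT marked as seen,
            # so a later redelivery can still succeed.
            return False
        self.seen_events.add(event_id)
        self.status = new_status
        self.history.append(new_status)
        return True


p = Payment("pay_1")
print(p.apply_webhook("evt_2", "captured"))    # arrives too early
print(p.apply_webhook("evt_1", "authorized"))
print(p.apply_webhook("evt_1", "authorized"))  # duplicate delivery
print(p.apply_webhook("evt_2", "captured"))    # redelivered, now valid
print(p.history)
```

Rejecting an early event and relying on redelivery is only one option; fetching the payment's current status from the PSP when an unexpected event arrives is another common choice.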

Step 5 — Idempotency: Never Charge Twice

The defining hazard: a client sends "pay $50", the request succeeds at the server, but the response is lost; the client retries; now you've charged $50 twice. The cure is an idempotency key — a unique ID the client generates once per payment intent and sends with every retry. The server records the key with the result of the first successful attempt; any later request with the same key returns the stored result instead of executing again.

idempotent payment handling

def pay(key, order, amount):
    if not reserve(key):                # atomic claim (unique constraint); False if the key exists
        prior = stored_result(key)
        if prior is not None:           # retry of a completed attempt
            return prior                # return original outcome, no re-charge
        # key claimed but unfinished: fall through, the PSP dedups on the same key
    result = psp.charge(amount, idempotency_key=key)  # PSP is idempotent too
    with db.transaction():              # hypothetical helper: commit both writes atomically
        write_ledger(order, amount)     # the money movement ...
        store_result(key, result)       # ... and the recorded result
    return result

Crucially, the PSP itself accepts an idempotency key, so even your retry to the PSP won't double-charge at their end. This is the end-to-end argument in action — dedup is enforced with one identifier threaded through every layer, not patched in any single one.
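A runnable sketch of the same idea, using SQLite's primary-key constraint as the atomic claim. `FakePSP` is a hypothetical stand-in for a provider that honors idempotency keys, amounts are held in integer cents as a simplification, and the ledger write is left out to keep the focus on deduplication.

```python
import sqlite3


class FakePSP:
    """Hypothetical stand-in for a PSP that honors idempotency keys."""

    def __init__(self):
        self.charges = {}   # idempotency key -> charge record

    def charge(self, amount_cents, idempotency_key):
        # The same key returns the original charge, never a second one.
        return self.charges.setdefault(idempotency_key, {
            "psp_ref": f"ch_{len(self.charges) + 1}",
            "amount_cents": amount_cents,
        })


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)")


def pay(db, psp, key, amount_cents):
    try:
        with db:   # the PRIMARY KEY makes this claim atomic
            db.execute(
                "INSERT INTO idempotency_keys (key, response) VALUES (?, NULL)",
                (key,))
    except sqlite3.IntegrityError:
        row = db.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?",
            (key,)).fetchone()
        if row[0] is not None:
            return row[0]   # completed earlier: replay the stored result
        # Claimed but unfinished (crash or in flight): re-driving is safe
        # only because the PSP dedups on the same key.
    result = psp.charge(amount_cents, idempotency_key=key)
    with db:
        db.execute(
            "UPDATE idempotency_keys SET response = ? WHERE key = ?",
            (result["psp_ref"], key))
    return result["psp_ref"]


psp = FakePSP()
first = pay(db, psp, "key-1", 5000)
second = pay(db, psp, "key-1", 5000)   # retry with the same key
print(first, second, len(psp.charges))
```

The insert-then-catch pattern avoids a separate "have I seen this key?" read, so two concurrent requests with the same key cannot both believe they are first.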

Step 6 — The Double-Entry Ledger

Money is tracked with double-entry bookkeeping, the accounting model banks have used for centuries. Every transaction produces at least two entries that sum to zero: a debit from one account and a matching credit to another. The ledger is append-only and immutable — you never edit or delete an entry; a correction is a new compensating entry. Balances are derived by summing entries, never stored as a mutable number you overwrite.

double-entry: a $50 payment

txn 9001  customer_cash      −50.00   (debit)
txn 9001  merchant_payable   +50.00   (credit)
                             ───────
                              0.00     ← every txn must balance to zero

refund later = a NEW balancing txn, never edit txn 9001
balance(account) = SUM(entries)  — derived, immutable history preserved

This model makes the system auditable and self-checking: because every transaction balances to zero, money can't be silently created or destroyed, and the full history is reconstructable for compliance and dispute resolution.
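The invariants are small enough to sketch directly. The `Ledger` class below is a hypothetical in-memory illustration, not a production design: it rejects any transaction that does not sum to zero, only ever appends, and computes balances by summing entries. Account names follow the example above; the $20 refund is made up for the demonstration.

```python
from decimal import Decimal


class Ledger:
    def __init__(self):
        self._entries = []   # append-only: (txn_id, account, amount)

    def post(self, txn_id, entries):
        """Append one transaction given as a list of (account, Decimal)."""
        if len(entries) < 2:
            raise ValueError("a transaction needs at least two entries")
        if sum(amount for _, amount in entries) != 0:
            raise ValueError(f"txn {txn_id} does not balance to zero")
        self._entries.extend((txn_id, acct, amt) for acct, amt in entries)

    def balance(self, account):
        # Derived on demand; there is no stored balance to overwrite.
        return sum(
            (amt for _, acct, amt in self._entries if acct == account),
            Decimal("0"))


ledger = Ledger()
ledger.post(9001, [("customer_cash", Decimal("-50.00")),
                   ("merchant_payable", Decimal("50.00"))])
# A partial refund is a NEW balancing transaction; txn 9001 is untouched.
ledger.post(9002, [("merchant_payable", Decimal("-20.00")),
                   ("customer_cash", Decimal("20.00"))])

print(ledger.balance("customer_cash"), ledger.balance("merchant_payable"))
```

`Decimal` is used rather than `float` so that amounts such as 19.99 are represented exactly; storing integer minor units (cents) is an equally common choice.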

Step 7 — Data Model

schema

payments         (payment_id, order_id, amount, currency, status,
                  psp_ref, created_at, updated_at)
ledger_entries   (entry_id, txn_id, account, amount, ts)   # append-only
idempotency_keys (key PK, payment_id, response, created_at) # dedup

A relational database with ACID transactions is the right default here (see transactions): writing the ledger entries, updating the payment row, and recording the idempotency result should happen in one atomic transaction so they can't partially apply.
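To make "one atomic transaction" concrete, here is a minimal sketch using SQLite from the Python standard library. The column types, the integer `amount_cents` column, and the `record_capture` helper are assumptions for illustration; timestamps from the schema above are omitted for brevity.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE payments (
    payment_id TEXT PRIMARY KEY, order_id TEXT, amount_cents INTEGER,
    currency TEXT, status TEXT, psp_ref TEXT);
CREATE TABLE ledger_entries (
    entry_id INTEGER PRIMARY KEY AUTOINCREMENT, txn_id TEXT,
    account TEXT, amount_cents INTEGER);
CREATE TABLE idempotency_keys (
    key TEXT PRIMARY KEY, payment_id TEXT, response TEXT);
""")


def record_capture(db, key, payment_id, amount_cents, psp_ref):
    """Apply all three writes in one transaction, or none of them."""
    with db:   # commits on success, rolls back if any statement raises
        db.execute(
            "UPDATE payments SET status = 'captured', psp_ref = ? "
            "WHERE payment_id = ?", (psp_ref, payment_id))
        db.executemany(
            "INSERT INTO ledger_entries (txn_id, account, amount_cents) "
            "VALUES (?, ?, ?)",
            [(payment_id, "customer_cash", -amount_cents),
             (payment_id, "merchant_payable", amount_cents)])
        db.execute(
            "INSERT INTO idempotency_keys (key, payment_id, response) "
            "VALUES (?, ?, 'captured')", (key, payment_id))


with db:
    db.execute(
        "INSERT INTO payments VALUES ('pay_1', 'o-1', 5000, 'USD', 'pending', NULL)")

record_capture(db, "key-1", "pay_1", 5000, "ch_1")
try:
    record_capture(db, "key-1", "pay_1", 5000, "ch_1")   # duplicate key
except sqlite3.IntegrityError:
    pass   # whole transaction rolled back, including the two ledger rows

entry_count, ledger_sum = db.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM ledger_entries").fetchone()
print(entry_count, ledger_sum)
```

The second call fails on the duplicate idempotency key only after it has already inserted two more ledger rows, and the rollback removes them. That is the partial-application case the single transaction exists to prevent.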

Step 8 — Exactly-Once Effect

True exactly-once delivery is impossible over an unreliable network, so the goal is exactly-once effect: a payment changes money exactly once no matter how many times the request, the webhook, or the PSP call is retried. The combination that achieves it: (1) the idempotency key dedups duplicate client requests; (2) the PSP's own idempotency key dedups duplicate charge calls; (3) webhook handlers are idempotent (processing the same "captured" event twice is a no-op); and (4) the atomic ledger write ties the money movement to the dedup record so they commit together or not at all.

Step 9 — Handling Failures: Sagas, Not 2PC

A payment often spans multiple internal services and an external PSP (and you can't enroll Stripe in your distributed transaction). Two-phase commit doesn't work across external systems and blocks under failure (see consistency & consensus). Instead use a saga: a sequence of local steps, each with a compensating action to undo it. If a later step fails, you run the compensations for the earlier ones (e.g. void an authorization, reverse a ledger entry) to return to a consistent state. Steps are queued and retried with exponential backoff; persistently failing events go to a dead-letter queue for investigation rather than being silently dropped.
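The core of an orchestrated saga fits in a few lines: run steps in order, remember each completed step's compensation, and on failure run the compensations in reverse. `run_saga`, `SagaFailed`, and the step names are hypothetical; queuing, retries with backoff, and the dead-letter queue described above are deliberately left out of this sketch.

```python
class SagaFailed(Exception):
    pass


def run_saga(steps):
    """steps: list of (name, action, compensation) callables."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception as exc:
            # Undo the steps that succeeded, most recent first.
            for _, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc


log = []


def fail_capture():
    raise RuntimeError("PSP declined capture")


steps = [
    ("authorize",
     lambda: log.append("authorize"),
     lambda: log.append("void authorization")),
    ("write_ledger",
     lambda: log.append("write ledger entry"),
     lambda: log.append("reverse ledger entry")),
    ("capture",
     fail_capture,
     lambda: log.append("refund capture")),
]

try:
    run_saga(steps)
    outcome = "ok"
except SagaFailed as exc:
    outcome = str(exc)

print(outcome)
print(log)
```

The failed step's own compensation is not run, because that step never took effect. In practice, compensations can fail too, so they need the same idempotency and retry treatment as the forward steps.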

Step 10 — Reconciliation

No matter how careful the code, you must verify your books against reality — disks corrupt data, webhooks get missed, and bugs slip through. Reconciliation is a scheduled job (typically nightly) that compares your internal ledger against the settlement reports/statements from the PSP and the bank, line by line, and flags any discrepancy: a payment the PSP recorded that you didn't, an amount mismatch, a missing settlement. Discrepancies are escalated and corrected with compensating ledger entries. This embodies the "trust, but verify" principle — the system is designed to detect and recover from inconsistency, not to assume it never happens.
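At its simplest, reconciliation is a keyed comparison of two record sets. The sketch below assumes both sides can be reduced to a mapping from PSP reference to amount; the `reconcile` function, the discrepancy labels, and the sample data are hypothetical, and real settlement reports need parsing, currency handling, and fee adjustments that are not shown.

```python
from decimal import Decimal


def reconcile(ledger_rows, psp_rows):
    """Compare two {psp_ref: Decimal amount} mappings; return discrepancies."""
    issues = []
    for ref in sorted(ledger_rows.keys() | psp_rows.keys()):
        ours, theirs = ledger_rows.get(ref), psp_rows.get(ref)
        if ours is None:
            issues.append((ref, "missing_in_ledger", None, theirs))
        elif theirs is None:
            issues.append((ref, "missing_in_psp", ours, None))
        elif ours != theirs:
            issues.append((ref, "amount_mismatch", ours, theirs))
    return issues


ledger_rows = {
    "ch_1": Decimal("50.00"),
    "ch_2": Decimal("19.99"),
    "ch_4": Decimal("5.00"),
}
psp_rows = {
    "ch_1": Decimal("50.00"),
    "ch_2": Decimal("19.90"),   # amount differs
    "ch_3": Decimal("7.00"),    # PSP recorded it, we did not
}

issues = reconcile(ledger_rows, psp_rows)
for issue in issues:
    print(issue)
```

Each flagged item would then be investigated and, where the ledger is wrong, corrected with a compensating entry rather than an edit.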

Step 11 — Security and Compliance

Handling card data directly drags you into the full scope of PCI DSS, an expensive compliance burden. The standard move is to never let raw card numbers (PANs) touch your servers: the client sends card details straight to the PSP (or the PSP's hosted field/SDK), which returns a token; your system stores and charges only the token. Beyond that: encrypt sensitive data at rest and in transit, strictly control access, maintain audit logs (which the immutable ledger already provides), and run fraud detection on the payment stream.

Step 12 — Key Tradeoffs

Consistency over availability. A payment system fails safe — better to reject and retry than to risk a double charge. Reconciliation is the safety net that lets you be conservative.
Synchronous vs asynchronous. Authorization may be synchronous for instant UX, but capture/settlement are asynchronous via webhooks; the state machine absorbs the delay.
Build vs buy. Integrating a PSP (and tokenization) slashes PCI scope and risk versus building card processing yourself — almost always the right call.
Saga vs 2PC. Sagas tolerate external systems and partial failure at the cost of writing compensating logic; 2PC is simpler conceptually but unusable across a third-party PSP.

takeaway

A payment system is a correctness machine. Three mechanisms carry the design: idempotency keys threaded end-to-end so retries never double-charge, an immutable double-entry ledger so money is never silently created or lost and everything is auditable, and continuous reconciliation against the provider so any discrepancy is caught and corrected. Favor consistency, delegate card handling to a PSP, and orchestrate the multi-party flow with sagas rather than distributed transactions.

🎯 interview hot-takes

How do you prevent double charges? A client-generated idempotency key recorded server-side (and passed to the PSP); any retry with the same key returns the original result instead of charging again.
Why a double-entry ledger? Every transaction writes balanced debit/credit entries to an immutable, append-only log; balances are derived, so money can't be silently created/lost and everything is auditable.
Why reconciliation? Code and networks fail; comparing your ledger to PSP/bank statements daily detects and fixes discrepancies — trust, but verify.
Why a saga instead of 2PC? You can't two-phase-commit across an external PSP; a saga uses local steps with compensating actions to stay consistent under failure.
How do you avoid PCI scope? Tokenize — card details go straight to the PSP, which returns a token; raw PANs never touch your servers.