A URL shortener is a deceptively simple service: take a long URL, return a short one, and redirect the short one back to the original. At a surface level it looks like a two-row database with a lookup. But the design challenge compounds quickly at scale: you must generate short IDs that are guaranteed unique across a distributed fleet without a global lock, keep redirect latency near zero for every link click everywhere in the world, handle a read:write ratio of 100:1 without a read bottleneck, support custom aliases without collision with the auto-generated namespace, purge expired links without impacting hot-path performance, and record click analytics while keeping the redirect path as fast as a CDN edge response. Each of these is a non-trivial system design decision, and their interactions create subtle correctness and performance traps.

⚡ Quick Takeaways
  • 6-char base-62 slug — 62⁶ > 56 billion unique slugs; sufficient for years of URL volume at this scale.
  • ID generation beats hashing — Snowflake or range assignment avoids collisions at the source; MD5/SHA-1 prefix collisions require detection + retry loops.
  • Bloom filter for collision check — Redis Bloom filter: negative = definitely new (safe to write); positive = maybe exists (fall back to DB lookup). Keeps the fast path cheap.
  • Redis caching on reads — absorbs the 100:1 read-to-write ratio; most redirects never hit the database.
  • 301 vs. 302 is a product decision — 301 caches in browser (fewer hits, no analytics); 302 hits shortener every time (full analytics, higher load).
  • Custom aliases need a separate namespace — user-supplied slugs must be checked against both the auto-generated namespace and the custom alias table to prevent collisions.
  • Click analytics via async pipeline — log click events to Kafka on every redirect; analytics consumers aggregate asynchronously without adding latency to the hot redirect path.
tldr

6-character base-62 slugs give sufficient namespace for a year's worth of URLs. Use ID generation (Snowflake or range assignment) over hashing to avoid collisions. Add a Redis Bloom filter to check for existing slugs before writing. Cache hot redirects in Redis. Purge expired URLs via TTL, scheduled scans, or lazy evaluation. Rate-limit write endpoints to prevent DDoS. Stream click events to Kafka for async analytics.

High-level URL shortener architecture
High-level URL shortener architecture

Step 1 — Clarify Requirements

Before designing anything, scope the problem explicitly. URL shortener is a common interview prompt because it seems trivial but reveals depth quickly. Ask the interviewer about these dimensions:

Functional requirements

Non-functional requirements

interview tip

Explicitly ask whether click analytics are required — the answer drives your redirect strategy. If yes, you likely want 302 redirects (no browser caching) and an analytics pipeline. If no, 301 + aggressive CDN caching is simpler and cheaper. State your choice and the reasoning.

Step 2 — Capacity Estimation

Back-of-the-envelope sizing anchors storage and infrastructure decisions. Assume a mid-size deployment with 10M DAU.

Traffic

Storage

Slug namespace sizing

note

The storage math shows why URL shortener is an often-misunderstood "hard" problem: the data volume is tiny. The real challenges are uniqueness guarantees at distributed write scale, redirect latency at read scale, and correctness of the custom alias namespace — not raw storage.

Step 3 — API Design

Clean, minimal REST surface. The entire service is essentially two endpoints plus a management layer:

HTTP
# Write: create a short URL
POST /api/shorten
     Authorization: Bearer <token>
     { "long_url": "https://example.com/very/long/path?q=1",
       "custom_alias": "my-brand",   // optional
       "expires_at": "2025-12-31" }  // optional201 { "short_url": "https://tiny.url/aB3kPq",
             "slug": "aB3kPq", "expires_at": "2025-12-31" }

# Read: redirect
GET /{slug}302 Location: https://example.com/very/long/path?q=1
     → 404 if slug not found or expired

# Analytics: click stats for a link
GET /api/links/{slug}/analytics
     Authorization: Bearer <token>  // must own the link
     → { "total_clicks": 14827, "unique_visitors": 9301,
         "clicks_by_country": {...}, "clicks_by_day": [...] }

# Management: delete a link
DELETE /api/links/{slug}
       Authorization: Bearer <token>

The POST /api/shorten endpoint is idempotent for the same (long_url, custom_alias) pair — submitting the same request twice returns the existing slug rather than creating a duplicate. Rate-limit this endpoint per user token: free tier might allow 10 creates/min, paid tier 100/min. The GET /{slug} redirect endpoint is on the hot path; it must be as fast as possible and is never rate-limited for readers (only the write side needs protection).

Step 4 — Short ID Generation: Hashing vs. ID Generation

The core question is how to generate the 6-character slug. There are four main approaches, each with distinct tradeoffs:

ApproachMechanismCollision RiskSortable
MD5 / SHA-1 hashHash the long URL, take first 6 chars of base-62 encodingHigh — birthday paradox bites at 56B space with short prefixNo
UUID (random)Generate random 128-bit UUID, encode first 6 chars in base-62Low but nonzero — requires collision detectionNo
Snowflake ID41-bit timestamp + 10-bit machine ID + 12-bit sequence; encode in base-62Zero if machine IDs uniqueYes (time-ordered)
Range assignmentCentral controller assigns numeric ranges to worker nodes; workers use next value in rangeZeroYes

Why hashing is problematic

Hashing the long URL and taking the first 6 characters seems elegant — same URL always produces the same slug (deterministic), no coordination needed. The problem is the birthday paradox: in a 56-billion-slot space, after inserting roughly √(2 × 56B) ≈ 334,000 URLs, the probability of a collision exceeds 50%. At 1M URLs/day, you hit this in under a day. Every collision requires detection (DB lookup or Bloom filter check) and a retry with a salt — now you have a retry loop on the write path with indeterminate latency.

Snowflake ID generation

Twitter's Snowflake format generates 64-bit IDs from three components packed into a single integer: a 41-bit millisecond timestamp (giving ~69 years of headroom from epoch), a 10-bit machine ID (supporting 1,024 unique generator nodes), and a 12-bit sequence counter (allowing 4,096 IDs per millisecond per machine). Converting this 64-bit integer to base-62 gives at most 11 characters — for a 6-char slug, take the lower 36 bits (base-62 encoding of numbers up to 62⁶ ≈ 56B requires ~36 bits).

Python
import time

EPOCH = 1700000000000  # custom epoch ms
MACHINE_ID = 1          # assigned at deploy time (0–1023)
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
_seq = 0

def generate_snowflake():
    global _seq
    ts = int(time.time() * 1000) - EPOCH
    _seq = (_seq + 1) & 0xFFF   # 12-bit rollover
    return (ts << 22) | (MACHINE_ID << 12) | _seq

def to_base62(n: int, length=6) -> str:
    chars = []
    while n > 0:
        chars.append(BASE62[n % 62])
        n //= 62
    return "".join(reversed(chars)).zfill(length)[-length:]

def new_slug() -> str:
    return to_base62(generate_snowflake())

Range assignment (counter-based)

An alternative to Snowflake: a central coordinator (e.g., a ZooKeeper node or a dedicated counter service backed by a SQL AUTO_INCREMENT) assigns numeric ranges to worker nodes. Worker 1 gets 0–999,999; worker 2 gets 1,000,000–1,999,999, etc. Each worker converts its next counter value to base-62 and uses it as the slug. No coordination is needed within a range — the worker increments a local counter. When a range is exhausted, the worker requests a new range from the coordinator. This approach is simpler than Snowflake and collision-free but creates a single-point coordinator that must be highly available.

scalability note

For most interview scenarios, Snowflake is the preferred answer: it is coordination-free at the ID generation level (each node generates IDs independently given a pre-assigned machine ID), time-ordered (useful for debugging and analytics), and collision-free by construction. Range assignment is a valid alternative — prefer it if the interviewer asks about simpler setups.

URL shortener system flow diagram
URL shortener system flow diagram

Step 5 — Bloom Filter for Collision Detection

Even with collision-free ID generation, you still need collision detection for the hash-based approach and for custom aliases (which are user-supplied strings that could collide with auto-generated slugs or with each other). The canonical tool is a Redis Bloom filter.

A Bloom filter is a probabilistic data structure that answers "is this element definitely not in the set?" or "is this element possibly in the set?". It never produces false negatives but may produce false positives. The operational rule:

This keeps the write fast path — the overwhelmingly common case (new slug) — an O(1) in-memory Redis check rather than a DB round-trip. The false-positive rate (controlled by the filter's bit array size and hash count) is typically configured to 1% or lower, meaning only 1 in 100 writes incurs an unnecessary DB lookup.

scalability note

A single Redis Bloom filter becomes a bottleneck at high scale. Distributing Bloom filter checks across multiple Redis instances reduces the bottleneck but sacrifices a single source of truth — a known tradeoff without a clean solution at extreme scale. An alternative is to skip the Bloom filter entirely and use Snowflake IDs (which are collision-free by construction), checking only custom aliases in the DB.

Step 6 — Data Model

The URL store is simple but the schema details matter for correctness and query performance:

SQL
CREATE TABLE urls (
  slug          VARCHAR(8)  PRIMARY KEY,    -- short code (6–8 chars)
  long_url      TEXT        NOT NULL,
  user_id       BIGINT,                     -- NULL if anonymous
  is_custom     BOOLEAN     DEFAULT FALSE, -- user-supplied alias
  created_at    TIMESTAMP   NOT NULL,
  expires_at    TIMESTAMP,                  -- NULL = never expires
  click_count   BIGINT      DEFAULT 0      -- approximate, incremented async
);

CREATE INDEX idx_user_id ON urls(user_id);   -- list user's links
CREATE INDEX idx_expires ON urls(expires_at) -- purge expired rows
  WHERE expires_at IS NOT NULL;

-- Analytics events (high-volume, written to Kafka → OLAP store)
-- NOT stored in the hot SQL DB
-- Schema of Kafka message:
-- { slug, timestamp_ms, ip, user_agent, referrer, country, device_type }

Key schema decisions: the slug is the primary key — every redirect is a primary-key lookup, which is O(log n) on a B-tree index and extremely fast. The expires_at partial index (filtered to non-NULL rows) keeps the purge scan cheap. The click_count column is an approximate counter — it is incremented asynchronously by the analytics consumer, not on the redirect hot path, so it may lag by minutes but never blocks a redirect.

Step 7 — Core Architecture and Read/Write Paths

flow
-- Write path (shorten a URL)
POST /api/shorten { long_url, custom_alias?, expires_at? }
  → Rate limiter (token bucket per user_id or IP)
  → If custom_alias provided:
      check alias not already taken (DB lookup)
      slug = custom_alias
    Else:
      slug = to_base62(snowflake_id())
  → Check Redis Bloom filter (for hash approach only)
      negative → safe to write
      positive  → DB lookup to confirm; retry on collision
  → Write { slug, long_url, user_id, expires_at } to primary DB
  → Add slug to Bloom filter (if using)
  → Add slug → long_url to Redis cache (TTL = min(expires_at, 24h))
  → Return { short_url: "https://tiny.url/{slug}" }

-- Read path (redirect)
GET /{slug}
  → Check Redis cache
      hit  → check expires_at → 302/404 redirect
      miss → DB lookup by PRIMARY KEY
               found  → cache → 302 redirect
               not found → 404
  → Fire-and-forget: publish click event to Kafka
      { slug, ts, ip, user_agent, referrer, country }
Read and write path flow for URL redirect
Read and write path flow for URL redirect

Redis caching in depth

With a 100:1 read-to-write ratio, caching is the single highest-leverage optimization. The cache pattern is look-aside with write-through: on a write, populate the cache immediately so the first redirect is already a cache hit; on a read miss, pull from the DB and populate. Key caching decisions:

Step 8 — Redirect: 301 vs. 302

The HTTP redirect response code is a product decision with significant system-design implications:

CodeMeaningBrowser behaviorAnalytics impactServer load
301Permanent RedirectCaches in browser; subsequent clicks skip the shortenerOnly first click is recordedVery low — browser handles repeat visits
302Temporary RedirectDoes not cache; every click hits the shortenerEvery click is recordedHigher — all clicks go through service

If click analytics are a product requirement, you must use 302 — a 301 means the shortener never sees repeat clicks from the same browser after the first. If analytics are not required and you want maximum performance at minimum cost, 301 + CDN caching is the right choice: CDN edge nodes cache the 301 and serve repeat redirects at edge latency without touching your origin servers.

hybrid approach

Some services use 302 for the first N clicks to gather analytics, then switch to 301 after a threshold. This is complex to implement correctly (the switch must be coordinated with cache invalidation) but is occasionally asked about in advanced interviews. Simpler: use 302 always but add a CDN layer that caches the redirect for a short TTL (e.g., 60 s) to reduce origin load while still capturing most clicks.

Step 9 — Custom Aliases

Custom aliases (tiny.url/my-brand) are user-supplied strings that must not collide with either auto-generated slugs or with other users' custom aliases. The namespace management rules:

Step 10 — Click Analytics Pipeline

Click analytics must not add latency to the redirect hot path. The solution is a fire-and-forget async pipeline:

  1. On every redirect, the redirect service publishes a click event to Kafka asynchronously (non-blocking). The redirect itself returns immediately.
  2. Kafka consumers (analytics workers) consume the event stream and write to a time-series store (ClickHouse, BigQuery, or Redshift) optimized for aggregation queries.
  3. A separate aggregation job (Spark batch or Flink streaming) pre-computes per-link rollups: total clicks, unique IPs, clicks by country/device/referrer, clicks by hour/day.
  4. The analytics API (GET /api/links/{slug}/analytics) reads from the pre-computed rollup store, not from raw events — O(1) lookup rather than a full scan of billions of click records.
flow
Redirect service
  └── fire-and-forget → Kafka topic: "click-events"
        { "slug": "aB3kPq", "ts": 1716556800000, "ip": "1.2.3.4",
          "country": "US", "device": "mobile", "referrer": "twitter.com" }

Analytics consumer (Flink streaming)
  └── reads Kafka → counts per (slug, hour, country, device)
  └── writes rollups to ClickHouse:
        -- clicks_hourly(slug, hour, country, device, click_count)

Analytics API
  └── SELECT SUM(click_count) FROM clicks_hourly WHERE slug = 'aB3kPq'
      → returns in <10 ms (indexed rollup table)

Step 11 — URL Expiration and Purging

Expired URLs must be cleaned up to reclaim storage and return correct 404 responses. Three strategies, each with different tradeoffs:

StrategyHow it worksProsCons
Redis TTLSet TTL on the cache entry equal to the link's expiry; cache evicts automaticallyZero DB overhead; works for cache hitsExpired rows accumulate in DB; DB still shows record as existing
Lazy evaluationOn every redirect, check expires_at < NOW(); return 404 and optionally delete the DB rowNo batch job needed; self-cleaningExpired rows accumulate until accessed; delete on the hot path adds latency
Scheduled scanAirflow job runs nightly: DELETE FROM urls WHERE expires_at < NOW() using the partial indexDB stays clean; predictable loadExpired links are "valid" for up to 24 h in DB; batch delete can be slow on large tables

The recommended approach combines all three: Redis TTL handles the cache layer correctly for most traffic; lazy expiry check in the redirect handler catches DB hits on expired-but-cached-elsewhere links; nightly scan keeps the DB table clean and the partial index small. The lazy delete should be done asynchronously (e.g., enqueue to a deletion queue) to keep the redirect response fast.

Step 12 — Scaling and Fault Tolerance

At high scale, the URL shortener must remain available and fast despite hardware failures, traffic spikes, and regional outages.

Read scaling

The redirect service is stateless — it reads from Redis and occasionally from the DB. Scale horizontally behind a load balancer. With a hot Redis cache absorbing 95%+ of reads, the DB is under minimal load. Add read replicas for the DB to handle the remaining cache misses. Deploy the redirect service in multiple regions with a CDN or GeoDNS routing users to their nearest instance.

Write scaling

At ~30 writes/sec, a single SQL primary easily handles URL creation. If write scale becomes a concern (say, 10,000 writes/sec for a platform that also offers programmatic link creation via API), shard the DB by slug prefix: slugs starting with 0-9 go to shard A, a-m to shard B, n-z to shard C. Since each redirect is a primary-key lookup on the slug, routing to the correct shard is a simple character map — no cross-shard joins needed.

Multi-region consistency

A globally distributed URL shortener must handle the case where a link created in US-East is immediately clicked by a user in Asia-Pacific. Options:

Fault tolerance

Step 13 — Rate Limiting and Security

The write endpoint must be protected with a rate limiter to prevent DDoS attacks that generate millions of slugs, exhausting both storage and the slug namespace. Security layers:

Step 14 — Key Tradeoffs

DecisionChoiceTrade-off accepted
Slug generationSnowflake (ID-based)Requires pre-assigned machine IDs; slight time-ordering reveals creation volume
Redirect code302 (temporary)Higher server load vs. 301; required if click analytics are a product feature
Collision detectionBloom filter in RedisFalse positives require DB fallback; filter itself needs to be distributed at extreme scale
Analytics pipelineAsync (Kafka → OLAP)Clicks are counted with a brief lag; redirect latency is not impacted
Expiry enforcementLazy + nightly scanExpired links accessible for up to the cache TTL after expiry (seconds); DB cleaned within 24 h
StorageSQL (single table)Trivially fits the data volume; no need for the complexity of distributed NoSQL
takeaway

URL shortener design is a study in making boring things scale: Snowflake ID generation beats hashing (avoid collisions at the source), Bloom filters make collision checks cheap on the write path, Redis caching absorbs the 100:1 read amplification, and slug length arithmetic confirms 6 characters is sufficient for decades. The subtle choices — 301 vs. 302, which purge strategy, whether to use a Bloom filter at all with Snowflake IDs — are driven by product requirements, not technical constraints. Know the tradeoffs, state your assumptions, and justify each choice.

🎯 interview hot-takes

Why prefer ID generation over hashing for slug generation? Hash functions (MD5, SHA-1) can produce the same 6-character prefix for different URLs due to the birthday paradox — collision detection and retry add complexity and indeterminate latency. Snowflake or range-based ID generation eliminates collisions at the source.
How does the Bloom filter speed up collision detection? A Bloom filter negative (bit not set) guarantees the slug is new — no DB lookup needed. Only positives (rare) require a DB fallback. This keeps the write fast path O(1) in memory.
301 or 302 redirect? 301 (permanent) is browser-cached — lower server load but click analytics are lost after the first visit. 302 (temporary) hits the shortener on every click — full analytics at higher cost. Choose based on whether analytics are a product requirement.
How do you handle a viral link that gets 10M clicks in one hour? Hot key in Redis — cache it across multiple shards or add an in-process LRU cache in each redirect service instance (1–5 s TTL) to absorb the spike before it hits Redis.
How do you prevent the analytics pipeline from slowing down redirects? Fire-and-forget publish to Kafka — the redirect returns the 302 immediately and does not wait for the Kafka write to confirm. Loss of a few click events during a Kafka outage is acceptable; redirect availability is not.

← previous
Design a Ride-share App