Elasticsearch — Search & Analytics Engine

When a product needs real search — typo-tolerant, ranked-by-relevance, fast over millions of documents — a relational WHERE text LIKE '%query%' falls apart, because it scans every row and can't rank results. Elasticsearch (built on Apache Lucene) solves this with a fundamentally different data structure, the inverted index, and a distributed engine on top for scale. It powers full-text search, log analytics (the "E" in the ELK stack), and observability backends. Understanding it is mostly understanding the inverted index and relevance scoring.

⚡ Quick Takeaways

The inverted index maps each term → the list of documents containing it, so a search is a fast lookup, not a full scan.
Analyzers turn text into tokens (lowercasing, stemming, stop-word removal) at index and query time — that's why "Running" matches "run".
Relevance, not just matching — results are scored (BM25) and ranked, so the best matches come first.
Documents → indices → shards → replicas — JSON docs in an index, split into shards spread across nodes, each shard replicated.
Near-real-time, not instant — newly indexed docs become searchable after a refresh (default ~1s).
Filters vs queries — filters are yes/no and cacheable (fast); queries compute a relevance score.
Not your primary store — it's a search index alongside your source-of-truth database, not a replacement.

tldr

Elasticsearch flips the data model: instead of "for each doc, scan its text," it builds an inverted index ("for each term, which docs?"), making full-text search a lookup. Analyzers tokenize and normalize text so searches are flexible; BM25 ranks results by relevance. Data is sharded and replicated across nodes, and a query scatters to all shards and gathers the top results. It's near-real-time, distinguishes cacheable filters from scored queries, and complements — never replaces — your primary database.

The Inverted Index

A row store (or a brute-force scan) is organized by document: given a row, read its text. Full-text search needs the opposite: given a word, find all documents that contain it. The inverted index does exactly that. It is a map from each distinct term to a posting list of the documents (and positions) where that term appears.

inverted index: term → docs

docs:  1="the quick fox"   2="quick brown dog"   3="lazy fox"

inverted index:
   "quick" → [1, 2]
   "fox"   → [1, 3]
   "brown" → [2]
   "lazy"  → [3]

search "quick fox" → intersect [1,2] ∩ [1,3] = [1] → doc 1 ranks highest (both terms)
                     docs 2 and 3 match one term each

A search for "quick fox" looks up each term's posting list and combines them. The cost depends on the length of those posting lists rather than on the total volume of text, because no document bodies are scanned. This is the same idea behind the typeahead trie, generalized to arbitrary words anywhere in a document.
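The lookup described above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores postings: the function names `build_inverted_index` and `search` are made up for illustration, tokenization is a plain whitespace split, and ranking simply counts matched terms.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return doc IDs ordered by how many query terms they contain."""
    hits = defaultdict(int)
    for term in query.lower().split():
        # One dictionary lookup per term; no document text is scanned.
        for doc_id in index.get(term, []):
            hits[doc_id] += 1
    # Most matched terms first; ties broken by doc ID for stable output.
    return sorted(hits, key=lambda d: (-hits[d], d))

docs = {1: "the quick fox", 2: "quick brown dog", 3: "lazy fox"}
index = build_inverted_index(docs)
print(index["quick"])              # [1, 2]
print(search(index, "quick fox"))  # [1, 2, 3]
```

Doc 1 comes first because it matches both terms; a real engine would replace the match count with a relevance score.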

Analysis: Tokenization and Analyzers

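For search to feel smart, raw text must be normalized before it's indexed. An analyzer runs a pipeline: a tokenizer splits text into terms, then token filters transform them by lowercasing ("Fox" → "fox"), stemming ("running"/"runs" → "run"), removing stop words ("the", "a"), and more. Crucially, by default the same analyzer runs at both index time and query time, so the query is normalized to match the indexed terms. That's why searching "Running" finds a document containing "runs": both reduce to the stem "run". Irregular forms such as "ran" are not caught by algorithmic stemmers; matching those typically takes a dictionary-based stemmer or explicit synonyms. Note that the built-in standard analyzer only tokenizes and lowercases, so stemming and stop-word removal apply only when you configure an analyzer that includes them. Choosing analyzers (per language, with synonyms, etc.) is most of the art of good search quality.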
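To make the pipeline concrete, here is a minimal sketch of an analyzer in Python. Everything in it is an assumption for illustration: the stop-word set is a tiny hand-picked list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, so it will mangle many words that a production stemmer handles correctly.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}   # illustrative, not Lucene's list

def naive_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ning", "ing", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(analyze("The Running of the dogs"))       # ['run', 'dog']
print(analyze("Running") == analyze("runs"))    # True
```

Because the same `analyze` function is applied to documents and to queries, the two sides meet on the same normalized terms.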

Relevance Scoring

Search isn't just "which docs match" but "which match best." Elasticsearch scores each matching document and returns them ranked. The default algorithm, BM25 (a refinement of TF-IDF), rewards documents where the query terms appear often (term frequency, with diminishing returns for each extra occurrence), discounts terms that are common across the whole corpus (inverse document frequency), and normalizes for document length. The result: a rare, specific word the user searched for counts for more than a common one, and a short doc packed with the term outranks a long one that mentions it once. Relevance ranking is the feature that distinguishes a search engine from a database filter.
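The three ingredients (term frequency, inverse document frequency, length normalization) can be seen in a small sketch of the textbook BM25 formula. The parameter values k1 = 1.2 and b = 0.75 are the commonly cited defaults; the function name `bm25_scores` and the pre-tokenized input are assumptions of this example, and Lucene's actual implementation differs in details.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every tokenized doc against the query with textbook BM25.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms get a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average docs are penalized via `norm`.
            norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "the quick fox".split(),
    "quick brown dog".split(),
    "lazy fox".split(),
]
scores = bm25_scores(docs, ["quick", "fox"])
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)   # [0, 2, 1]
```

The first doc wins because it contains both terms. The other two each match one equally rare term, and the shorter "lazy fox" edges out "quick brown dog" through length normalization.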

Documents, Indices, Shards, and Replicas

The data model is layered. A document is a JSON object. An index is a collection of similar documents (like a table). Because an index can be huge, it's split into shards (each shard is a self-contained Lucene index holding a subset of the documents), and shards are distributed across nodes. Each primary shard can have one or more replica copies on other nodes (one by default) for fault tolerance and read scaling.

scatter-gather across shards

index "logs"  →  shard 0 | shard 1 | shard 2   (each on a node, + replicas)

query → coordinating node
        ├─ scatter to shard 0, 1, 2  (search in parallel)
        ├─ each returns its local top-k
        └─ gather + merge + re-rank → global top-k → client

A search is scatter-gather: a coordinating node sends the query to one copy (primary or replica) of every shard, each finds its local top results, and the coordinator merges them into the global ranking. Note a key trade-off: the shard count is largely fixed at index creation (changing it means writing a new index, whether by reindexing or through the split and shrink APIs), so you size shards up front based on expected data volume.
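The merge step can be illustrated with a short simulation. This is a sketch under simplifying assumptions: each shard is just a list of (score, doc_id) pairs, the shards are visited in a loop rather than in parallel, and the names `shard_top_k` and `scatter_gather` are invented for the example.

```python
import heapq

def shard_top_k(shard_hits, k):
    """Local phase: one shard returns its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, shard_hits)

def scatter_gather(shards, k):
    """Coordinator: collect each shard's local top-k, then merge globally."""
    candidates = []
    for shard_hits in shards:          # a real cluster does this in parallel
        candidates.extend(shard_top_k(shard_hits, k))
    return heapq.nlargest(k, candidates)

shards = [
    [(0.9, "a"), (0.2, "b"), (0.5, "c")],
    [(0.8, "d"), (0.7, "e")],
    [(0.1, "f"), (0.95, "g")],
]
print(scatter_gather(shards, 2))   # [(0.95, 'g'), (0.9, 'a')]
```

Asking each shard for only its local top-k is enough, because any document in the global top-k must also be in the top-k of its own shard.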

Near-Real-Time Search

Elasticsearch is near-real-time, not instant. Indexed documents are first written to an in-memory buffer; a periodic refresh (default ~1 second) turns that buffer into a searchable Lucene segment. So a document you just indexed becomes searchable about a second later, not immediately. Segments are immutable and periodically merged in the background; durability is provided by a translog (write-ahead log). This refresh delay is the price of the indexing efficiency, and it's tunable (faster refresh = more overhead).
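A toy model may help show why a freshly indexed document is not immediately visible. The `ToyShard` class below is purely illustrative: it ignores the translog, merging, and timing, and only captures the idea that search reads segments while indexing writes to a buffer.

```python
class ToyShard:
    """Toy model of refresh: docs are searchable only once in a segment."""

    def __init__(self):
        self.buffer = []      # indexed but not yet searchable
        self.segments = []    # immutable, searchable batches

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        """Search looks only at segments, never at the buffer."""
        return [d for seg in self.segments for d in seg if word in d.split()]

shard = ToyShard()
shard.index("quick fox")
before = shard.search("fox")   # [] : not refreshed yet
shard.refresh()
after = shard.search("fox")    # ['quick fox']
print(before, after)
```

In Elasticsearch itself the refresh cadence is controlled by the `index.refresh_interval` index setting.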

Queries vs Filters

A crucial performance distinction: a query asks "how well does this match?" and computes a relevance score; a filter asks "does this match, yes or no?" with no scoring. Filters are cheaper and their results are cacheable, so structured constraints (status = "active", date range, category) should be filters, while the free-text part is a scored query.

Aspect	Query (must)	Filter
Question	"How relevant?" (scored)	"Match? yes/no" (no score)
Cacheable	No	Yes
Use for	Free-text relevance	Exact constraints (status, range, tag)
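In the Query DSL this split is typically expressed with a `bool` query: the scored clause goes under `must` and the exact constraints go under `filter`. The sketch below only builds the request body as a Python dict; the field names `title`, `status`, and `created_at` and the helper `build_search_body` are hypothetical, and sending the body to a cluster is left out.

```python
import json

def build_search_body(text, status, since):
    """Scored free-text clause in `must`, exact constraints in `filter`."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [{"match": {"title": text}}],
                # Yes/no constraints: no scoring, eligible for caching.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"created_at": {"gte": since}}},
                ],
            }
        }
    }

body = build_search_body("wireless headphones", "active", "2024-01-01")
print(json.dumps(body, indent=2))
```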

Beyond search, Elasticsearch also does aggregations — fast group-by/metrics over matching documents (counts per category, histograms over time) — which is what makes it a real-time analytics engine for dashboards and logs.
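As a rough illustration, an aggregation request for a log dashboard might look like the body below. The field names `level`, `service`, and `@timestamp` are assumptions about the index mapping (with `service` mapped as a keyword field), and the aggregation names `per_service` and `per_hour` are arbitrary labels chosen for this sketch.

```python
import json

agg_body = {
    "size": 0,  # skip the hits, return only aggregation results
    "query": {"bool": {"filter": [{"term": {"level": "error"}}]}},
    "aggs": {
        # Group-by: number of matching docs per service.
        "per_service": {"terms": {"field": "service", "size": 10}},
        # Histogram over time: matching docs per hour.
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}
        },
    },
}
print(json.dumps(agg_body, indent=2))
```

The aggregations run only over documents that pass the query, which is what lets a dashboard slice the same counts by any filter.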

What It's Used For

Full-text search — product catalogs, documentation, content sites with ranking and typo tolerance.
Log & event analytics — the ELK / Elastic stack (Elasticsearch + Logstash/Beats + Kibana) for centralized logs.
Observability — searching and aggregating metrics, logs, and traces (see observability).
Autocomplete & suggestions — edge-ngram analyzers or completion suggesters.
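The edge n-gram idea behind autocomplete is simple enough to sketch directly: index every prefix of a term so that a partially typed word becomes an exact term lookup. The function `edge_ngrams` below is a hypothetical helper that mimics what such a token filter emits; its `min_gram` and `max_gram` parameters are named after the usual settings.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of `token` between min_gram and max_gram chars."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("quick", min_gram=2, max_gram=4))   # ['qu', 'qui', 'quic']
print(edge_ngrams("fox"))                             # ['f', 'fo', 'fox']
```

The trade-off is index size: every indexed term is stored once per prefix length.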

Elasticsearch vs Alternatives

Versus a relational database: a SQL DB is your source of truth with transactions and joins; Elasticsearch is a denormalized, eventually-consistent search index you populate from that DB (often via change data capture). Versus a vector database: classic Elasticsearch does lexical search (matching terms), while vector DBs do semantic search (matching meaning via embeddings) — modern Elasticsearch supports vector/kNN search too, and hybrid search combining both is increasingly common.

Pitfalls

Not a primary datastore — it's a search index; keep your authoritative data in a real database and reindex into ES.
Mapping explosions — dynamically indexing every field of arbitrary JSON can blow up the index; define mappings deliberately.
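Deep pagination — "page 10,000" is expensive because every shard must collect and sort from + size hits and the coordinator must merge them all, and from + size is capped at 10,000 by default (index.max_result_window); use search_after, ideally with a point in time, instead. The scroll API also works but is no longer recommended for deep pagination.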
Shard sizing — too many tiny shards waste overhead; too few limits parallelism. Size up front since resharding means reindexing.
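For the deep-pagination pitfall, the `search_after` pattern can be sketched as follows: sort on a deterministic key, then feed the sort values of the last hit on one page into the next request. The helper `next_page_body`, the field names `created_at` and `id`, and the sample hit are all made up for illustration; only the `sort` and `search_after` keys belong to the request format.

```python
def next_page_body(query, sort, last_hit=None, size=20):
    """Build a search body; pass the previous page's last hit to continue."""
    body = {"size": size, "query": query, "sort": sort}
    if last_hit is not None:
        # Each hit carries the sort values it was ordered by.
        body["search_after"] = last_hit["sort"]
    return body

sort = [{"created_at": "desc"}, {"id": "asc"}]   # tiebreaker keeps order stable
query = {"match": {"title": "fox"}}

first = next_page_body(query, sort)
last_hit = {"_id": "42", "sort": ["2024-05-01T12:00:00Z", "42"]}  # made-up hit
second = next_page_body(query, sort, last_hit)
print(second["search_after"])   # ['2024-05-01T12:00:00Z', '42']
```

Each page costs about the same regardless of depth, because shards skip directly past the given sort values instead of re-collecting everything before them.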

takeaway

Elasticsearch is the inverted index made distributed and ranked. The inverted index turns search into a lookup; analyzers make matching flexible; BM25 ranks by relevance; and shards/replicas plus scatter-gather make it scale and stay available. Treat it as a search/analytics layer fed from your primary database — use filters for exact constraints, queries for relevance, and remember it's near-real-time.

🎯 interview hot-takes

Why not just SQL LIKE? LIKE scans every row and can't rank; the inverted index (term → docs) makes search a lookup and BM25 ranks by relevance.
What does an analyzer do? Tokenizes and normalizes text (lowercase, stem, stop-words) at index and query time, so "Running" matches "runs".
Query vs filter? Query computes a relevance score (not cacheable); filter is a yes/no match (cacheable). Use filters for exact constraints, queries for free text.
How does a distributed search run? Scatter-gather: the coordinator queries every shard in parallel, each returns local top-k, the coordinator merges into the global ranking.
Is it real-time? Near-real-time — new docs are searchable after a refresh (~1s), not instantly; and it's a search index, not your primary store.