Skip to main content

HNSW or IVF-PQ? What I Actually Chose at 2M Documents

Share:XLinkedInHN

The problem

I was building the retrieval layer for a GDPR-compliant RAG platform inside SAP Labs India, and the corpus had just crossed two million documents on its way to something larger. p95 end-to-end latency needed to sit under two seconds — retrieval plus rerank plus generation plus the PII redaction pass we ran on every hop. The retrieval slice of that budget was about 120 ms.

That number is the whole post. Everything else is a consequence of it.


The naive first approach

Every "vector search at scale" guide I read in 2024 pointed at the same recipe: IVF-PQ. Partition your vectors into a few thousand Voronoi cells with k-means (the IVF part), then quantize each vector down to a stack of 8-bit product codes (the PQ part). You get order-of-magnitude memory compression and sub-linear search. FAISS ships it. Milvus ships it. Every benchmark table has a row for it.

So I built it. On our workload, an IVF4096,PQ64 index over 2M ~768-dim embeddings fit in a shockingly small amount of RAM, and single-query search was fast enough on paper.

Two things broke.

The first was recall. Product quantization is lossy by construction — you're replacing a float vector with the centroid ID of the nearest cluster in each sub-space. On a corpus that mixes German legal text, English engineering docs, and structured metadata, that lossy step chewed through the recall we needed for the reranker to have anything to work with. Recall@10 on our internal eval fell into a range I could not defend to a product owner who cared about "did the right answer even make the top ten."

The second was tail latency under concurrency. IVF search is nprobe cells scanned linearly. When traffic went up and cells thrashed out of the CPU cache, p95 latency drifted in a way the p50 never showed. The mean was fine. The tail was where users lived.

I could patch both — bigger nprobe, re-ranking with the original floats, IVF-PQ-Refine — but every patch pushed the index closer to "the memory savings you were buying it for are gone."


The decision

I picked HNSW. Hierarchical Navigable Small World graphs, floats kept in RAM, no quantization on the hot path.


The tradeoff

The honest way to write this is a table. Numbers below are directional, from our workload — not a synthetic benchmark. Your mileage will vary with dimensionality, distribution, and hardware.

AxisIVF-PQ (IVF4096,PQ64)HNSW (M=32, efConstruction=200, efSearch=64)
RAM footprint at 2M vecs, 768-dimvery compact — codes only~6-8x larger; floats + graph edges
Recall@10 on our evaldropped below what our reranker could recover fromin the 94% range
p50 query latencyfastcomparable, sometimes faster
p95 under concurrencydrifted upwardflat
Insert costcheap; append to a listO(log N) graph walk per insert
Rebuild costmust retrain k-means on driftnone — grow in place
Behavior on OOD queries (new German legal jargon)quantization noise dominatesgraceful degradation
Ops storyretrain cadence, nprobe tuningtune efSearch at query time

The trade was explicit: I paid in RAM and I bought recall stability, tail-latency stability, and one fewer training pipeline to babysit. At 2M docs on the hardware I had, that trade was straightforwardly the right one.

At 200M docs, I would not make the same trade. More on that at the end.


Implementation notes

A few things that mattered more than the index choice itself:

Parameters we actually shipped. M=32 (max out-degree per node), efConstruction=200 (candidate set during build), efSearch=64 at query time as the default, with a per-request override for retrieval paths that were willing to pay 2-3 ms extra for a slightly deeper walk. M=32 is toward the high end — memory cost scales linearly with M — but the recall lift over M=16 was meaningful on the German-language subset and the vector budget could absorb it.

Filtered search was the actual hard problem. Nobody's RAG system is unfiltered. Every query in ours was scoped by tenant, by document ACL, and often by metadata (date range, document class). Post-filtering an HNSW result set works fine when your filter is unselective, and falls apart when it's selective — you walk the graph, get 64 neighbors, and 61 of them are for the wrong tenant, so you re-walk with a bigger efSearch, and now you've blown your latency budget on a request that should have been trivial. We ended up with a hybrid: a coarse metadata prefilter that shrank the candidate universe before the graph walk on selective filters, and vanilla post-filtering otherwise. The switch point was tuned per tenant.

The PII step ran downstream of retrieval, not upstream. We fine-tuned DeBERTa-base for German PII detection and got a +6 entity-F1 lift on Bundesdatenschutzgesetz classes over the XLM-R baseline — that model ran on the retrieved chunks before they were handed to the generator, not on the corpus at ingest time. Redacting at ingest would have poisoned the embeddings for any query that legitimately referenced a public entity. Redacting at retrieval kept the index clean and let us push retrieval concurrency without also scaling a PII inference tier per shard.

The client for the whole thing was a 9,000-server fleet. The credential-fetch client that fanned into this retrieval layer was a dependency-free Go binary, statically linked, no runtime. That decision has nothing to do with HNSW — but it's the reason the tail-latency story held up under real production load. A Python client with a cold VM on each invocation would have added variance we couldn't have engineered away in the index.

# Roughly what the query path did.
# The two knobs that mattered in production were efSearch and the prefilter.
 
def retrieve(query_vec, tenant_id, filters, k=10):
    # Selective filters: shrink the universe before the graph walk.
    if filters.is_selective(tenant_id):
        candidate_ids = metadata_prefilter(tenant_id, filters)   # bitmap
        return hnsw.search(
            query_vec,
            k=k,
            ef_search=64,
            allowed_ids=candidate_ids,           # graph walk skips disallowed nodes
        )
 
    # Unselective filters: walk the whole graph, post-filter cheaply.
    raw = hnsw.search(query_vec, k=k * 4, ef_search=96)
    return [hit for hit in raw if filters.match(hit)][:k]

HNSW hierarchical graph with a query walking greedily from a sparse top layer down through denser layers to a target neighborhood at the base, alongside an IVF-PQ Voronoi partition where the same query lands in one cell and nprobe neighboring cells are scanned across quantized codes

The two indexes, side by side. Left: HNSW walks the graph in log time over floats — RAM heavy, recall stable. Right: IVF-PQ assigns the query to a centroid and scans nprobe cells of quantized codes — RAM light, but every hop loses information.


What surprised me

The build path was the pain, not the query path.

HNSW inserts are O(log N) in the pretty-picture version, but they're O(log N) graph walks that touch cache-cold memory, and they don't parallelize the way you want. When we did a bulk backfill of a large document set, the insert throughput of a single builder process was the bottleneck of the whole ingestion pipeline — not the embedding model, not the network, not the storage. The generator that was writing the embeddings could produce them faster than the index could absorb them.

I had budgeted for HNSW to cost more memory. I had not budgeted for HNSW to cost more ingestion throughput. We ended up sharding the index build across workers and doing a merge pass at the end, which works but is not free — HNSW graph merges are not the trivially parallel operation IVF-PQ retraining is, because you're stitching graphs, not re-partitioning a flat list.

If I'd known that going in, I would have written the ingestion pipeline batch-first from day one instead of streaming-first.


What I'd do differently at 10x scale

At 20M docs, HNSW-with-floats is still fine on a fat instance. At 200M, the trade flips.

The path I would take:

  1. Two-tier retrieval. A coarse first stage over quantized vectors (product quantization or scalar quantization) to get to a candidate set of ~10k, then re-rank those with the original floats. The recall loss of quantization stops mattering when you're using it to shortlist, not to answer.
  2. DiskANN before pure HNSW. DiskANN's Vamana graph is built to spill to SSD without the pathological random-read pattern HNSW gets when you push it out of RAM. At 200M vectors, "keep it in RAM" stops being a strategy and starts being a bill.
  3. Filter-aware indexes. ACORN-style filter-first graph walks, or per-tenant sub-indexes when tenants are large enough to justify their own graph. The hybrid pre/post-filter switch we shipped is a stopgap, not an architecture.
  4. Move the reranker's job earlier. Some of what our cross-encoder reranker was doing could have been absorbed into a better first-stage index. That's a research bet, but at 200M it's the bet I'd take.

The meta-lesson: pick the index for the size you're at, not the size you might be at. IVF-PQ is the right answer at 200M and the wrong answer at 2M. HNSW is the right answer at 2M and the wrong answer at 200M. The engineering discipline is being honest about which regime you're in and being ready to switch before you're forced to.


See also


More on the RAG platform this was built for — 2M+ documents, 400+ users, sub-2s p95, GDPR-compliant PII detection — is on the projects page.

Try it: click to place points, then press Search (or alt/shift-click a spot to query from there). This shows the single-layer greedy walk — the real HNSW does the same thing on each layer of a hierarchy.

hnsw greedy search (single layer)

0/60 nodes

click to add · alt/shift-click to query

Cite as: Saravanan, K. (2026). HNSW or IVF-PQ? What I Actually Chose at 2M Documents. Kaushik Saravanan. https://www.kaushik.cv/blog/hnsw-vs-ivfpq-at-2m-docs