Redact at Retrieval, Not at Ingest: A GDPR-Compliant RAG Architecture

The problem

I was building a GDPR-compliant RAG platform on a corpus that had crossed two million documents, serving four hundred users, with a p95 end-to-end latency budget under two seconds. Inside that budget I had to fit vector retrieval, a cross-encoder reranker, the LLM generation itself, and a PII redaction pass that was non-negotiable under the Bundesdatenschutzgesetz (BDSG). If any single stage blew its slice, the whole request missed its target.

The interesting question was not "can we detect PII." That's a fine-tuned DeBERTa away. The interesting question was where in the pipeline the detector should run.


The naive first approach

The first design I sketched — and the design every compliance-first vendor deck recommends — was to redact at ingest. Run the PII model over every document as it entered the corpus, replace detected entities with placeholders, embed the redacted text, and store the redacted version in the index. The reasoning is seductive: if PII never enters the vector store, you cannot leak it. The compliance officer nods. The architecture diagram is clean.

Two things broke.

The first was recall on legitimate queries. A user typing "how do I request my Sozialversicherungsnummer" is asking a completely benign, publicly documented administrative question. The phrase "Sozialversicherungsnummer" is a PII entity class — my detector fired on it — but the query was legitimate and the answer is public. If I redacted the entity out of the corpus at ingest time, I had also redacted it out of the embeddings, and now the query and the corpus disagreed on what the document was even about. Recall on that class of question fell off a cliff. The model was doing its job. The pipeline was doing the wrong job.
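A toy illustration of that recall failure, assuming token overlap as a crude stand-in for embedding similarity. The real platform used a dense embedding model, and the one-word detector vocabulary, the sample document, and the `[SVN]` substitution here are all hypothetical. The point is only that once the entity is scrubbed from the document, the query and the document share less signal.

```python
import re

# Hypothetical detector vocabulary: one entity class name.
PII_TERMS = {"sozialversicherungsnummer"}


def tokens(text):
    """Lowercased word set, used as a crude proxy for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def redact_at_ingest(text):
    """Replace every detected term with a placeholder before indexing."""
    return " ".join(
        "[SVN]" if word.lower().strip(".,?") in PII_TERMS else word
        for word in text.split()
    )


def overlap(query, doc):
    """Jaccard overlap between query and document token sets."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d)


query = "how do I request my Sozialversicherungsnummer"
doc = "This page explains how to request your Sozialversicherungsnummer."

original_score = overlap(query, doc)
redacted_score = overlap(query, redact_at_ingest(doc))

print(f"original: {original_score:.3f}  redacted at ingest: {redacted_score:.3f}")
```

The redacted document scores lower against the same query because the one token that identified its topic is gone. A dense embedding degrades less abruptly than a token set, but in the same direction.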

The second was irreversibility. Redaction at ingest is destructive. If tomorrow the definition of what counts as PII shifts — a new entity class, a court ruling, a tenant with different rules — I have to re-embed the entire corpus. At 2M documents that's a real bill and a real outage window.

I could patch both — maintain a shadow un-redacted corpus, add allowlists for public entities, keep two indices — but every patch made the "PII never enters the vector store" claim less true. The compliance win I was buying was slipping away, and I was paying for it in recall and rebuild cost.


The decision

I moved the redaction pass downstream of retrieval. The index stored the original text. Retrieval returned original chunks. The PII detector ran on the retrieved chunks, in the tight window between the reranker's output and the LLM's prompt. What left the system was redacted. What lived inside it was not.


The tradeoff

The honest way to write this is a table. Numbers are directional, from the workload I built for.

Axis                                        | Redact at ingest          | Redact at retrieval
--------------------------------------------+---------------------------+-------------------------------
Recall on queries citing PII entity classes | dropped sharply           | preserved
Corpus rebuild cost when PII rules change   | full re-embed of 2M docs  | swap the detector, no re-embed
Blast radius if the detector fails          | quiet corpus corruption   | one bad chunk to one user
PII surface at rest in the index            | none (in theory)          | full text (access-controlled)
Latency added to hot path                   | zero                      | one detector pass per query
Ops story                                   | one-shot at ingest        | detector on the request path

The trade was explicit. I paid latency and I moved the PII surface from "gone at rest" to "present at rest, gated at egress." That is a real security tradeoff and it needed a real answer: the index sat behind the same tenant ACLs and encryption-at-rest the rest of the platform used, and no un-redacted chunk ever crossed the LLM boundary. The bet was that "gated at egress with a good detector" was closer to the actual regulatory ask than "destroyed at ingest with broken recall."


Fitting it in the latency budget

The retrieval slice of the budget was about 120 ms. The reranker was another 200-300 ms depending on candidate set size. The LLM was the elephant, usually 900-1400 ms depending on prompt length. That left me a small, sharp window — call it 100 ms — for the PII pass to slot in between the reranker and the generator without pushing p95 over two seconds.

The thing that made it fit was batching. A DeBERTa-base inference on a single ~500-token chunk on our GPU tier was in the 40-60 ms range. Running it k times sequentially for k=10 reranked chunks would have been 400-600 ms and would have eaten the whole budget by itself. Running it as a single batched forward pass over all ten chunks in one go was closer to 70 ms. That was the surprise: the PII pass added less latency than I had budgeted for, because one padded batch of ten chunks amortized the per-call overhead almost completely. I had penciled in 150 ms and shipped with 70.
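The arithmetic behind that budget can be checked in a few lines. This is a rough sketch that simply sums the per-stage figures quoted above. Summing stage-level worst cases is a conservative approximation, not a true end-to-end p95, and the 50 ms per-call figure is just the midpoint of the 40-60 ms range.

```python
BUDGET_MS = 2000

# (low, high) per-stage latency figures quoted in the text, in milliseconds.
stages = {
    "retrieval": (120, 120),
    "reranker": (200, 300),
    "llm": (900, 1400),
}


def total(pii_ms):
    """Return (best case, worst case) end-to-end latency for a given PII pass cost."""
    low = sum(lo for lo, _ in stages.values()) + pii_ms
    high = sum(hi for _, hi in stages.values()) + pii_ms
    return low, high


sequential = total(10 * 50)  # ten separate calls at ~50 ms each
batched = total(70)          # one batched forward pass

headroom_sequential = BUDGET_MS - sequential[1]
headroom_batched = BUDGET_MS - batched[1]

print("sequential:", sequential, "headroom:", headroom_sequential)
print("batched:   ", batched, "headroom:", headroom_batched)
```

With these inputs the sequential loop overshoots the two-second budget in the worst case, while the batched pass leaves a little over 100 ms of headroom.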

The reranker also earned its keep here. Without a reranker, the PII pass would have had to run over the raw top-k from HNSW, which we wanted to be wide (k=50 or higher) to give the reranker room. With the reranker in front, the PII pass only ever saw the top ten chunks the reranker actually promoted, and the batched cost was capped by k=10, not k=50.


The pipeline

The full retrieval-to-generation path had four stages. Each one had a job the next one depended on.

        query text
            |
            v
   [ query embedding ]
            |
            v
   [ HNSW vector search ]  -- top 50, per-tenant filtered
            |
            v
   [ cross-encoder reranker ]  -- top 10, scored
            |
            v
   [ DeBERTa PII pass ]  -- batched over 10 chunks, entities to placeholders
            |
            v
   [ LLM generation ]  -- sees redacted context only
            |
            v
        answer

The diagram is worth sitting with. The PII pass is the last thing that touches the chunks before the model, and it is the first stage that handles them on their way out of the system. Everything above it is inside the trust boundary. Everything below it is outside. The redaction step is the boundary.

[Figure: two horizontal swimlanes contrasting the redact-at-ingest and redact-at-retrieval architectures. In the top lane the corpus is scrubbed before indexing, and a query for "Sozialversicherungsnummer" misses because the embeddings no longer carry the entity. In the bottom lane the corpus stays intact: full chunks go through HNSW retrieval and the cross-encoder reranker, then through the DeBERTa PII boundary that masks entities before they reach the LLM.]
Caption: Redact at ingest breaks recall on queries whose targets are legitimate PII entity classes. Redact at retrieval keeps the corpus embeddable and moves the boundary to the last step before the LLM: a batched ~70 ms DeBERTa pass that slots between the reranker and generation without breaching the p95 budget.

Implementation notes

A few things that mattered more than the model choice.

The detector was fine-tuned, not off-the-shelf. DeBERTa-base pretrained checkpoints are strong on English PII entity types and mediocre on German administrative vocabulary. I fine-tuned on a labeled corpus of Bundesdatenschutzgesetz entity types — Sozialversicherungsnummer, Steuer-ID, health identifiers, address components — and reached 94% recall@10 on the entity classes that mattered. Precision was less critical here: a false positive redacts a token the model didn't need; a false negative leaks. The loss function reflected that asymmetry.
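One common way to encode that asymmetry is a class-weighted cross-entropy, where mistakes on entity tokens cost more than mistakes on ordinary tokens. The post does not say which loss was actually used, so the sketch below is an assumption, and the 5x weight is an arbitrary illustrative value. In PyTorch the same idea is usually expressed through the `weight` argument of `torch.nn.CrossEntropyLoss`; plain Python is used here to keep the arithmetic visible.

```python
import math

# Hypothetical class weights: a missed entity costs 5x a spurious redaction.
CLASS_WEIGHTS = {"O": 1.0, "PII": 5.0}


def weighted_nll(true_label, predicted_probs):
    """Weighted negative log-likelihood for a single token."""
    return -CLASS_WEIGHTS[true_label] * math.log(predicted_probs[true_label])


# The same-sized mistake in both directions: the model puts 0.1 on the truth.
false_negative = weighted_nll("PII", {"O": 0.9, "PII": 0.1})  # entity missed, would leak
false_positive = weighted_nll("O", {"O": 0.1, "PII": 0.9})    # benign token redacted

print(f"false negative loss: {false_negative:.2f}")
print(f"false positive loss: {false_positive:.2f}")
```

Under this weighting, training pushes harder against leaks than against over-redaction, which trades precision for recall in the direction the paragraph describes.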

Batching was the whole game. Ten chunks were tokenized as one batched input, padded to a common length with attention masks, and run through the model in a single forward pass. The entity spans were then mapped back to their originating chunks by offset. The alternative, a ten-call loop, was measured, was six times slower, and was the version I almost shipped.
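The offset mapping can be sketched as follows, assuming the batch is laid out one chunk per row and the detector emits BIO tags per token. The helper names and the sample data are hypothetical. The character offsets are the kind a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`, where special and padding tokens carry an empty `(0, 0)` offset.

```python
def spans_for_chunk(offsets, labels):
    """Merge BIO token labels into (start_char, end_char, entity_type) spans."""
    spans = []
    current = None  # [start, end, entity_type] of the span being built
    for (start, end), label in zip(offsets, labels):
        if start == end:
            continue  # special or padding token: no characters behind it
        entity = label[2:] if label != "O" else None
        if entity is None:
            if current is not None:
                spans.append(tuple(current))
                current = None
        elif label.startswith("I-") and current is not None and current[2] == entity:
            current[1] = end  # continuation of the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, entity]  # B- tag, or an I- tag with nothing to continue
    if current is not None:
        spans.append(tuple(current))
    return spans


def spans_for_batch(batch_offsets, batch_labels):
    """Row i of the batch is chunk i, so spans never cross chunk boundaries."""
    return [spans_for_chunk(o, l) for o, l in zip(batch_offsets, batch_labels)]


chunks = ["Call Anna Schmidt today", "No entities"]
# Made-up tokenizer output, padded to six tokens per row.
batch_offsets = [
    [(0, 0), (0, 4), (5, 9), (10, 17), (18, 23), (0, 0)],
    [(0, 0), (0, 2), (3, 11), (0, 0), (0, 0), (0, 0)],
]
batch_labels = [
    ["O", "O", "B-NAME", "I-NAME", "O", "O"],
    ["O", "O", "O", "O", "O", "O"],
]

result = spans_for_batch(batch_offsets, batch_labels)
print(result)
```

Because offsets are relative to each chunk's own text, the spans can be applied to the chunk directly with no bookkeeping about where it sat in the batch.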

The redaction was reversible for the auditor, irreversible for the LLM. Detected entities were replaced with typed placeholders ([SVN], [ADDR], [NAME_1]) in the prompt to the LLM. The original spans were kept in a per-request audit log inside the trust boundary. If a compliance review needed to reconstruct what the model saw versus what the corpus held, the mapping existed. The LLM never got the reverse mapping.
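A minimal sketch of that two-sided redaction, assuming spans arrive as `(start, end, type)` character ranges. The numbering scheme is an assumption: every placeholder gets a per-type counter, and a repeated surface string reuses its placeholder so the LLM can still tell that two mentions refer to the same thing. The returned mapping is what would go to the audit log and never into the prompt.

```python
def redact(text, spans):
    """Replace each span with a typed placeholder.

    Returns (redacted_text, mapping), where mapping is placeholder -> original text.
    """
    counters = {}   # entity type -> how many distinct values seen so far
    assigned = {}   # (entity type, original text) -> placeholder
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, etype in sorted(spans):
        start = max(start, cursor)  # clamp overlaps so nothing is emitted twice
        if start >= end:
            continue                # span fully covered by an earlier one
        original = text[start:end]
        key = (etype, original)
        if key not in assigned:
            counters[etype] = counters.get(etype, 0) + 1
            assigned[key] = f"[{etype}_{counters[etype]}]"
        placeholder = assigned[key]
        mapping[placeholder] = original
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


text = "Anna Schmidt lives at Hauptstr. 5. Call Anna Schmidt."
name, addr = "Anna Schmidt", "Hauptstr. 5"
first = text.index(name)
second = text.index(name, first + 1)
where = text.index(addr)
spans = [
    (first, first + len(name), "NAME"),
    (where, where + len(addr), "ADDR"),
    (second, second + len(name), "NAME"),
]

redacted_text, audit_mapping = redact(text, spans)
print(redacted_text)   # goes to the LLM
print(audit_mapping)   # stays inside the trust boundary
```

Overlapping spans are clamped rather than skipped, so a detector that emits two overlapping hits still leaves no unredacted tail.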

Selective redaction stayed off the roadmap. I was tempted to build a "redact only if the requesting user isn't the entity subject" rule. Doing that correctly means resolving entity references to identities and matching them to authenticated users, which is a whole retrieval problem of its own. The MVP redacted uniformly. The uniform rule shipped. The clever rule stayed on the wishlist.

# Rough shape of the retrieve-rerank-redact-answer flow.
# The PII pass is one batched call, not a loop.
# embed, hnsw, reranker, pii_model, redact, audit, and llm are platform
# components that are not defined in this snippet.

def answer(query, tenant_id):
    q_vec = embed(query)

    # Retrieve full, un-redacted chunks.
    candidates = hnsw.search(q_vec, k=50, tenant=tenant_id, ef_search=64)

    # Rerank narrows 50 -> 10.
    top = reranker.rerank(query, candidates, top_k=10)

    # One batched forward pass, not ten.
    entity_spans = pii_model.detect_batch([c.text for c in top])

    # Replace entities in place with typed placeholders.
    redacted = [
        redact(chunk.text, spans)
        for chunk, spans in zip(top, entity_spans)
    ]

    # Audit log lives inside the trust boundary; LLM does not see it.
    audit.log(tenant_id, query, top, entity_spans)

    return llm.generate(query, context=redacted)

What surprised me

I had expected the PII pass to be the stage that made or broke the latency budget. It wasn't. Batching turned a 500 ms sequential problem into a 70 ms parallel one, and the real budget pressure came from the LLM, where I had almost no leverage. Everything upstream of the LLM ended up with more slack than I had planned for, and everything downstream had none.

The other surprise was how much simpler operations got. One detector version, running on the request path, meant I could roll out a new PII model with a config flip. No re-embedding runs, no shadow indices, no "which version of the corpus is this?" ambiguity. The redaction pass was ephemeral by design. That was worth more than the latency saving.


What I'd do differently at 10x scale

At 400 users the batched pass at k=10 was comfortable. At 4,000 concurrent users it would not be. The path I'd take:

  1. Streamed redaction into the LLM prefill. At scale, the LLM's first-token latency dominates. If the PII pass and the LLM prefill can overlap (redact chunk two while the LLM prefills the already-redacted chunk one), the redaction step disappears into the shadow of the generation step.
  2. A smaller, distilled detector for the hot path. DeBERTa-base was the right precision/latency tradeoff at 400 users. A distilled 6-layer variant, fine-tuned against the base as teacher, was on my roadmap for the next tier. Losing a point of recall to gain 3x throughput would have been a good trade at 4,000 users, and a bad trade at 400.
  3. Per-tenant detector variants. Not every tenant has the same PII vocabulary. Health tenants care about ICD codes; finance tenants care about IBANs; administrative tenants care about SVN. One detector doing all jobs is a monolith. A small model registry keyed by tenant is not.
  4. Move the audit log out of the request path. Writing the pre-redaction spans synchronously to the audit store was fine at our load. At scale, that write becomes a tail-latency hazard. A durable queue with async flush, and an audit reader that reconstructs on demand, is the version that scales.
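As a rough sketch of item 4, the audit write can be reduced to an enqueue on the request path, with a background worker doing the actual persistence. The class name and the `sink` callable are hypothetical. This version uses an in-memory `queue.Queue`, which is not durable, so it only shows the shape of the hand-off. A production version would need a persistent queue so that a crash cannot drop audit records.

```python
import queue
import threading


class AsyncAuditLog:
    """Buffers audit records and writes them from a background thread."""

    def __init__(self, sink):
        self._sink = sink            # callable that persists one record
        self.failed = []             # records whose write raised, kept for retry
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record):
        # Request path: enqueue only, never wait on the audit store.
        self._queue.put(record)

    def _drain(self):
        while True:
            record = self._queue.get()
            try:
                self._sink(record)
            except Exception:
                # Never silently drop an audit record; park it for retry.
                self.failed.append(record)
            finally:
                self._queue.task_done()

    def flush(self):
        # Block until every queued record has been handed to the sink.
        self._queue.join()


written = []
audit_log = AsyncAuditLog(written.append)
audit_log.log({"tenant": "t1", "query": "q1", "spans": [(0, 12, "NAME")]})
audit_log.log({"tenant": "t1", "query": "q2", "spans": []})
audit_log.flush()
print(len(written), "records written")
```

A single worker preserves record order, and `flush` gives tests and shutdown hooks a way to wait for the backlog to drain.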

The meta-lesson: redact at the boundary the data actually crosses, not the boundary that's easiest to draw on the whiteboard. Ingest is the easy boundary. Egress to the LLM is the honest one.


See also


More on the RAG platform this was built for — 2M+ documents, 400+ users, sub-2s p95, GDPR-compliant PII detection — is on the projects page.

Cite as: Saravanan, K. (2026). Redact at Retrieval, Not at Ingest: A GDPR-Compliant RAG Architecture. Kaushik Saravanan. https://www.kaushik.cv/blog/redact-at-retrieval-gdpr-rag