Who is Kaushik Saravanan?

Kaushik Saravanan is an AI/ML engineer and MS in Artificial Intelligence Engineering candidate at Carnegie Mellon University (ECE, expected December 2027), based in Pittsburgh, PA. He was previously an Associate Application Engineer at SAP Labs India (2024–2026), where he shipped production GDPR-compliant RAG and LLM systems to 400+ users. He is an IEEE-published researcher and a Smart India Hackathon 2022 winner.

Is Kaushik Saravanan open to new AI/ML roles?

Yes. Kaushik is open to Summer 2027 AI/ML and RAG internships in the US, and full-time AI engineering roles starting January 2028 after his CMU MS-AIE graduation. Reach out via LinkedIn (linkedin.com/in/kaushiksss) or X (@Kaushiks0).

Does Kaushik need visa sponsorship?

Kaushik is an F-1 international student at Carnegie Mellon University. He has 3-year STEM OPT eligibility after his December 2027 graduation, and is open to employers who sponsor H-1B afterward.

What did Kaushik build at SAP Labs India?

At SAP Labs India (2024–2026) he engineered a GDPR-compliant, privacy-first RAG platform for SAP's internal chatbot. He scaled it to 2M+ documents and 400+ users with <2s p95 end-to-end latency, fine-tuned DeBERTa for Germany-specific PII detection (94% recall@10, MRR@10=0.82), and rewrote a credential-fetch client in dependency-free Go for 9,000+ Linux servers.

What are Kaushik's IEEE publications?

Two IEEE papers: 'Swarm Intelligence-Based Cooperative Intelligent Transportation System' (ICCIES 2025) and 'Cognitive Intrusion Detection System in Autonomous Vehicles Using Machine Learning' (ICPECTS 2024).

What is Kaushik's tech stack?

Python, Go, FastAPI, PyTorch, TensorFlow, Hugging Face Transformers, LangChain, PostgreSQL, Docker, Kubernetes, NVIDIA CUDA, Google Cloud Platform, and Microsoft Azure. Specializes in RAG pipelines, LLM fine-tuning (DeBERTa, QLoRA), and cloud observability.

Redact at Retrieval, Not at Ingest: A GDPR-Compliant RAG Architecture

The problem

I was building a GDPR-compliant RAG platform on a corpus that had crossed two million documents, serving four hundred users, with a p95 end-to-end latency budget under two seconds. Inside that budget I had to fit vector retrieval, a cross-encoder reranker, the LLM generation itself, and a PII redaction pass that was non-negotiable under the Bundesdatenschutzgesetz (BDSG). If any single stage blew its slice, the whole request missed the budget.

The interesting question was not "can we detect PII." That's a fine-tuned DeBERTa away. The interesting question was where in the pipeline the detector should run.

The naive first approach

The first design I sketched — and the design every compliance-first vendor deck recommends — was to redact at ingest. Run the PII model over every document as it entered the corpus, replace detected entities with placeholders, embed the redacted text, and store the redacted version in the index. The reasoning is seductive: if PII never enters the vector store, you cannot leak it. The compliance officer nods. The architecture diagram is clean.

Two things broke.

The first was recall on legitimate queries. A user typing "how do I request my Sozialversicherungsnummer" is asking a completely benign, publicly documented administrative question. The phrase "Sozialversicherungsnummer" is a PII entity class — my detector fired on it — but the query was legitimate and the answer is public. If I redacted the entity out of the corpus at ingest time, I had also redacted it out of the embeddings, and now the query and the corpus disagreed on what the document was even about. Recall on that class of question fell off a cliff. The model was doing its job. The pipeline was doing the wrong job.
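The mismatch can be reproduced with a toy example. The sketch below uses a bag-of-words cosine similarity as a crude stand-in for a real embedding model, and the sample document is invented for illustration; neither is what the platform ran. It only shows the mechanism: once the entity is replaced by a placeholder in the stored text, the query and the document share less vocabulary.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector; a toy stand-in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "how do I request my Sozialversicherungsnummer"

# Hypothetical public help-page sentence, made up for this example.
original = "To request your Sozialversicherungsnummer, contact the pension insurance office."

# What redact-at-ingest would store and embed instead.
redacted = original.replace("Sozialversicherungsnummer", "[SVN]")

sim_original = cosine(bow(query), bow(original))
sim_redacted = cosine(bow(query), bow(redacted))

print(f"original: {sim_original:.3f}  redacted: {sim_redacted:.3f}")
```

In this toy case the similarity halves, because the one distinctive term the query and the document shared is gone. Dense embeddings degrade less abruptly than word counts, but the direction of the effect is the same.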

The second was irreversibility. Redaction at ingest is destructive. If tomorrow the definition of what counts as PII shifts — a new entity class, a court ruling, a tenant with different rules — I have to re-embed the entire corpus. At 2M documents that's a real bill and a real outage window.

I could patch both — maintain a shadow un-redacted corpus, add allowlists for public entities, keep two indices — but every patch made the "PII never enters the vector store" claim less true. The compliance win I was buying was slipping away, and I was paying for it in recall and rebuild cost.

The decision

I moved the redaction pass downstream of retrieval. The index stored the original text. Retrieval returned original chunks. The PII detector ran on the retrieved chunks, in the tight window between the reranker's output and the LLM's prompt. What left the system was redacted. What lived inside it was not.

The tradeoff

The honest way to write this is a table. Numbers are directional, from the workload I built for.

Axis	Redact at ingest	Redact at retrieval
Recall on queries citing PII entity classes	dropped sharply	preserved
Corpus rebuild cost when PII rules change	full re-embed of 2M docs	swap the detector, no re-embed
Blast radius if the detector fails	quiet corpus corruption	one bad chunk to one user
PII surface at rest in the index	none (in theory)	full text (access-controlled)
Latency added to hot path	zero	one detector pass per query
Ops story	one-shot at ingest	detector on the request path

The trade was explicit. I paid latency and I moved the PII surface from "gone at rest" to "present at rest, gated at egress." That is a real security tradeoff and it needed a real answer: the index sat behind the same tenant ACLs and encryption-at-rest the rest of the platform used, and no un-redacted chunk ever crossed the LLM boundary. The bet was that "gated at egress with a good detector" was closer to the actual regulatory ask than "destroyed at ingest with broken recall."

Fitting it in the latency budget

The retrieval slice of the budget was about 120 ms. The reranker was another 200-300 ms depending on candidate set size. The LLM was the elephant, usually 900-1400 ms depending on prompt length. That left me a small, sharp window — call it 100 ms — for the PII pass to slot in between the reranker and the generator without pushing p95 over two seconds.

The thing that made it fit was batching. A DeBERTa-base inference on a single ~500-token chunk on our GPU tier was in the 40-60 ms range. Running it k times sequentially for k=10 reranked chunks would have been 400-600 ms, several times the window I had set aside for it. Running it as a single batched forward pass over all ten chunks in one go was closer to 70 ms. That was the surprise: the PII pass added less latency than I had budgeted for, because running all ten chunks as one padded batch amortized the per-call overhead almost completely. I had penciled in 150 ms and shipped with 70.
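The budget arithmetic is easy to check by hand. The sketch below adds up the stage ranges quoted above for both versions of the PII pass. Summing per-stage figures is only a rough proxy for an end-to-end p95 (percentiles do not add exactly), so treat it as a sanity check on the slices, not a measurement.

```python
P95_BUDGET_MS = 2000

# (low, high) latency per stage in milliseconds, as quoted in the text.
STAGES = {
    "retrieval": (120, 120),
    "reranker": (200, 300),
    "llm": (900, 1400),
}


def total(pii_ms):
    """Add a (low, high) PII-pass latency to the other stages."""
    low = sum(lo for lo, _ in STAGES.values()) + pii_ms[0]
    high = sum(hi for _, hi in STAGES.values()) + pii_ms[1]
    return low, high


sequential = total((400, 600))  # ten calls at 40-60 ms each
batched = total((70, 70))       # one batched forward pass

for name, (low, high) in {"sequential": sequential, "batched": batched}.items():
    verdict = "fits" if high < P95_BUDGET_MS else "misses"
    print(f"{name}: {low}-{high} ms, worst case {verdict} the budget")
```

With the sequential loop the worst case lands at 2,420 ms; with the batched pass it lands at 1,890 ms, which is the only version that stays under two seconds when the reranker and the LLM are both at the top of their ranges.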

The reranker also earned its keep here. Without a reranker, the PII pass would have had to run over the raw top-k from HNSW, which we wanted to be wide (k=50 or higher) to give the reranker room. With the reranker in front, the PII pass only ever saw the top ten chunks the reranker actually promoted, and the batched cost was capped by k=10, not k=50.

The pipeline

The full retrieval-to-generation path had four stages. Each one had a job the next one depended on.

        query text
            |
            v
   [ query embedding ]
            |
            v
   [ HNSW vector search ]  -- top 50, per-tenant filtered
            |
            v
   [ cross-encoder reranker ]  -- top 10, scored
            |
            v
   [ DeBERTa PII pass ]  -- batched over 10 chunks, entities to placeholders
            |
            v
   [ LLM generation ]  -- sees redacted context only
            |
            v
        answer

The diagram is worth sitting with. The PII pass is the last thing that touches the chunks before the model, and it is the first thing that has ever seen them in an outbound direction. Everything to its left is inside the trust boundary. Everything to its right is outside it. The redaction step is the boundary.

Redact at ingest breaks recall on queries whose targets are legitimate PII entity classes. Redact at retrieval keeps the corpus embeddable and moves the boundary to the last step before the LLM — a batched ~70 ms DeBERTa pass that slots between the reranker and generation without breaching the p95 budget.

Implementation notes

A few things that mattered more than the model choice.

The detector was fine-tuned, not off-the-shelf. DeBERTa-base pretrained checkpoints are strong on English PII entity types and mediocre on German administrative vocabulary. I fine-tuned on a labeled corpus of Bundesdatenschutzgesetz entity types — Sozialversicherungsnummer, Steuer-ID, health identifiers, address components — and reached 94% recall@10 on the entity classes that mattered. Precision was less critical here: a false positive redacts a token the model didn't need; a false negative leaks. The loss function reflected that asymmetry.
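One common way to encode that asymmetry is a class-weighted cross-entropy, where mistakes on PII tokens cost more than mistakes on non-PII tokens. The sketch below is an illustration of that idea only: the two-class label scheme and the 5x weight are assumptions made for the example, not the loss or the values used in the fine-tune described here.

```python
import math

# Label ids: 0 = "O" (not PII), 1 = PII.
# The 5x weight is a made-up value for illustration.
CLASS_WEIGHTS = {0: 1.0, 1: 5.0}


def weighted_cross_entropy(probs, gold, weights=CLASS_WEIGHTS):
    """Mean over tokens of cross-entropy scaled by the gold class weight.

    probs[i][c] is the predicted probability of class c for token i;
    gold[i] is the true class of token i.
    """
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, gold)]
    return sum(losses) / len(losses)


# Same confidence, opposite mistakes: the model puts only 0.2 on the gold class.
missed_entity = weighted_cross_entropy([[0.8, 0.2]], [1])   # false negative
false_positive = weighted_cross_entropy([[0.2, 0.8]], [0])  # over-redaction

print(f"missed entity: {missed_entity:.3f}  false positive: {false_positive:.3f}")
```

With these weights a missed entity is penalized five times as heavily as an equally confident false positive, which pushes the model toward over-redacting instead of leaking.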

Batching was the whole game. Ten chunks were tokenized as one batch, padded to a common length with the attention mask marking the padding, and run through the model in a single forward pass. The entity spans were then mapped back to their originating chunks by batch row and character offset. The alternative, a ten-call loop, was measured, was six times slower, and was the version I almost shipped.
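The mapping-back step can be sketched without a model. The example below assumes the detector emits one BIO label per token and that the tokenizer reports a (start, end) character offset per token, with (0, 0) for special and padding tokens, as Hugging Face fast tokenizers do. The BIO scheme, the function names, and the hand-written labels and offsets are assumptions for illustration, not the production code.

```python
def decode_spans(labels, offsets):
    """Merge per-token BIO labels into (char_start, char_end, type) spans."""
    spans = []
    current = None  # [start, end, type] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special or padding token: never part of a span
            label = "O"
        etype = label[2:] if label != "O" else None
        if label.startswith("I-") and current is not None and current[2] == etype:
            current[1] = end  # extend the open span
            continue
        if current is not None:  # anything else closes the open span
            spans.append(tuple(current))
            current = None
        if etype is not None:
            current = [start, end, etype]
    if current is not None:
        spans.append(tuple(current))
    return spans


def decode_batch(batch_labels, batch_offsets):
    """Row i of the batch maps back to chunk i; offsets are chunk-local."""
    return [decode_spans(l, o) for l, o in zip(batch_labels, batch_offsets)]


chunks = ["Anna Schmidt lives in Bonn.", "No PII here."]

# Hand-written stand-ins for model output on a padded batch of two rows.
batch_labels = [
    ["O", "B-NAME", "I-NAME", "O", "O", "B-ADDR", "O", "O"],
    ["O", "O", "O", "O", "O", "O", "O", "O"],
]
batch_offsets = [
    [(0, 0), (0, 4), (5, 12), (13, 18), (19, 21), (22, 26), (26, 27), (0, 0)],
    [(0, 0), (0, 2), (3, 6), (7, 11), (11, 12), (0, 0), (0, 0), (0, 0)],
]

entity_spans = decode_batch(batch_labels, batch_offsets)
```

Because every row carries its own chunk-local offsets, padding the batch to a common length does not disturb the mapping: the spans for row i index directly into chunk i's original text.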

The redaction was reversible for the auditor, irreversible for the LLM. Detected entities were replaced with typed placeholders ([SVN], [ADDR], [NAME_1]) in the prompt to the LLM. The original spans were kept in a per-request audit log inside the trust boundary. If a compliance review needed to reconstruct what the model saw versus what the corpus held, the mapping existed. The LLM never got the reverse mapping.
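A minimal sketch of that split, assuming spans arrive as (start, end, type) character offsets. The function signature, the choice to number only NAME placeholders, and the shape of the audit records are illustrative assumptions; the text above only fixes the placeholder style.

```python
def redact(text, spans, numbered_types=("NAME",)):
    """Replace entity spans with typed placeholders.

    Returns (redacted_text, audit_records). Only redacted_text goes to the
    LLM; audit_records keep the originals and stay inside the trust boundary.
    """
    pieces, audit, seen = [], [], {}
    cursor = 0
    for start, end, etype in sorted(spans):
        if start < cursor:  # overlapping span: the earlier one wins
            continue
        original = text[start:end]
        if etype in numbered_types:
            # The same surface string gets the same number within one text.
            ids = seen.setdefault(etype, {})
            index = ids.setdefault(original, len(ids) + 1)
            placeholder = f"[{etype}_{index}]"
        else:
            placeholder = f"[{etype}]"
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        audit.append(
            {"placeholder": placeholder, "start": start, "end": end, "original": original}
        )
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), audit


text = "Anna Schmidt lives in Bonn."
redacted, audit_records = redact(text, [(0, 12, "NAME"), (22, 26, "ADDR")])
print(redacted)
```

The audit records carry the original offsets and strings, so a reviewer can reconstruct the pre-redaction chunk from the log, while the prompt itself contains nothing that maps a placeholder back to a value.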

Selective redaction stayed off the roadmap. I was tempted to build a "redact only if the requesting user isn't the entity subject" rule. Doing that correctly means resolving entity references to identities and matching them to authenticated users, which is a whole retrieval problem of its own. The MVP redacted uniformly. The uniform rule shipped. The clever rule stayed on the wishlist.

# Rough shape of the retrieve-rerank-redact-answer flow.
# The PII pass is one batched call, not a loop.
 
def answer(query, tenant_id):
    q_vec = embed(query)
 
    # Retrieve full, un-redacted chunks.
    candidates = hnsw.search(q_vec, k=50, tenant=tenant_id, ef_search=64)
 
    # Rerank narrows 50 -> 10.
    top = reranker.rerank(query, candidates, top_k=10)
 
    # One batched forward pass, not ten.
    entity_spans = pii_model.detect_batch([c.text for c in top])
 
    # Replace entities in place with typed placeholders.
    redacted = [
        redact(chunk.text, spans)
        for chunk, spans in zip(top, entity_spans)
    ]
 
    # Audit log lives inside the trust boundary; LLM does not see it.
    audit.log(tenant_id, query, top, entity_spans)
 
    return llm.generate(query, context=redacted)

What surprised me

I had expected the PII pass to be the stage that made or broke the latency budget. It wasn't. Batching turned a 500 ms sequential problem into a 70 ms parallel one, and the real budget pressure came from the LLM, where I had almost no leverage. Everything upstream of the LLM ended up with more slack than I had planned for, and everything downstream had none.

The other surprise was how much simpler operations got. One detector version, running on the request path, meant I could roll out a new PII model with a config flip. No re-embedding runs, no shadow indices, no "which version of the corpus is this?" ambiguity. The redaction pass was ephemeral by design. That was worth more than the latency saving.

What I'd do differently at 10x scale

At 400 users the batched pass at k=10 was comfortable. At 4,000 concurrent users it would not be. The path I'd take:

Streamed redaction into the LLM prefill. At scale, the LLM's first-token latency dominates. If the PII pass and the LLM prefill can overlap (redact chunk two while the LLM is already prefilling the redacted chunk one), the redaction step disappears into the shadow of the generation step.
A smaller, distilled detector for the hot path. DeBERTa-base was the right precision/latency tradeoff at 400 users. A distilled 6-layer variant, fine-tuned against the base as teacher, was on my roadmap for the next tier. Losing a point of recall to gain 3x throughput would have been a good trade at 4,000 users, and a bad trade at 400.
Per-tenant detector variants. Not every tenant has the same PII vocabulary. Health tenants care about ICD codes; finance tenants care about IBANs; administrative tenants care about SVN. One detector doing all jobs is a monolith. A small model registry keyed by tenant is not.
Move the audit log out of the request path. Writing the pre-redaction spans synchronously to the audit store was fine at our load. At scale, that write becomes a tail-latency hazard. A durable queue with async flush, and an audit reader that reconstructs on demand, is the version that scales.
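The last item is the easiest to sketch. The example below moves the audit write off the request path with a bounded queue and a background worker. It is an in-process illustration only: `AsyncAuditLog` and its `sink` callable are hypothetical names, and a standard-library queue is not durable, so a real deployment would put a persistent queue or message broker in its place to avoid losing audit records on a crash.

```python
import queue
import threading


class AsyncAuditLog:
    """Buffer audit records in memory and flush them from a worker thread."""

    def __init__(self, sink, maxsize=10_000):
        self._sink = sink  # callable that persists one record
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record):
        """Called on the request path. Blocks only when the buffer is full,
        so records are delayed under back-pressure but never dropped."""
        self._queue.put(record)

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:  # shutdown sentinel
                break
            self._sink(record)

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


# Usage: a list stands in for the audit store.
store = []
audit = AsyncAuditLog(sink=store.append)
for request_id in range(3):
    audit.log({"request": request_id, "spans": []})
audit.close()
```

Blocking on a full buffer instead of dropping records is a deliberate choice for a compliance log: under overload the request slows down, but the audit trail stays complete.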

The meta-lesson: redact at the boundary the data actually crosses, not the boundary that's easiest to draw on the whiteboard. Ingest is the easy boundary. Egress to the LLM is the honest one.