Who is Kaushik Saravanan?

Kaushik Saravanan is an AI/ML engineer and MS in Artificial Intelligence Engineering candidate at Carnegie Mellon University (ECE, expected December 2027), based in Pittsburgh, PA. He was previously an Associate Application Engineer at SAP Labs India (2024–2026), where he shipped production GDPR-compliant RAG and LLM systems to 400+ users. He is also an IEEE-published researcher and a Smart India Hackathon 2022 winner.

Is Kaushik Saravanan open to new AI/ML roles?

Yes. Kaushik is open to Summer 2027 AI/ML and RAG internships in the US, and full-time AI engineering roles starting January 2028 after his CMU MS-AIE graduation. Reach out via LinkedIn (linkedin.com/in/kaushiksss) or X (@Kaushiks0).

Does Kaushik need visa sponsorship?

Kaushik is an F-1 international student at Carnegie Mellon University. He has 3-year STEM OPT eligibility after his December 2027 graduation, and is open to employers who sponsor H-1B afterward.

What did Kaushik build at SAP Labs India?

At SAP Labs India (2024–2026) he engineered a GDPR-compliant, privacy-first RAG platform for SAP's internal chatbot. He scaled it to 2M+ documents and 400+ users with <2s p95 end-to-end latency, fine-tuned DeBERTa for Germany-specific PII detection (94% recall@10, MRR@10=0.82), and rewrote a credential-fetch client in dependency-free Go for 9,000+ Linux servers.

What are Kaushik's IEEE publications?

Two IEEE papers: 'Swarm Intelligence-Based Cooperative Intelligent Transportation System' (ICCIES 2025) and 'Cognitive Intrusion Detection System in Autonomous Vehicles Using Machine Learning' (ICPECTS 2024).

What is Kaushik's tech stack?

Python, Go, FastAPI, PyTorch, TensorFlow, Hugging Face Transformers, LangChain, PostgreSQL, Docker, Kubernetes, NVIDIA CUDA, Google Cloud Platform, and Microsoft Azure. Specializes in RAG pipelines, LLM fine-tuning (DeBERTa, QLoRA), and cloud observability.

A Dependency-Free Go Binary Is the Right Answer for a 9,000-Server Fleet

The problem

I was shipping a security-critical credential-fetch client to a fleet of roughly 9,000 Linux servers. The client did one thing — talk to a control plane over mutual TLS, pull a short-lived credential, hand it to the local agent, exit. It ran on a timer, on every host, forever. The blast radius of a bad rollout was the entire fleet, and the blast radius of a working exploit against the client was worse.

Two constraints framed everything. The client had to be small enough to reason about as an attack surface, and it had to survive being the same binary on 9,000 hosts that were not, in any meaningful sense, the same host.

That second constraint is the whole post.

The naive first approach

"We already have Python." That was the sentence that kept coming up. The rest of the platform was Python. The control plane was Python. The existing agent was Python. Writing the client in Python meant reusing the internal HTTP library, reusing the existing mTLS helpers, reusing the logging conventions, and reusing the developers who already knew all of that.

So the first version was a Python client. requests for HTTP, the standard ssl module for cert loading, a small wrapper around the internal credential API, packaged as a wheel and installed via the fleet's config-management pipeline.

It worked. On my laptop. On a staging host. On the first hundred production hosts.

Then it stopped working, in a different way, on every subsequent hundred.

What actually broke at fleet scale

9,000 hosts is not one problem. It's 9,000 slightly different problems that share a name.

Pinned interpreters, drifted. The fleet had Python 3.6, 3.8, 3.9, 3.10, and a small population of 3.11 machines that a well-meaning team had upgraded ahead of the rest. requests didn't care. cryptography cared a lot — the wheel we shipped needed a compatible OpenSSL, and "compatible OpenSSL" was a moving target across RHEL 7, RHEL 8, and a handful of SUSE variants.

Pip mirror reachability. The install path pulled wheels from an internal mirror. On the ~1% of hosts sitting behind a strange egress firewall, the install hung. On the ~0.1% of hosts whose proxy env vars had been half-set by an old Ansible run, the install failed noisily. On the handful of hosts whose clock was drifted enough to fail TLS to the mirror, the install failed cryptically.

TLS trust store drift. The mTLS handshake to the control plane needed a specific CA bundle. Python's ssl module happily used the system trust store, and the system trust store had been curated by three different teams on three different OS families over five years. Every host had roughly the right CAs. "Roughly" was doing a lot of work.

Systemd unit variations. The timer that ran the client was, in theory, one unit file. In practice it was one unit file plus every drop-in override that had accumulated since 2019, with Environment=PYTHONPATH=... lines pointing at Python installations that no longer existed on hosts that had been re-imaged twice.

None of these were bugs in the Python client. They were bugs in the assumption that a Python client is a thing you can ship, rather than a thing you have to keep alive against a hostile environment.

The decision

I rewrote the client as a statically linked Go binary built with CGO_ENABLED=0: no interpreter or external runtime required on the host, no dynamic linker calls, no external CA bundle assumed to exist. One executable, cross-compiled once, dropped onto every host in the fleet.

Static linking stopped being a compile flag and started being an operational primitive. The binary carried its own TLS stack, its own CA bundle, its own DNS resolver. The host contributed a kernel and a filesystem. Nothing else.
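One way to check the "no dynamic linker" property mechanically is to inspect the ELF headers of the built artifact. The sketch below is an illustration rather than part of the original client: it uses Go's debug/elf package to look for a PT_INTERP segment and DT_NEEDED entries. The names staticReport and inspectELF are invented for this example, and the default of inspecting the running program itself is only there to make the sketch runnable on Linux.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// staticReport holds the two ELF markers that indicate dynamic linking.
// (Hypothetical type, introduced for this sketch.)
type staticReport struct {
	Interpreter bool     // a PT_INTERP segment names a dynamic loader
	Libraries   []string // DT_NEEDED entries: shared libraries to resolve at load time
}

// Static reports whether the binary needs neither a loader nor shared libraries.
func (r staticReport) Static() bool {
	return !r.Interpreter && len(r.Libraries) == 0
}

// inspectELF opens the ELF file at path and collects its dynamic-linking markers.
func inspectELF(path string) (staticReport, error) {
	f, err := elf.Open(path)
	if err != nil {
		return staticReport{}, err
	}
	defer f.Close()

	var r staticReport
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			r.Interpreter = true
		}
	}
	libs, err := f.ImportedLibraries()
	if err != nil {
		return staticReport{}, err
	}
	r.Libraries = libs
	return r, nil
}

func main() {
	// Default target: this program itself; pass a path to inspect another binary.
	path, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "locate executable:", err)
		os.Exit(1)
	}
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	r, err := inspectELF(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "inspect:", err)
		os.Exit(1)
	}
	fmt.Printf("%s: static=%v interpreter=%v libraries=%v\n", path, r.Static(), r.Interpreter, r.Libraries)
}
```

A check like this could run as a release gate, failing the build if the artifact ever picks up a loader or a shared-library dependency.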

The tradeoff

The honest way to write this is a table. Numbers are from the actual rollout — one workload, one fleet, one credential-fetch call. Your mileage will vary.

| Axis | Python client (wheel + interpreter) | Go binary (`CGO_ENABLED=0`, static) |
| --- | --- | --- |
| Ship artifact | wheel + pinned deps + interpreter assumption | single ~7 MB ELF |
| Container image size (when we did package it) | ~180 MB (python:3.10-slim + deps) | ~9 MB (scratch + binary) |
| Cold start on the timer | 400-900 ms interpreter warmup | ~15 ms |
| Cross-compile cost | non-trivial: manylinux, per-OS wheel matrix | `GOOS=linux GOARCH=amd64 go build`, one command |
| Runtime dependencies on host | Python 3.x, OpenSSL, CA bundle, pip reachability | none |
| Security-scan surface | interpreter + stdlib + `requests` + `cryptography` + transitive | Go stdlib + one internal package |
| CVE response time | patch a transitive dep, rebuild wheel, redeploy across 9k hosts | rebuild binary, redeploy |
| Failure modes at rollout | dozens, mostly environmental | binary either runs or doesn't |
Debuggability on a single host	good — REPL, tracebacks	worse — need logs, no REPL

The trade was explicit. I paid in developer familiarity and per-host debuggability, and I bought the ability to reason about the client as a single artifact rather than as a distribution of possible artifacts.

At 90 servers, I would have kept the Python client. At 9,000, the operational primitive I needed was "one file, no assumptions."

Implementation notes

A few things mattered more than the language choice.

The HTTP client was as small as it could be. The Go standard library is enough. No third-party HTTP client, no retry framework, no middleware stack. The entire network path was a few dozen lines of net/http with a tls.Config built from an embedded CA bundle. Every dependency I didn't take was one fewer thing to CVE-scan and one fewer transitive graph to audit.

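// Roughly the credential-fetch call. Stdlib only.
// The CA bundle is compiled in via //go:embed.
package main

import (
    "context"
    "crypto/tls"
    "crypto/x509"
    _ "embed" // blank import is required for the //go:embed directive below
    "errors"
    "fmt"
    "io"
    "net/http"
    "time"
)

// caBundle holds the PEM-encoded control-plane CA, embedded at build time.
//go:embed control-plane-ca.pem
var caBundle []byte
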
func fetchCredential(ctx context.Context, endpoint string, clientCert tls.Certificate) ([]byte, error) {
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caBundle) {
        return nil, errors.New("embedded CA bundle failed to parse")
    }
 
    client := &http.Client{
        Timeout: 10 * time.Second,
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                RootCAs:      pool,
                Certificates: []tls.Certificate{clientCert},
                MinVersion:   tls.VersionTLS12,
            },
        },
    }
 
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
    if err != nil {
        return nil, err
    }
 
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
 
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("credential fetch: %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

The CA bundle was embedded, not read from disk. //go:embed put the trust anchor for the control plane into the binary itself. The host's trust store was no longer part of the failure surface. If someone had poisoned /etc/ssl/certs on a host, our client didn't care.
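The reason the host trust store drops out of the picture is that a non-nil RootCAs pool in tls.Config replaces the system roots entirely. The sketch below illustrates that behavior against a throwaway local TLS server from net/http/httptest. It is not the production client: canReach and pinnedVsEmpty are names made up for this example, and the test server's certificate stands in for the embedded control-plane CA.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// canReach reports whether a GET to url succeeds when the client trusts only roots.
func canReach(url string, roots *x509.CertPool) bool {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:    roots, // non-nil pool: the host trust store is not consulted
				MinVersion: tls.VersionTLS12,
			},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// pinnedVsEmpty starts a local TLS server and tries it with two pools:
// one containing the server's certificate, and one containing nothing.
func pinnedVsEmpty() (pinned, empty bool) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate()) // stand-in for the embedded CA bundle

	return canReach(srv.URL, pool), canReach(srv.URL, x509.NewCertPool())
}

func main() {
	pinned, empty := pinnedVsEmpty()
	fmt.Printf("pinned pool reachable=%v, empty pool reachable=%v\n", pinned, empty)
}
```

The handshake succeeds only when the pool explicitly contains the server's certificate, whatever the host's own store holds. The rejected attempt may also make the test server log a TLS handshake error to stderr, which is expected here.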

The binary was reproducible. -trimpath, -buildvcs=false, and a pinned Go toolchain meant the same source produced the same bytes on any build host. Reproducibility matters when a supply-chain question turns into "was the binary on host 4,182 the same binary we intended to ship?" and the answer needs to be a byte-for-byte comparison rather than a shrug.
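The byte-for-byte comparison reduces to comparing digests. Here is a minimal sketch, assuming SHA-256 as the digest (the post does not say which hash the rollout actually used); fileDigest and bytesDigest are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileDigest streams the file at path through SHA-256 and returns the hex digest.
func fileDigest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// bytesDigest is the in-memory equivalent of fileDigest.
func bytesDigest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Usage: digest [path [expected-hex]]. With no arguments, hash this program.
	path, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "locate executable:", err)
		os.Exit(1)
	}
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	got, err := fileDigest(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "digest:", err)
		os.Exit(1)
	}
	fmt.Println(got, path)
	if len(os.Args) > 2 && got != os.Args[2] {
		fmt.Fprintln(os.Stderr, "digest mismatch, expected", os.Args[2])
		os.Exit(1)
	}
}
```

With a reproducible build, the digest computed on any host can be checked against the digest of a fresh build from the same source and toolchain.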

Logging was structured, boring, and to stdout. Systemd collected it. No log framework, no rotation logic, no side channel. If journald could see it, we could see it.
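For a sense of what "structured, boring, and to stdout" can look like with only the standard library, here is a sketch using log/slog. It assumes a Go 1.21+ toolchain, and the event name and fields (credential_fetch, outcome, duration_ms, reason) are hypothetical, not the client's real schema.

```go
package main

import (
	"bytes"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a logger that writes one JSON object per line to w.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: slog.LevelInfo}))
}

// render logs a single event into a buffer and returns the line as written,
// which is handy for inspecting the output format.
func render(msg string, attrs ...any) string {
	var buf bytes.Buffer
	newLogger(&buf).Info(msg, attrs...)
	return buf.String()
}

func main() {
	// Writing to stdout is enough: under systemd, journald captures it.
	logger := newLogger(os.Stdout)
	logger.Info("credential_fetch", "outcome", "ok", "duration_ms", 42)
	logger.Error("credential_fetch", "outcome", "error", "reason", "handshake timeout")
}
```

Each call emits a single JSON line with time, level, msg, and the given key-value pairs, so there is no rotation or framing logic in the client at all.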

[Figure: A dense wall of small server tiles on the left represents 9,000 hosts, each running the 7 MB static Go binary in a scratch container. A control plane on the right, a rounded box labeled 'credential API · mTLS · short-lived tokens', fans thin arrows out to the fleet; a second rounded box below it labels the rollout channel with a single binary hash and destination count. Underneath the fleet, a struck-through band lists what is not required on the host: python3 runtime, OpenSSL version pin, CA trust bundle, and a reachable pip or apt mirror.]

Caption: One binary, one hash, 9,000 destinations. The struck-through pills are what the host stopped needing the moment we stopped shipping a Python interpreter.

What surprised me

The Go binary was smaller than the Python container image, and it wasn't close.

I had expected the tradeoff to be "you give up some artifact size for operational simplicity." What I got was a 9 MB scratch-based container running a statically linked binary versus a 180 MB python:3.10-slim image running the equivalent Python client with its pinned deps. The Python image was twenty times larger, and outside a container the same client still couldn't run on a host that didn't already have a compatible libc and OpenSSL.

The security-scan surface followed the same shape. Our container scanner produced page after page of findings against the Python image — most of them in transitive dependencies of cryptography, most of them not actually exploitable by our client, all of them requiring triage. The Go binary produced a scan result that fit on one screen. When a real CVE landed in the Go standard library, we rebuilt one binary and rolled it. When a real CVE landed in cryptography, we would have been rebuilding a wheel matrix.

I had gone in expecting to argue for the Go binary on the grounds that it was operationally simpler. I ended up arguing for it on the grounds that it was smaller and safer, which is a better argument, and one I hadn't expected to make.

What I'd do differently at 10x scale

At 90,000 servers, the client itself is still the right shape. The things around it are what I'd change.

Fleet observability, not host observability. At 9,000 hosts, "grep the logs" was a viable debug strategy for the long tail of weird cases. At 90,000, it isn't. I'd ship the client with a small, opinionated telemetry emitter (structured events, a bounded queue, a single sink) so that fleet-level failure rates were queryable in a dashboard rather than reconstructed from journald across ten thousand hosts.

Auto-rollback on rollout signal. The rollout channel we used was a config-management push. It was fine at 9,000. At 90,000, a bad binary reaching even 1% of hosts is a 900-host incident. I'd want the rollout to be canary-first, watch a live error-rate signal from the telemetry emitter, and pull the artifact automatically if that signal crossed a threshold. Humans should not be the interlock on a 90,000-host push.

Signed artifacts, verified on-host. Reproducible builds get you halfway. The other half is the host verifying that the binary it just received is the binary the release process actually signed. I'd use cosign-style verification against a public key baked into the previous binary generation, with a clear roll-forward story when the signing key rotates.

Two channels, not one. A stable channel and a canary channel, with the canary population deliberately weighted toward the weirdest hosts: old kernels, tight egress, unusual clocks. The bugs I saw at 9,000 all came from the long tail. I'd want the long tail to see the binary first, not last.
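The telemetry emitter described above (structured events, a bounded queue, a single sink) could be sketched roughly as follows. This is a hypothetical design, not something that shipped: the Event fields, the Emitter type, and the choice to flush once before exit (which suits a short-lived, timer-driven client) are all assumptions, and it needs Go 1.19+ for atomic.Int64.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"sync/atomic"
)

// Event is a hypothetical structured telemetry record.
type Event struct {
	Host    string `json:"host"`
	Outcome string `json:"outcome"`
}

// Emitter buffers events in a fixed-size queue and writes them to one sink.
type Emitter struct {
	queue   chan Event
	dropped atomic.Int64
}

// NewEmitter returns an emitter that holds at most capacity unsent events.
func NewEmitter(capacity int) *Emitter {
	return &Emitter{queue: make(chan Event, capacity)}
}

// Emit enqueues ev without blocking. It returns false, and counts a drop,
// when the queue is full: telemetry must never stall the credential fetch.
func (e *Emitter) Emit(ev Event) bool {
	select {
	case e.queue <- ev:
		return true
	default:
		e.dropped.Add(1)
		return false
	}
}

// Dropped returns how many events were discarded because the queue was full.
func (e *Emitter) Dropped() int64 { return e.dropped.Load() }

// Flush writes every queued event to sink as one JSON object per line
// and returns the number written.
func (e *Emitter) Flush(sink io.Writer) (int, error) {
	enc := json.NewEncoder(sink)
	sent := 0
	for {
		select {
		case ev := <-e.queue:
			if err := enc.Encode(ev); err != nil {
				return sent, err
			}
			sent++
		default:
			return sent, nil // queue is empty
		}
	}
}

func main() {
	e := NewEmitter(2)
	for _, outcome := range []string{"ok", "ok", "tls_error"} {
		if !e.Emit(Event{Host: "host-0001", Outcome: outcome}) {
			fmt.Fprintln(os.Stderr, "queue full, event dropped")
		}
	}
	sent, err := e.Flush(os.Stdout)
	fmt.Fprintf(os.Stderr, "sent=%d dropped=%d err=%v\n", sent, e.Dropped(), err)
}
```

The bound is the point: a full queue costs a counted drop, never memory growth or a blocked client.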
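The automatic interlock from the auto-rollback item can be reduced to a small decision function. The sketch below is an assumption about how such a gate might look, not a description of an existing rollout system: the gate function, the Decision type, and the parameters minReports and maxFailureRate are all invented for illustration.

```go
package main

import "fmt"

// Decision is the outcome of evaluating the canary signal.
type Decision int

const (
	Wait     Decision = iota // too few canary reports to judge
	Proceed                  // failure rate within budget, continue the rollout
	Rollback                 // failure rate over budget, pull the artifact
)

// String makes decisions readable in logs.
func (d Decision) String() string {
	switch d {
	case Wait:
		return "wait"
	case Proceed:
		return "proceed"
	case Rollback:
		return "rollback"
	}
	return "unknown"
}

// gate decides what the rollout should do given canary telemetry.
// reports is the number of canary hosts that have reported, failures is how
// many of those reported an error, minReports is the sample size required
// before judging, and maxFailureRate is the tolerated fraction of failures.
func gate(reports, failures, minReports int, maxFailureRate float64) Decision {
	if reports == 0 || reports < minReports {
		return Wait
	}
	if float64(failures)/float64(reports) > maxFailureRate {
		return Rollback
	}
	return Proceed
}

func main() {
	fmt.Println(gate(40, 0, 100, 0.01))   // wait: only 40 of the required 100 reports
	fmt.Println(gate(900, 5, 100, 0.01))  // proceed: about 0.6% failures
	fmt.Println(gate(900, 27, 100, 0.01)) // rollback: 3% failures
}
```

A real gate would also need a time window and some protection against a canary population that simply never reports; both are left out here.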
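On-host verification of a signed artifact can be illustrated with the standard library alone. To be clear about the substitution: the sketch below uses a plain Ed25519 signature from crypto/ed25519, not cosign itself, and it generates a throwaway key pair in place of a public key baked into the previous binary generation. verifyArtifact and selfCheck are hypothetical names.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// verifyArtifact reports whether sig is a valid Ed25519 signature over
// artifact under the trusted public key.
func verifyArtifact(trusted ed25519.PublicKey, artifact, sig []byte) bool {
	if len(trusted) != ed25519.PublicKeySize {
		return false // ed25519.Verify panics on a wrong-size key, so reject it here
	}
	return ed25519.Verify(trusted, artifact, sig)
}

// selfCheck signs a stand-in artifact with a fresh key pair, then verifies
// both the genuine bytes and a tampered copy.
func selfCheck() (genuine, tampered bool, err error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, err
	}
	artifact := []byte("stand-in for the bytes of the release binary")
	sig := ed25519.Sign(priv, artifact)

	genuine = verifyArtifact(pub, artifact, sig)

	modified := append([]byte(nil), artifact...)
	modified[0] ^= 0xff // flip bits in the first byte to simulate tampering
	tampered = verifyArtifact(pub, modified, sig)

	return genuine, tampered, nil
}

func main() {
	genuine, tampered, err := selfCheck()
	if err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	fmt.Printf("genuine verifies=%v, tampered verifies=%v\n", genuine, tampered)
}
```

In a real rollout the private key would live only in the release process, and key rotation would need the roll-forward story mentioned above, which this sketch does not attempt.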

The meta-lesson: at fleet scale, the language is a rounding error and the dependency graph is the whole game. Python was not the wrong language. Python-with-a-runtime-and-a-pinned-dependency-tree-that-has-to-exist-on-9,000-hosts was the wrong shipping unit. The right shipping unit was a single file that answered every environmental question with "I brought my own."

More on the platform work behind this — mutual-TLS credential fanout, fleet rollouts, and the security posture that motivated it — is on the projects page.

Cite as: Saravanan, K. (2026). A Dependency-Free Go Binary Is the Right Answer for a 9,000-Server Fleet. Kaushik Saravanan. https://www.kaushik.cv/blog/dependency-free-go-for-9k-server-fleet