The <700ms Latency Budget for a Personal AI Voicemail Line

The problem

I wanted a phone number that answered as me. Not a chatbot on a website, not a Discord bot — a real E.164 number that anyone could dial and get an intelligent voice on the other end. That became Dyx: +1 (484) 270-7074, live at voicemail.kaushik.cv, wired up so callers get a conversation instead of a beep.

The whole thing lives or dies on one number: perceived latency to first spoken word. Humans notice a conversational gap at about 300ms. They start feeling awkward at 500. Past a second, they think the line dropped. My budget was 700ms end-to-end, p95, from "caller stops talking" to "caller hears the first phoneme of the response."

That number is the whole post. Everything else is a consequence of it.


The naive first approach

The obvious 2026 answer is a single-model speech-to-speech loop. Gemini Live does this. GPT-4o Realtime does this. You skip the STT and TTS boxes entirely: audio goes in, audio comes out, the model handles turn detection natively, and the vendor benchmarks quote 200-300ms end-to-end. It's the same architectural shift as going from separate embedding-plus-reranker pipelines to a single joint retrieval model: fewer boxes, fewer hops, fewer places for latency to hide.

So I built it. LiveKit Cloud Agent, livekit-plugins-google realtime plugin, a Gemini API key vended from my CipherStack instance, one Python file, done.

Two things broke.

The first was auth. The Gemini keys in my CipherStack rotation are billing-enabled generative-language keys — they route just fine for generateContent, they route just fine for streaming text, and they hard-fail on the Live API with a 403 that reads like a permissions error but is actually a project-level API-enablement error. The Live API is gated behind a separate console toggle that my vended keys didn't have flipped on. The rotation vends the least-recently-used key, which meant every retry hit a different key with the same missing flag. I could have manually enabled Live on eight projects, but that defeats the point of a key vault.
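A toy model of that retry failure, assuming a vault that vends keys least-recently-used first. `KeyVault`, `ApiKey`, `call_live_api`, and the `live_enabled` flag are hypothetical stand-ins for illustration, not CipherStack's or Google's actual API:

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiKey:
    project: str
    live_enabled: bool  # hypothetical stand-in for the per-project console toggle


class KeyVault:
    """Vends the least-recently-used key, then moves it to the back of the line."""

    def __init__(self, keys):
        self._keys = deque(keys)

    def vend(self) -> ApiKey:
        key = self._keys.popleft()
        self._keys.append(key)
        return key


def call_live_api(key: ApiKey) -> int:
    """Pretend Live API call: 403 unless the key's project has Live enabled."""
    return 200 if key.live_enabled else 403


def connect_with_retries(vault: KeyVault, attempts: int):
    """Retry with a freshly vended key each time; return (status, projects tried)."""
    tried = []
    for _ in range(attempts):
        key = vault.vend()
        tried.append(key.project)
        if call_live_api(key) == 200:
            return 200, tried
    return 403, tried


# Eight projects, none with the Live toggle flipped on.
vault = KeyVault([ApiKey(f"project-{i}", live_enabled=False) for i in range(8)])
status, tried = connect_with_retries(vault, attempts=3)
print(status, tried)
```

Every retry lands on a different project and fails the same way, which is why retrying never helps here: the fault is in per-project configuration, not in any single key.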

The second was the tradeoff I hadn't priced. Even when Live works, you're betting the whole conversation on one model handling STT, LLM reasoning, and TTS. If the model has a bad moment — hallucinates the transcript, picks the wrong voice affect, stalls — you have no fallback, no observability point between stages, and no way to swap components independently. It's the IVF-PQ problem in a different domain: you buy latency by fusing steps, and you pay in stability.

I pivoted.


The decision

Three-stage pipeline: STT → LLM → TTS. Individual latency budgets per stage. Hosted inference at every step so I'm not babysitting a GPU. LiveKit Cloud Agents doing the audio transport and turn detection between them.

  • STT: Deepgram nova-3 (streaming, endpointing at ~300ms of silence)
  • LLM: Groq llama-3.3-70b-versatile (hosted, streaming tokens)
  • TTS: Cartesia sonic-3 (streaming, first-byte latency optimized)
  • Transport: LiveKit Cloud, WebRTC to the caller, gRPC/WebSocket to each provider

The tradeoff

The honest way to write this is a table. Numbers below are directional, from Dyx in production over ~50 real calls — not a synthetic benchmark. Your mileage will vary with region, PSTN routing, and prompt length.

Stage                                | Provider                | p50    | p95    | Budget
Caller silence → endpoint detected   | LiveKit VAD             | 180ms  | 220ms  | 250ms
STT first token → final transcript   | Deepgram nova-3         | 150ms  | 250ms  | 250ms
LLM TTFT (time to first token)       | Groq llama-3.3-70b      | 150ms  | 200ms  | 200ms
TTS first-byte audio                 | Cartesia sonic-3        | 200ms  | 300ms  | 300ms
LiveKit routing + network            | WebRTC + PSTN gateway   | 50ms   | 100ms  | 100ms
Total, streaming-overlapped          |                         | ~500ms | ~680ms | 700ms
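The per-stage budgets in the table can be checked mechanically. This is a minimal sketch using only the table's numbers; the variable and function names are made up for illustration. Adding up per-stage p95s is a rough figure (tails rarely line up), but it shows how far a fully serial pipeline would sit from the overlapped total:

```python
# (stage, p50_ms, p95_ms, budget_ms), copied from the table above.
STAGES = [
    ("endpoint detection", 180, 220, 250),
    ("STT final transcript", 150, 250, 250),
    ("LLM first token", 150, 200, 200),
    ("TTS first byte", 200, 300, 300),
    ("routing + network", 50, 100, 100),
]
TOTAL_BUDGET_MS = 700
REPORTED_P95_MS = 680  # the overlapped total reported in the table


def over_budget(stages):
    """Names of stages whose p95 exceeds their individual budget."""
    return [name for name, _p50, p95, budget in stages if p95 > budget]


serial_p50 = sum(stage[1] for stage in STAGES)  # if nothing overlapped, p50
serial_p95 = sum(stage[2] for stage in STAGES)  # if nothing overlapped, p95
headroom = TOTAL_BUDGET_MS - REPORTED_P95_MS

print("over budget:", over_budget(STAGES))
print("serial p50 / p95:", serial_p50, serial_p95)
print("headroom at p95:", headroom)
```

No stage exceeds its own budget, yet the serial sums land well past 700ms. Only the overlap between stages brings the total under it.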

The stages overlap. That's the whole trick. Deepgram is streaming partial transcripts before the caller finishes talking. Groq starts generating tokens the instant a stable partial arrives. Cartesia starts synthesizing audio the instant the first LLM token lands. By the time the final transcript is confirmed, the LLM is often already three tokens in, and by the time the LLM's first sentence is done, Cartesia's first PCM frame is already flying to the caller's ear over WebRTC. On the parts that can be overlapped, wall-clock latency is set by the longest chain of dependent waits, not by the sum of every stage.
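To make the overlap argument concrete, here is a toy model rather than a measurement. It assumes every stage handles fixed-cost chunks and that all input chunks are available at time zero; the stage costs and function names are invented for illustration. It compares time to first output when stages forward chunks immediately versus when each stage waits for its upstream to finish:

```python
def streaming_first_output(stage_ms, n_chunks):
    """Time until the last stage emits its first chunk when stages stream.

    done[i][k] is when chunk k leaves stage i: a stage can start a chunk only
    once upstream has produced it and the stage has finished its previous chunk.
    """
    done = [[0] * n_chunks for _ in stage_ms]
    for i, cost in enumerate(stage_ms):
        for k in range(n_chunks):
            upstream_ready = done[i - 1][k] if i > 0 else 0
            stage_free = done[i][k - 1] if k > 0 else 0
            done[i][k] = max(upstream_ready, stage_free) + cost
    return done[-1][0]


def batch_first_output(stage_ms, n_chunks):
    """Same pipeline, but each stage waits for all upstream chunks first."""
    *upstream, last = stage_ms
    return sum(cost * n_chunks for cost in upstream) + last


stage_ms = [50, 40, 60]  # hypothetical per-chunk cost of STT, LLM, TTS
print(streaming_first_output(stage_ms, n_chunks=4))
print(batch_first_output(stage_ms, n_chunks=4))
```

With these made-up costs the streaming version produces its first output after one pass through each stage, while the batch version pays for every upstream chunk before the last stage can start.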

At 2M docs, HNSW was the right index. On a voice agent, streaming with backpressure between every stage is the same shape of decision: pay in coordination complexity, buy stability at every hop.


Implementation notes

A few things that mattered more than the model choices.

LiveKit's pipeline abstraction did the right thing by default. The AgentSession API composes STT/LLM/TTS as pluggable components and handles the streaming stitchwork — partial transcripts get forwarded to the LLM plugin, streamed tokens get chunked into TTS-friendly sentences, generated audio gets pushed to the caller's WebRTC track. I did not write any of the plumbing. I picked components and set knobs.

# The whole pipeline. Genuinely.
# Every knob here was picked for latency, not quality.

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, groq, silero

# Placeholder only; the production prompt is not shown in this post.
SYSTEM_PROMPT = "<system prompt goes here>"


async def entrypoint(ctx: agents.JobContext):
    # Assumed standard worker boilerplate: join the room LiveKit dispatched us to.
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(
            model="nova-3",
            interim_results=True,           # start LLM on partials, not finals
            endpointing_ms=300,             # aggressive; retune per caller
            smart_format=True,
        ),
        llm=groq.LLM(
            model="llama-3.3-70b-versatile",
            temperature=0.6,
            # Groq's TTFT is the win: 150ms p50 for a 70b model is absurd.
        ),
        tts=cartesia.TTS(
            model="sonic-3",
            voice="a-neutral-warm-voice-id",
            # sonic-3 streams first PCM frame at ~200ms, not ~800ms like older models.
        ),
        vad=silero.VAD.load(),              # local, no network hop for turn detection
    )

    await session.start(agent=Agent(instructions=SYSTEM_PROMPT), room=ctx.room)


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

That's it. Ninety percent of the "latency work" was picking providers whose streaming primitives compose without buffering surprises, then setting endpointing aggressively enough that turns actually flow.
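One piece of the stitchwork LiveKit handles is turning streamed LLM tokens into sentence-sized chunks for TTS. The sketch below shows the general idea only; it is not LiveKit's implementation, and `chunk_sentences` is a hypothetical helper with a deliberately naive rule that would mis-split abbreviations like "Dr.":

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence as soon as it closes."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hi", ",", " this", " is", " Dyx", ".", " Who", " is",
          " calling", "?", " One", " sec"]
sentences = list(chunk_sentences(tokens))
print(sentences)
```

The point for latency is that the first sentence can go to TTS while the LLM is still generating the second.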

The endpointing knob is where p95 lives. Deepgram exposes an endpointing_ms setting: how long a silence to wait before declaring the caller's turn over. Set it too high (600ms) and the conversation feels sludgy no matter how fast the rest of the pipeline is. Set it too low (150ms) and you cut people off mid-thought. 300ms was the sweet spot for phone-call cadence; a video-call agent would want closer to 500 because people pause differently on video. This is the closest analogue to nprobe in a vector index: one number that dominates the tail.
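The tradeoff is easy to see on a synthetic utterance. This sketch assumes a voice-activity detector that emits one speech/silence flag per 50ms frame; `endpoint_time_ms` and the frame pattern are invented for illustration and are not Deepgram's algorithm:

```python
def endpoint_time_ms(frames, frame_ms, silence_ms):
    """Return when the turn is declared over, or None if it never is.

    The endpoint fires once silence following any speech has lasted silence_ms.
    """
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += frame_ms
            if silent_run >= silence_ms:
                return (index + 1) * frame_ms
    return None


# 400ms of speech, a 200ms mid-thought pause, 400ms more speech, then silence.
# The caller actually stops talking at 1000ms.
frames = [True] * 8 + [False] * 4 + [True] * 8 + [False] * 12

for threshold in (150, 300, 600):
    print(threshold, endpoint_time_ms(frames, frame_ms=50, silence_ms=threshold))
```

At 150ms the endpoint fires inside the mid-thought pause, before the caller has finished. At 300ms and 600ms it waits for the real end of speech, and the threshold is paid in full as added latency on every turn.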

Groq is the load-bearing choice. Llama-3.3-70b on Groq's LPUs runs at ~300 tokens/second with ~150ms TTFT. The same model on a hosted-GPU provider (Together, Fireworks) runs at ~50 tokens/second with 400-600ms TTFT. That difference is the whole latency budget of a slower stage. I did not want to run a 70b model on my own GPUs for a personal voicemail line, and Groq made the "hosted, cheap, fast enough" corner of the tradeoff space actually livable.
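A back-of-the-envelope version of that comparison, using the throughput and TTFT figures quoted above. The 15-token first sentence is an assumption chosen for illustration, and `first_sentence_ms` is a hypothetical helper:

```python
def first_sentence_ms(ttft_ms, tokens_per_second, sentence_tokens):
    """Rough time until a full first sentence is available to hand to TTS."""
    return ttft_ms + 1000 * sentence_tokens / tokens_per_second


SENTENCE_TOKENS = 15  # assumed length of a short spoken sentence

groq_ms = first_sentence_ms(150, 300, SENTENCE_TOKENS)
gpu_best_ms = first_sentence_ms(400, 50, SENTENCE_TOKENS)
gpu_worst_ms = first_sentence_ms(600, 50, SENTENCE_TOKENS)

print(groq_ms, gpu_best_ms, gpu_worst_ms)
```

Under these assumptions, the slower provider spends the entire 700ms budget or more before TTS has a complete sentence to work with.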

Figure: Stacked-lane latency waterfall for the Dyx voice pipeline. Four horizontal lanes overlap in time: LiveKit VAD endpoint (0–220ms), Deepgram nova-3 STT (100–350ms), Groq llama-3.3-70b LLM (300–550ms), and Cartesia sonic-3 TTS (400–800ms+). A tangerine vertical marker at 680ms flags the first phoneme in the caller's ear. A shaded band from 250 to 550ms marks the thinking-sound cover-up. A total bar at the bottom shows 680ms end-to-end (p95) against a 700ms budget.
Caption: The stages overlap. Total latency is set by the overlapped critical path (~680ms), not by the sum of the stages (~770ms). The shaded band from 250–550ms is the thinking sound holding the acoustic space while the pipeline works.

What surprised me

Perception is not latency.

I spent a week tuning the pipeline down from 900ms p95 to 680ms p95, and it moved the needle less than a two-line change I made afterward: playing a quiet, barely-audible "thinking sound" — background typing audio, low volume, looping — the instant Deepgram fired its final transcript. The sound holds the acoustic space while Groq is generating tokens and Cartesia is synthesizing the first frame. Callers who tested Dyx before and after the thinking sound consistently rated the "after" version as feeling faster, even when the actual wall-clock latency was identical.

The human brain treats silence as "did the line drop." It treats ambient noise as "someone is preparing to speak." Fill the void and 700ms starts feeling like 300, because you've moved the perceived clock from "silence-to-speech" to "sound-to-speech." Every phone system in history has known this — hold music, hold beeps, keypress feedback tones — and I got to rediscover it on a personal project.

If I'd known that going in, I would have built the thinking sound before I optimized the pipeline. The perceptual lift per hour of work was 10x higher than the technical lift.
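The two-line hack amounts to starting a loop on one pipeline event and stopping it on another. A minimal sketch, assuming event names `transcript_final` and `tts_first_audio` and injected start/stop callbacks; all of these are hypothetical and not part of LiveKit's API:

```python
class ThinkingSound:
    """Plays a looping filler sound between transcript-final and first TTS audio."""

    def __init__(self, start_loop, stop_loop):
        self._start_loop = start_loop
        self._stop_loop = stop_loop
        self.playing = False

    def on_event(self, event):
        if event == "transcript_final" and not self.playing:
            self._start_loop()
            self.playing = True
        elif event == "tts_first_audio" and self.playing:
            self._stop_loop()
            self.playing = False
        # Any other event leaves the sound in its current state.


log = []
sound = ThinkingSound(lambda: log.append("start"), lambda: log.append("stop"))
for event in ["partial_transcript", "transcript_final", "llm_first_token",
              "tts_first_audio", "tts_first_audio"]:
    sound.on_event(event)
print(log)
```

The guard on `playing` keeps repeated events from restarting or double-stopping the loop, which matters once real audio playback sits behind the callbacks.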


What I'd do differently at 10x scale

At one caller, this pipeline is fine. At a hundred concurrent, the trade flips.

The path I would take:

  1. Gemini Live once the API access materializes. The single-model speech-to-speech architecture is genuinely faster when it works, and the observability tradeoff matters less when you have call recordings to review offline anyway. The 200ms ceiling is not fictional; it's just gated behind a console toggle I didn't want to babysit across a rotation of vault-vended keys.
  2. A distilled S2S model. Kyutai's Moshi runs speech-to-speech at ~200ms on a single GPU, weights open. At my scale that's over-engineering; at a hundred concurrent callers, the economics of a self-hosted 7B S2S model start beating three hosted providers on an aggregated bill.
  3. Per-caller endpointing. Some people pause a lot mid-sentence. Some people talk over each other. Static 300ms endpointing is the vector-search equivalent of a single global nprobe — it works on average and thrashes at the tails. A learned endpointing model that adapts to each caller's cadence would move p95 down another 100ms without cutting anyone off.
  4. Move the "thinking sound" into the pipeline as a first-class primitive. Right now it's a hack triggered on transcript-final. It should be a state machine driven by pipeline events, with different sounds for "processing your question" vs. "generating a long response" vs. "I'm actually stuck." The perception knob deserves the same engineering attention as the latency knobs.
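For the per-caller endpointing idea, a simple heuristic can stand in for a learned model. The sketch below is not a learned model: it sets the silence threshold from the 90th percentile of pauses the caller has already resumed speaking after, plus a margin, clamped to sane bounds. `AdaptiveEndpointer` and all of its defaults are assumptions for illustration:

```python
from collections import deque


class AdaptiveEndpointer:
    """Per-caller silence threshold derived from that caller's recent pauses."""

    def __init__(self, default_ms=300, floor_ms=150, ceiling_ms=600,
                 margin_ms=50, window=20, min_samples=5):
        self.default_ms = default_ms
        self.floor_ms = floor_ms
        self.ceiling_ms = ceiling_ms
        self.margin_ms = margin_ms
        self.min_samples = min_samples
        self._pauses = deque(maxlen=window)  # most recent mid-utterance pauses

    def observe_pause(self, pause_ms):
        """Record a pause after which the caller kept talking."""
        self._pauses.append(pause_ms)

    def threshold_ms(self):
        """Static default until enough pauses are seen, then p90 plus margin."""
        if len(self._pauses) < self.min_samples:
            return self.default_ms
        ordered = sorted(self._pauses)
        p90 = ordered[int(0.9 * (len(ordered) - 1))]
        return max(self.floor_ms, min(self.ceiling_ms, p90 + self.margin_ms))


fast, slow = AdaptiveEndpointer(), AdaptiveEndpointer()
for pause in (80, 90, 100, 110, 120):
    fast.observe_pause(pause)
for pause in (300, 350, 400, 450, 500):
    slow.observe_pause(pause)
print(AdaptiveEndpointer().threshold_ms(), fast.threshold_ms(), slow.threshold_ms())
```

A fast talker gets a shorter wait than the static 300ms and a deliberate talker gets a longer one, which is the tail behavior a single global setting cannot provide.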

The meta-lesson: the naive architecture (single-model realtime) is right for the scale it's built for, and the pipeline architecture is right for the scale I'm at. IVF-PQ is the right answer at 200M vectors and the wrong answer at 2M. Gemini Live is the right answer at ten thousand concurrent callers and the wrong answer at one hobbyist with a rotation of API keys that don't have the right flag flipped. The engineering discipline is being honest about which regime you're in.



Dyx is live at voicemail.kaushik.cv. Or just call +1 (484) 270-7074 — it'll pick up. More on it on the projects page.

[Interactive widget: dyx voicemail latency waterfall (live sim). Sliders set per-stage latency against the 700ms budget, with a "Simulate degraded network" mode. Defaults reflect the prod stack: STT (Deepgram nova-3, streaming, endpoint after 300ms silence) 180ms; LLM first token (Groq llama-3.3-70b on Groq LPU) 220ms; TTS first audio (Cartesia sonic-3) 150ms; network + WebRTC jitter (round-trip + LiveKit relay hop) 80ms. Total 630ms, +70ms headroom.]

Cite as: Saravanan, K. (2026). The <700ms Latency Budget for a Personal AI Voicemail Line. Kaushik Saravanan. https://www.kaushik.cv/blog/dyx-latency-budget