How does removing transcription change speech-to-speech generation latency?

This explores what happens to voice-AI response speed when you skip the usual middle step of turning speech into text first — and why that step costs time in the first place.

The short version: latency drops sharply. Most voice assistants run a three-stage pipeline: transcribe your speech to text, feed the text to a language model, then convert the model's text reply back into spoken audio. Each hop adds delay, and the transcription step in particular typically makes the system wait until you've finished speaking before it can begin. LLaMA-Omni cuts that step out entirely, generating a spoken response directly from the speech signal and reaching roughly 226 milliseconds of latency (see "Can skipping transcription make voice assistants faster?"). That's fast enough to feel like a real conversation rather than a walkie-talkie exchange.
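The "each hop adds delay" point can be made concrete with a small latency-budget sketch. Everything here is an assumption for illustration: the stage names and millisecond values are placeholders, not measurements from LLaMA-Omni or any real system, and the model simply assumes stages run back to back once the user stops speaking.

```python
def response_latency_ms(stage_latencies_ms: dict[str, float]) -> float:
    """Delay from end of user speech to first audio, assuming the
    stages run strictly one after another (no overlap)."""
    return sum(stage_latencies_ms.values())


# Hypothetical cascaded pipeline: speech -> text -> LLM -> speech.
cascaded = {
    "transcription": 300.0,      # placeholder value
    "llm_first_token": 200.0,    # placeholder value
    "tts_first_audio": 150.0,    # placeholder value
}

# Hypothetical direct pipeline: no transcription hop at all.
direct = {
    "speech_encoder": 40.0,      # placeholder value
    "llm_first_token": 200.0,    # placeholder value
    "streaming_vocoder": 60.0,   # placeholder value
}

print("cascaded:", response_latency_ms(cascaded), "ms")
print("direct:  ", response_latency_ms(direct), "ms")
```

The only thing the sketch is meant to show is structural: removing a hop removes its whole term from the sum, whatever the real numbers turn out to be.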

The interesting part is *why* removing transcription buys so much speed, and that's where the corpus rewards lateral reading. Text is lossy. When you transcribe speech, you throw away the acoustic information (prosody, articulation, timing) and keep only the words. Speech embeddings preserve that richer signal, which means the model can start composing a reply before the full input has even arrived (see "Can skipping transcription make voice assistants faster?"). There's a deeper reason this works at all: self-supervised speech models don't just memorize words, they infer the physical articulatory processes that produce sound, the language-agnostic 'physics' of the vocal tract (see "Do speech models learn language-specific sounds or universal physics?"). Because the representation is grounded in how speech is actually generated rather than in a phonetic transcript, the model has something meaningful to work with directly, no text intermediary required.

There's a second lever on latency worth knowing about, because removing transcription is only half the story. The other half is how fast the model generates its output. Standard language models produce one token at a time, strictly left to right, which sets a hard floor on speed. Diffusion language models attack that floor by generating in parallel: Discrete Diffusion Forcing hybridizes block-wise autoregressive decoding with inter-block parallelism and KV-cache reuse to break the sequential-speed barrier (see "Can diffusion language models match autoregressive inference speed?"). Pair a transcription-free front end with a parallelized generator and you're cutting delay at both ends of the pipeline.

What you might not have expected to learn: this speed comes from a genuinely different relationship with time. Token-by-token text generation is sequential but 'atemporal': there's no pause for reflection or revision between tokens, just probabilistic selection unfolding in order (see "Does AI text generation unfold through temporal reflection?"). The very property that makes these systems fast (no deliberation, no waiting, no looking back) is the same property that makes their fluency feel different from human conversation, where the time spent thinking actually changes what gets said next. The 226 ms figure isn't just an engineering win; it's a window into what these systems trade away to be quick.

Sources (4 notes)

Can skipping transcription make voice assistants faster?

LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Can diffusion language models match autoregressive inference speed?

Discrete Diffusion Forcing breaks the speed barrier through block-wise autoregressive generation with KV cache reuse and inter-block parallel decoding. This hybrid approach recovers both the compute efficiency of AR and the parallelism advantage of diffusion.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
