Can skipping transcription make voice assistants faster?
Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?
Conventional voice assistants run a three-stage pipeline: automatic speech recognition (ASR) converts speech to text, an LLM generates a text response, and text-to-speech (TTS) renders that response as audio. Each stage adds latency and propagates errors, and the total response time is dominated by stage-by-stage processing rather than by the LLM's reasoning. LLaMA-Omni eliminates the transcription step entirely. It integrates a pretrained speech encoder, a speech adaptor, an LLM (built on Llama-3.1-8B-Instruct), and a streaming speech decoder so that the system generates text and speech responses directly from speech instructions, achieving a response latency as low as 226 milliseconds.
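To make the replaced step concrete, here is a minimal sketch of the piece that stands in for transcription: a speech adaptor that stacks consecutive speech-encoder frames and projects them into the LLM's embedding space, so the LLM consumes speech embeddings instead of a transcript. The class name, dimensions, and downsampling stride below are illustrative assumptions, not LLaMA-Omni's actual configuration.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Illustrative adaptor: stacks consecutive speech-encoder frames
    (downsampling in time) and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=512, llm_dim=1024, stride=5):
        super().__init__()
        self.stride = stride
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):                      # frames: (batch, T, enc_dim)
        b, t, d = frames.shape
        t = t - t % self.stride                     # drop any ragged tail frames
        stacked = frames[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(stacked)                   # (batch, T // stride, llm_dim)

# Cascade: speech -> ASR transcript -> LLM -> TTS, three serialized hops.
# Direct:  speech -> encoder frames -> adaptor -> LLM prefix, no transcript.
frames = torch.randn(1, 100, 512)                   # stand-in for speech-encoder output
speech_prefix = SpeechAdaptor()(frames)              # prepended to the LLM input embeddings
print(speech_prefix.shape)                           # torch.Size([1, 20, 1024])
```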
The architectural lesson is that transcription is not a free intermediate representation: it is a serialization step that destroys prosodic information and forces full-utterance processing before generation can begin. By passing speech embeddings directly into the LLM via an adaptor, LLaMA-Omni preserves the acoustic information the LLM might use, including the articulatory substrate identified in "Do speech models learn language-specific sounds or universal physics?", and lets generation begin as soon as enough of the input has been encoded, without waiting for a complete textual transcript.
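On the output side, the streaming speech decoder is what lets audio start before the full text response exists. The toy sketch below assumes a decoder that maps LLM hidden states, arriving chunk by chunk during generation, to discrete acoustic unit ids for a vocoder; the `UnitDecoder` class and the fake hidden-state generator are hypothetical simplifications, not the paper's actual decoder.

```python
import torch
import torch.nn as nn

class UnitDecoder(nn.Module):
    """Toy stand-in for a streaming speech decoder: maps each LLM hidden
    state to a discrete acoustic unit id that a vocoder could synthesize."""
    def __init__(self, llm_dim=1024, n_units=1000):
        super().__init__()
        self.head = nn.Linear(llm_dim, n_units)

    def forward(self, hidden):                      # hidden: (chunk_len, llm_dim)
        return self.head(hidden).argmax(dim=-1)     # (chunk_len,) unit ids

def llm_hidden_state_chunks(n_steps=12, llm_dim=1024, chunk=4):
    """Hypothetical generator standing in for the LLM: yields hidden states
    in small chunks as text tokens are produced, not all at once."""
    for start in range(0, n_steps, chunk):
        yield torch.randn(min(chunk, n_steps - start), llm_dim)

decoder = UnitDecoder()
for states in llm_hidden_state_chunks():
    units = decoder(states)
    # A real system would hand these units to a vocoder immediately, so audio
    # playback starts long before the full text response is finished.
    print("emit units:", units.tolist())
```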
The supporting piece is the InstructS2S-200K dataset of 200K speech instructions paired with speech responses, which makes the alignment trainable. Without paired speech-to-speech instruction data, end-to-end training has nothing to optimize. The general principle: when latency dominates user experience, the right intervention is to remove pipeline stages, not to optimize each stage in isolation.
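A back-of-envelope comparison makes the principle concrete. The stage latencies below are made up purely for illustration; the point is that time to first audio in a cascade is a serialized sum, so dropping a term outright beats shaving every term.

```python
# Hypothetical time-to-first-output per stage, in milliseconds (illustrative only).
asr, llm, tts = 300, 200, 150

cascade           = asr + llm + tts            # serialized sum: 650 ms before any audio
each_20pct_faster = 0.8 * (asr + llm + tts)    # optimize every stage by 20%: 520 ms
asr_removed       = llm + tts                  # remove the transcription stage: 350 ms

print(cascade, each_20pct_faster, asr_removed)
```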
Source: Speech Voice
Related concepts in this collection
- Why do dialogue systems need probabilistic reasoning?
  Explores whether deterministic flowchart-based dialogue systems can handle realistic speech recognition error rates of 15-30 percent, and what alternative approaches might be necessary.
  Contrasts: POMDPs absorb ASR noise probabilistically, while LLaMA-Omni removes the ASR stage entirely; the two represent compensate-for versus design-around responses to the same problem.
- Do speech models learn language-specific sounds or universal physics?
  Explores whether self-supervised speech models encode phonetic categories tied to specific languages or instead capture the underlying vocal-tract physics common to all humans. This matters for understanding why these models transfer across languages without retraining.
  Supports: gives a principled reason why bypassing transcription should help; the speech encoder carries articulatory structure that text loses.
- What speech tasks remain without standardized benchmarks?
  Speech evaluation has strong benchmarks for transcription and translation, but broader comprehension and reasoning tasks over audio lack standardized measurement. This gap may constrain which capabilities researchers prioritize building.
  Extends: same Voxtral/Speech Voice cluster; transcription-centric benchmarks impede evaluation of exactly the speech-to-speech capabilities LLaMA-Omni demonstrates.
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  Extends: both reduce user-perceived latency by relocating where compute happens; LLaMA-Omni removes a pipeline stage at response time, while sleep-time compute does the work in advance, before the question arrives.
Original note title: eliminating speech transcription enables 226 millisecond response latency — direct speech-to-speech generation collapses the cascade that ASR-LLM-TTS pipelines impose