Conversational AI Systems · LLM Reasoning and Architecture

Can skipping transcription make voice assistants faster?

Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?

Note · 2026-05-03 · sourced from Speech Voice

Conventional voice assistants run a three-stage pipeline: automatic speech recognition (ASR) converts speech to text, an LLM generates a text response, and text-to-speech (TTS) renders that response as audio. Each stage adds latency and propagates errors, and the total response time is dominated by stage-by-stage processing rather than by the LLM's reasoning. LLaMA-Omni eliminates the transcription step entirely. It integrates a pretrained speech encoder, a speech adaptor, an LLM (built on Llama-3.1-8B-Instruct), and a streaming speech decoder so that the system generates text and speech responses directly from speech instructions, achieving a response latency as low as 226 milliseconds.
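To see why the cascade's latency compounds, here is a toy latency model, a minimal sketch with invented stage durations rather than measurements from the paper: in the cascade, each stage must consume its predecessor's complete output before any audio can be produced, whereas a direct speech-to-speech model can start vocoding as soon as the LLM emits its first chunk of output.

```python
# Illustrative latency arithmetic only; all stage durations below are
# invented placeholders, not measurements from LLaMA-Omni or any system.

def cascade_first_audio_ms(asr_ms, llm_full_ms, tts_first_chunk_ms):
    """Cascade: ASR must finish, the LLM must produce the full text reply,
    and only then can TTS begin rendering the first audio chunk."""
    return asr_ms + llm_full_ms + tts_first_chunk_ms

def direct_first_audio_ms(encode_ms, llm_first_chunk_ms, vocoder_chunk_ms):
    """Direct speech-to-speech: the streaming speech decoder produces audio
    as soon as the LLM emits its first chunk of response output."""
    return encode_ms + llm_first_chunk_ms + vocoder_chunk_ms

if __name__ == "__main__":
    cascade = cascade_first_audio_ms(asr_ms=300, llm_full_ms=900,
                                     tts_first_chunk_ms=400)
    direct = direct_first_audio_ms(encode_ms=60, llm_first_chunk_ms=150,
                                   vocoder_chunk_ms=60)
    print(f"cascade time-to-first-audio: ~{cascade} ms")
    print(f"direct  time-to-first-audio: ~{direct} ms")
```

The specific numbers are irrelevant; the structural point is that any stage which must see its entire input before producing output adds its full duration to time-to-first-audio, and the cascade has three such stages in series.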

The architectural lesson is that transcription is not a free intermediate representation: it is a serialization step that destroys prosodic information and forces full-utterance processing before generation can begin. By passing speech embeddings directly into the LLM via an adaptor, LLaMA-Omni preserves the acoustic information the LLM might use, including the articulatory substrate identified in "Do speech models learn language-specific sounds or universal physics?", and lets generation begin as soon as enough of the input has been encoded, without waiting for a complete textual transcript.
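A minimal PyTorch sketch of that adaptor idea, assuming a concatenate-and-project downsampler: frame-level encoder features are compressed in time, projected into the LLM's embedding space, and consumed in place of a transcript. The dimensions and downsampling factor here are illustrative placeholders, not LLaMA-Omni's published configuration.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Map speech-encoder frames into the LLM's embedding space.

    Illustrative only: concatenates every k consecutive frames (temporal
    downsampling) and projects them with a small MLP. All dimensions are
    placeholders, not the paper's exact configuration.
    """
    def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
        super().__init__()
        self.k = k
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_frames):           # (batch, T, enc_dim)
        b, t, d = enc_frames.shape
        t = (t // self.k) * self.k            # drop remainder frames
        x = enc_frames[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(x)                   # (batch, T // k, llm_dim)

# Usage: the resulting embeddings are placed into the LLM's input sequence
# alongside the prompt embeddings; no text transcript is ever materialized.
frames = torch.randn(1, 150, 1280)            # a few seconds of encoder output
speech_embeds = SpeechAdaptor()(frames)
print(speech_embeds.shape)                    # torch.Size([1, 30, 4096])
```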

The supporting piece is the InstructS2S-200K dataset of 200K speech instructions paired with speech responses, which makes the alignment trainable. Without paired speech-to-speech instruction data, end-to-end training has nothing to optimize. The general principle: when latency dominates user experience, the right intervention is to remove pipeline stages, not to optimize each stage in isolation.
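To make "paired speech-to-speech instruction data" concrete, one plausible layout for a single training example is sketched below; the field names and values are hypothetical illustrations, not the actual InstructS2S-200K schema.

```python
from dataclasses import dataclass

@dataclass
class SpeechInstructionExample:
    """One paired example for end-to-end training. Field names are
    hypothetical; they are not taken from the released dataset files."""
    instruction_audio_path: str       # spoken user instruction (input speech)
    instruction_text: str             # text form of the same instruction
    response_text: str                # target text response
    response_speech_units: list[int]  # discrete units for the target speech response

example = SpeechInstructionExample(
    instruction_audio_path="instructions/000001.wav",
    instruction_text="What's a quick way to warm up before a run?",
    response_text="Try two minutes of brisk walking, then light leg swings.",
    response_speech_units=[312, 17, 908, 44],  # truncated placeholder units
)
```

The key property is that both sides of each pair exist in speech form, so the loss can be computed on speech outputs directly instead of on an intermediate transcript.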


Source: Speech Voice
