What information does transcription destroy that direct speech-to-speech models preserve?
This explores what gets lost when speech is first converted to text before processing — the acoustic and paralinguistic signal that direct speech-to-speech models keep intact.
The loss happens in the act of transcription itself, before any reasoning begins. The corpus has a direct anchor: systems like LLaMA-Omni skip the speech-to-text step entirely and generate spoken responses straight from speech embeddings, reaching 226 ms latency (see "Can skipping transcription make voice assistants faster?"). The speed is a side effect; the deeper point is *why* it works: speech embeddings carry acoustic information that text simply has no symbols for. Transcription is a lossy projection from a rich continuous signal onto a discrete token stream, and everything that doesn't survive that projection is gone.
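A minimal toy sketch of that projection, in Python. Everything here is an illustrative assumption: the `Utterance` type, its fields, and the `transcribe` function are hypothetical stand-ins, not a real ASR API and not how LLaMA-Omni represents speech, and the pitch and duration numbers are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a spoken utterance (all fields are illustrative)."""
    words: tuple[str, ...]   # the lexical content
    mean_pitch_hz: float     # crude proxy for speaker and intonation
    duration_s: float        # crude proxy for timing and hesitation


def transcribe(u: Utterance) -> str:
    """Hypothetical transcription: keeps the words, drops everything else."""
    return " ".join(u.words)


# Same words, very different delivery (made-up values).
calm = Utterance(("i", "am", "fine"), mean_pitch_hz=110.0, duration_s=0.9)
strained = Utterance(("i", "am", "fine"), mean_pitch_hz=240.0, duration_s=2.4)

# The two utterances are distinct as acoustic objects...
distinct_before = calm != strained
# ...but indistinguishable once projected onto text.
same_after = transcribe(calm) == transcribe(strained)

print(distinct_before, same_after)
```

The mapping is many-to-one, so no function of the transcript alone can tell `calm` from `strained`. A model that consumes the richer representation directly never has to undo that.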
What exactly is in that discarded layer? Work on self-supervised speech models gives a striking answer: these models don't learn language-specific phonetic categories so much as infer the causal articulatory physics, that is, how a vocal tract actually moves to produce sound (see "Do speech models learn language-specific sounds or universal physics?"). That's a generative, embodied process. Prosody, timing, emphasis, emotional coloring, speaker identity, hesitation: all of it lives in the acoustics and none of it has a clean home in a transcript. Once you write speech down as words, you've kept the *what* and deleted most of the *how it was said*.
There's a sharper, less obvious loss too. Text isn't a neutral container; it actively homogenizes. Research on "Adam's Law" shows that distinct prompts get flattened toward the high-frequency forms a language model handles best, which filters out distinctiveness on the input side (see "Does high-frequency text homogenize user input before generation?"). Transcription can be read as an earlier and more aggressive instance of the same flattening: the same word from two very different speakers, in two very different emotional registers, collapses to one identical string. Speech-to-speech models never force that collapse, so the individuating texture of an utterance survives into the model's working representation.
It's worth noticing the tension with the compression view of language modeling, where text-trained models act as strong lossless compressors and even beat specialized tools such as FLAC and PNG on audio and images (see "Can text-trained models compress images better than specialized tools?"). That losslessness is about how efficiently a model encodes a signal it's *given*. It says nothing about the signal destroyed upstream when continuous audio was forced through a phonetic-to-orthographic bottleneck. The compression happens after the amputation.
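The ordering can be sketched in a few lines of Python. This is a toy under stated assumptions: `zlib` stands in for a learned lossless compressor (it is not the method used in the compression work cited above), `transcribe` is a hypothetical placeholder rather than a real ASR call, and the pitch values are invented.

```python
import zlib

# Hypothetical toy utterance: words plus per-word pitch (made-up numbers).
utterance = {
    "words": ["are", "you", "sure"],
    "pitch_hz": [180.0, 200.0, 260.0],  # rising contour, e.g. a question
}


def transcribe(u: dict) -> str:
    """Lossy step: keep the words only (illustrative, not a real ASR call)."""
    return " ".join(u["words"])


transcript = transcribe(utterance)

# Lossless step: zlib round-trips its input byte for byte.
compressed = zlib.compress(transcript.encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")

# The compressor recovers everything it was given...
transcript_recovered = restored == transcript
# ...but the pitch contour never reached it, so it cannot come back.
pitch_recoverable = "260" in restored

print(transcript_recovered, pitch_recoverable)
```

However good the compressor is, its guarantee covers only its own input. The pitch contour was dropped one step earlier, so perfect reconstruction of the transcript leaves it just as missing.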
If you want to follow this thread further, the orality-versus-literacy framing in the corpus is a fertile doorway: transformer knowledge has been described as flowing performance rather than stored archive, closer to oral culture than written record (see "Do transformer models store knowledge or generate it continuously?"). Direct speech models, in that light, aren't just faster pipelines. They keep language in its performed, embodied form rather than freezing it into the written abstraction we long mistook for the thing itself.
Sources (5 notes)
LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
Adam's Law shows LLMs flatten distinct prompts at comprehension time as users rephrase toward higher-frequency forms the model handles best. The same distributional property that creates accuracy on common tasks filters out distinctiveness on the input side.
Chinchilla models trained exclusively on text achieve better compression rates on images and audio than FLAC and PNG by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.