What information does transcription destroy that direct speech-to-speech models preserve?

This explores what gets lost when speech is first converted to text before processing — the acoustic and paralinguistic signal that direct speech-to-speech models keep intact.

The corpus has a direct anchor: systems like LLaMA-Omni skip the speech-to-text step entirely and generate spoken responses straight from speech embeddings, reaching 226ms latency (Can skipping transcription make voice assistants faster?). The speed is a side effect; the deeper point is *why* it works: speech embeddings carry acoustic information that text has no symbols for. Transcription is a lossy projection from a rich continuous signal onto a discrete token stream, and whatever does not survive that projection is unavailable to everything downstream.

What exactly is in that discarded layer? Work on self-supervised speech models gives a striking answer: these models do not so much learn language-specific phonetic categories as infer the causal articulatory physics, that is, how a vocal tract actually moves to produce sound (Do speech models learn language-specific sounds or universal physics?). That is a generative, embodied process. Prosody, timing, emphasis, emotional coloring, speaker identity, hesitation: all of it lives in the acoustics, and none of it has a clean home in a transcript. Once you write speech down as words, you've kept the *what* and deleted most of the *how it was said*.

There's a sharper, less obvious loss too. Text isn't a neutral container; it actively homogenizes. Research on "Adam's Law" shows that language models flatten distinct inputs toward the high-frequency forms the model handles best, filtering out distinctiveness on the input side (Does high-frequency text homogenize user input before generation?). Transcription is the first and most aggressive instance of this: the same word from two very different speakers, in two very different emotional registers, collapses to one identical string. Speech-to-speech models never force that collapse, so the individuating texture of an utterance survives into the model's working representation.
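The collapse described above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the `Utterance` record, its prosodic fields, and the `transcribe` function are hypothetical stand-ins, not how LLaMA-Omni or any real ASR system represents speech. The only point is that transcription is a many-to-one map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical stand-in for speech: words plus a few prosodic features."""
    words: tuple
    speaker_id: str
    mean_pitch_hz: float
    speech_rate_wps: float   # words per second
    pause_before_s: float    # hesitation before speaking


def transcribe(u: Utterance) -> str:
    """Toy 'transcription': keep the words, drop every other field."""
    return " ".join(u.words)


# Same words, very different delivery (values are made up for illustration).
calm = Utterance(("i", "am", "fine"), "speaker_a", 110.0, 2.5, 0.1)
strained = Utterance(("i", "am", "fine"), "speaker_b", 240.0, 4.0, 1.2)

same_text = transcribe(calm) == transcribe(strained)   # transcripts are identical
distinct_signal = calm != strained                     # the utterances are not
print(same_text, distinct_signal)  # True True
```

A text-only pipeline downstream of `transcribe` sees one string for both utterances, so nothing it does later can tell them apart.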

It's worth noticing the tension with the compression view of language modeling, where text-trained models act as near-lossless compressors and even beat specialized tools on images and audio (Can text-trained models compress images better than specialized tools?). That losslessness is about how efficiently a model encodes a signal it's *given*; it says nothing about the signal destroyed upstream when continuous audio was forced through a phonetic-to-orthographic bottleneck. The compression happens after the amputation.
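A minimal sketch of "compression after the amputation", under loud assumptions: `zlib` is swapped in as a generic lossless compressor (it is not a language-model compressor like the one in the cited work), and the `prosody` dictionary and `transcribe` function are hypothetical. The sketch only shows that a perfect round trip of the transcript recovers nothing that was dropped before compression began.

```python
import zlib


def transcribe(words, prosody):
    """Toy 'transcription': the prosody argument is accepted and then ignored."""
    return " ".join(words)


words = ["i", "am", "fine"]
prosody = {"mean_pitch_hz": 240.0, "pause_before_s": 1.2}  # made-up values

text = transcribe(words, prosody)

# Lossless round trip of the signal the compressor was *given*.
packed = zlib.compress(text.encode("utf-8"))
restored = zlib.decompress(packed).decode("utf-8")

text_recovered = restored == text
# Nothing about the prosody appears anywhere in the restored data.
prosody_recoverable = any(str(v) in restored for v in prosody.values())
print(text_recovered, prosody_recoverable)  # True False
```

The round trip is exact, yet the pitch and pause values are absent from it: losslessness is measured against the input to the compressor, not against the original audio.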

If you want to follow this thread further, the orality-versus-literacy framing in the corpus is a fertile doorway: transformer knowledge has been described as flowing performance rather than stored archive, closer to oral culture than written record (Do transformer models store knowledge or generate it continuously?). Direct speech models, in that light, aren't just faster pipelines; they keep language in its performed, embodied form rather than freezing it into the written abstraction we long mistook for the thing itself.

Sources (5 notes)

Can skipping transcription make voice assistants faster?

LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Does high-frequency text homogenize user input before generation?

Adam's Law shows LLMs flatten distinct prompts at comprehension time as users rephrase toward higher-frequency forms the model handles best. The same distributional property that creates accuracy on common tasks filters out distinctiveness on the input side.

Can text-trained models compress images better than specialized tools?

Chinchilla models trained exclusively on text achieve better compression rates on images and audio than FLAC and PNG by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a speech-language-modeling researcher. The question remains open: What information does transcription destroy that direct speech-to-speech models preserve?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable constraints unless re-tested.
• Self-supervised speech models infer causal articulatory physics (vocal-tract kinematics), not just phonetic categories — information text has no symbols for (~2023–2024, arXiv:2310.10788).
• Direct speech-to-speech systems (e.g., LLaMA-Omni) achieve 226ms latency by skipping transcription, preserving acoustic embeddings through the model; prosody, timing, emotion, speaker identity, hesitation vanish in text-only pipelines (~2024, arXiv:2409.06666).
• Language models homogenize distinct inputs toward high-frequency forms; transcription is the "first and most aggressive" collapse point, erasing individuating texture (~2026, arXiv:2604.02176).
• Text-trained models achieve near-lossless compression of *given* signals; this says nothing about the signal destroyed upstream when continuous audio was forced through a phonetic-to-orthographic bottleneck (~2023, arXiv:2309.10668).

Anchor papers (verify; mind their dates):
• arXiv:2310.10788 (2023) — Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
• arXiv:2409.06666 (2024) — LLaMA-Omni: Seamless Speech Interaction with Large Language Models
• arXiv:2604.02176 (2026) — Adam's Law: Textual Frequency Law on Large Language Models
• arXiv:2309.10668 (2023) — Language Modeling is Compression

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer speech-to-speech architectures, multi-modal embeddings, improved speech SSL models, or evaluation suites have since preserved or recovered prosodic/emotional/speaker information *within* text-based pipelines (e.g., phonetic markup, special tokens, layered representations). Separate the durable question (what is *theoretically* lost in discrete tokenization?) from the perishable limitation (can modern systems practically recover it?). Cite what resolved or deepened the constraint.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any system shown that transcription + downstream processing recovers the losses catalogued above, or proven the losses irrelevant to downstream tasks?

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., can fine-grained acoustic token vocabularies + retrieval restore what traditional ASR erases? Do multi-modal embeddings that preserve speech structure but use text-based reasoning offer a middle ground?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
