Can dense models partially address modality friction without full expert specialization?
This explores whether ordinary dense models can take some of the edge off 'modality friction' — the mismatch that shows up when text-trained reasoning gets applied to vision, perception, or physical grounding — using lightweight internal tricks rather than building out dedicated expert modules or mixture-of-experts specialization.
The corpus says: partially, yes. But where the friction actually lives matters, and dense models have some surprisingly capable built-in machinery that looks like specialization without being separate experts.
Start with the optimistic side. Dense models already do something specialization-like internally. Their hidden states sparsify adaptively when a task drifts out-of-distribution, narrowing to a localized subset of activations that stabilizes performance rather than breaking down (see Do language models sparsify their activations under difficult tasks?). And distinct behaviors turn out to occupy distinct, steerable directions in activation space: a single vector extracted from 50 paired examples cuts chain-of-thought length by about two-thirds with no retraining (see Can we steer reasoning toward brevity without retraining?). These are cheap, dense-model-native ways of routing capacity toward a task. They fall well short of composing dedicated expert vectors at inference the way singular-value tuning does (see Can models dynamically activate expert skills at inference time?), yet they show a dense model can reallocate itself without a separate expert per skill.
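To make the steering idea concrete, here is a minimal sketch of one common way to build a steering vector: take the difference between the mean activation on concise examples and the mean activation on verbose ones, then add a scaled copy of that direction to a hidden state. The source notes do not say how the steering vector is actually computed, so the difference-of-means recipe, the function names, and the toy three-dimensional "activations" are all assumptions for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(r) != dim for r in rows):
        raise ValueError("all activation vectors must have the same length")
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def steering_vector(verbose_acts: Sequence[Sequence[float]],
                    concise_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: a direction pointing from 'verbose' to 'concise'.

    Assumption: this is a generic recipe, not the cited method's exact one.
    """
    mu_verbose = mean_vector(verbose_acts)
    mu_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def apply_steering(hidden: Sequence[float],
                   direction: Sequence[float],
                   alpha: float = 1.0) -> Vector:
    """Shift one hidden state along the direction; no weights are changed."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations, made up purely for illustration.
verbose = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise = [[1.0, 1.0, 0.0], [3.0, 1.0, 0.0]]

v = steering_vector(verbose, concise)
steered = apply_steering([2.0, 0.0, 2.0], v, alpha=0.5)
```

In a real model the activations would come from a chosen layer on paired prompts, and `alpha` (a hypothetical strength parameter here) would need tuning. The point of the sketch is only that the intervention is a vector addition at inference time, which is why it needs no retraining.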
But here's the catch the corpus keeps surfacing: modality friction often isn't a routing problem at all. When verbose chain-of-thought gets bolted onto multimodal perception tasks, it actively hurts, because the real bottleneck is how visual attention is allocated, not how much the model verbalizes. Text-token reasoning optimizes the wrong policy target entirely (see Does verbose chain-of-thought actually help multimodal perception tasks?). So the lightweight interventions that work beautifully for text-reasoning friction can backfire on perception, because they're operating in the wrong space.
And underneath that sits a harder limit. Text itself is a lossy abstraction: it strips out the physics, geometry, and causality present in reality, leaving the model manipulating symbols without grounding in their source dynamics (see Are text-only language models fundamentally limited by abstraction?). No amount of clever activation steering inside a dense text model recovers information that was never encoded. That's friction you can't patch internally; it's friction from the modality boundary itself, which that note suggests multimodal training could address. The systematic structural blind spots dense models show even within language, failing predictably as syntactic depth increases (see Why do large language models fail at complex linguistic tasks?), are a warning that surface-pattern machinery has real ceilings.
So the honest answer is split by where the friction comes from. For friction that's really a reallocation problem, meaning directing existing capacity toward an unfamiliar task, dense models can self-adjust cheaply, and even structured fine-tuning (knowledge-graph curricula beating raw scale; see Can knowledge graphs teach models deep domain expertise?) offers a non-MoE route to depth. For friction that's a grounding or perception problem, the dense model can't talk its way across the gap, and the partial fixes can make things worse. The thing you didn't know you wanted to know: 'modality friction' isn't one phenomenon. It's at least two, and dense models quietly solve one of them while being structurally blocked from the other.
Sources (7 notes)
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.