Why do multimodal models fail on rare and underrepresented concepts?
This note explores why vision-language models stumble on concepts that rarely appeared during training. The corpus suggests the answer is less about 'rarity' as a property of the world and more about frequency as the hidden engine of what looks like understanding.
The most direct answer in the corpus is uncomfortable: what we call 'zero-shot' performance is mostly a frequency effect in disguise. Across 34 models and 5 datasets, downstream accuracy tracks how often a test concept showed up in pretraining, and getting linear gains requires *exponentially* more data. Rare concepts therefore aren't a small gap to close; they sit on the far end of a brutally diminishing curve (see: Does multimodal zero-shot performance actually generalize or interpolate?). The model isn't generalizing to the rare concept; it never had enough exposure to interpolate it in the first place.
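To make the "exponentially more data for linear gains" relationship concrete, here is a minimal sketch that assumes accuracy grows linearly in the base-10 logarithm of concept frequency. The intercept, the slope, and the function names are illustrative assumptions, not values or code from the cited study.

```python
import math

# Hypothetical log-linear fit: accuracy = INTERCEPT + SLOPE * log10(frequency).
# Both constants are made up for illustration, not taken from the cited study.
INTERCEPT = 0.10
SLOPE = 0.15


def predicted_accuracy(concept_frequency: int) -> float:
    """Predicted accuracy for a concept seen `concept_frequency` times in pretraining."""
    raw = INTERCEPT + SLOPE * math.log10(concept_frequency)
    return max(0.0, min(1.0, raw))  # clamp to a valid accuracy range


def frequency_needed(target_accuracy: float) -> float:
    """Invert the fit: how many pretraining occurrences reach `target_accuracy`."""
    return 10 ** ((target_accuracy - INTERCEPT) / SLOPE)


if __name__ == "__main__":
    for freq in (10, 100, 1_000, 10_000):
        print(f"{freq:>6} occurrences -> accuracy {predicted_accuracy(freq):.2f}")
```

Under these assumed constants, every tenfold increase in occurrences buys the same fixed accuracy increment, which is why a concept seen ten times sits so far from one seen ten thousand times.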
This isn't unique to images. Language models show the same tell: given two paraphrases that mean exactly the same thing, models systematically prefer the higher-frequency surface form, across math, translation, commonsense, and tool-calling tasks (see: Do language models really understand meaning or just surface frequency?). The mechanism underneath both is autoregression: these are probability machines, and you can predict in advance that low-probability targets will be hard even when they're logically trivial (see: Can we predict where language models will fail?). Rare concepts are, almost by definition, low-probability targets. So 'failure on rare concepts' is partly just the visible edge of a statistical-mass mechanism that's always running.
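The surface-frequency preference can be illustrated with a toy unigram model. This is only a sketch: real language models condition on context, and the token counts below are invented for the example.

```python
import math
from collections import Counter

# Invented token counts standing in for a pretraining corpus (an assumption).
TOY_COUNTS = Counter({
    "the": 1000, "sum": 120, "of": 800, "two": 300, "and": 900,
    "three": 250, "is": 700, "five": 200, "aggregate": 3, "equals": 40,
})
TOTAL = sum(TOY_COUNTS.values())


def log_prob(sentence: str) -> float:
    """Sum of unigram log-probabilities, with add-one smoothing for unseen tokens."""
    vocab_size = len(TOY_COUNTS) + 1  # one extra slot for unseen tokens
    return sum(
        math.log((TOY_COUNTS[token] + 1) / (TOTAL + vocab_size))
        for token in sentence.lower().split()
    )


# Two paraphrases with the same meaning and the same number of tokens.
common = "the sum of two and three is five"
rare = "the aggregate of two and three equals five"

if __name__ == "__main__":
    print(f"common phrasing: {log_prob(common):.2f}")
    print(f"rare phrasing:   {log_prob(rare):.2f}")
```

Because the two sentences differ only in "sum"/"aggregate" and "is"/"equals", the rarer wording scores lower purely from token statistics, even though the meaning is unchanged.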
But frequency isn't the whole story, and this is where it gets interesting. One line of work argues the failure isn't really about complexity or even rarity per se, but about *instance-level unfamiliarity*: models fit patterns tied to specific seen instances rather than learning a transferable rule, so a reasoning chain succeeds only when it resembles something already in the data (see: Do language models fail at reasoning due to complexity or novelty?). That reframes 'rare concept' as 'concept whose nearest training neighbor is too far away.' And the breakdown can be stranger than a clean miss: models can correctly *explain* a concept they then fail to *apply*, and even recognize their own failure. That pattern suggests explanation and execution run on disconnected pathways, rather than a single coherent 'understanding' that's simply thin for rare items (see: Can LLMs understand concepts they cannot apply?).
There are also failure modes specific to the multimodal seam. Vision and language can actively *compete* for a model's fixed capacity, but the corpus argues this is architectural rather than fundamental: it comes from caption distribution shift and rigid dense capacity allocation, and Mixture-of-Experts routing largely dissolves it by allocating capacity per token (see: Can we solve modality competition through architectural design?). Relatedly, the instinct to throw more reasoning at hard cases backfires for perception: verbose chain-of-thought helps abstract reasoning but *degrades* fine-grained visual tasks, because the real bottleneck on a rare visual concept is where the model looks, not how much it talks (see: Does verbose chain-of-thought actually help multimodal perception tasks?). And even when you hand the model the right information in context, strong pretraining associations can override it, so a rare concept gets steamrolled by a more frequent prior the model 'already knows' (see: Why do language models ignore information in their context?).
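The per-token capacity allocation attributed to Mixture-of-Experts above can be sketched as top-1 routing. This is a toy under stated assumptions: the router weights are hand-set rather than learned, each token is reduced to two hypothetical features, and nothing here reproduces the architecture of the cited work.

```python
from typing import List

# Hypothetical router weights: one row per expert, one column per token feature.
# Assumption: each token has 2 features, [is_image_patch, is_text_token].
ROUTER_WEIGHTS: List[List[float]] = [
    [2.0, -1.0],  # expert 0 scores image-like tokens higher
    [-1.0, 2.0],  # expert 1 scores text-like tokens higher
]


def route_token(features: List[float]) -> int:
    """Return the index of the highest-scoring expert for one token (top-1 routing)."""
    scores = [
        sum(w * x for w, x in zip(expert_row, features))
        for expert_row in ROUTER_WEIGHTS
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def route_sequence(tokens: List[List[float]]) -> List[int]:
    """Route every token independently, so each token picks its own expert."""
    return [route_token(t) for t in tokens]


mixed_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # image, text, image
assignments = route_sequence(mixed_sequence)

if __name__ == "__main__":
    print(assignments)
```

The point of the sketch is only that the routing decision is made per token: image-like and text-like tokens in the same sequence can land on different experts instead of sharing one dense block.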
The payoff worth taking away: the most promising fixes in this corpus don't try to make the rare concept more frequent; they route around the frequency dependence entirely. Instead of training a recognition model on rare classes, describe the unknown image in natural language and *retrieve* a match from a text-indexed database; the description bridges the gap that raw embedding similarity can't (see: Can describing images in text improve zero-shot recognition?). That's the quiet lesson under all of this: if a model's competence is really a frequency table, the way to handle the long tail isn't a bigger table but an architecture that can look something up.
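The describe-then-retrieve idea can be sketched in a few lines. Everything named here is hypothetical: `describe_image` is a stub standing in for a vision-language model call, the three index entries are invented, and Jaccard token overlap stands in for whatever retrieval method a real system would use.

```python
from typing import Dict, Set

# Hypothetical text-indexed database: label -> natural-language description.
TEXT_INDEX: Dict[str, str] = {
    "stop_sign": "red octagon with white border and white text",
    "yield_sign": "white triangle pointing down with red border",
    "speed_limit": "white rectangle with black border and black numbers",
}


def describe_image(image_path: str) -> str:
    """Stand-in for a vision-language model call; returns a canned description."""
    # Assumption: a real system would send the image to a captioning model here.
    return "a red octagon sign with white text"


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(description: str) -> str:
    """Return the label whose indexed description has the highest Jaccard overlap."""
    query = tokens(description)

    def jaccard(label: str) -> float:
        candidate = tokens(TEXT_INDEX[label])
        return len(query & candidate) / len(query | candidate)

    return max(TEXT_INDEX, key=jaccard)


match = retrieve(describe_image("unknown_sign.png"))

if __name__ == "__main__":
    print(match)
```

Adding a new rare class in this setup means adding one line of text to the index, with no recognition model to retrain, which is the design choice the paragraph above argues for.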
Sources (9 notes)
Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as the backwards alphabet and letter counting.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
SignRAG demonstrates that describing an unknown image via vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.