Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or additional answer choices, the last hidden states of LLMs become substantially sparser. This sparsity–difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces of the last hidden state. Through a series of controlled analyses, supported by a learning-dynamics explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD shift. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance improvements.
We identify sparsity, the property where a high-dimensional representation is dominated by a small subset of active units, as a promising candidate for this signal. Sparsity is a pervasive phenomenon that has been extensively discussed as evidence for specialization or modularity, and it has been examined in LLMs in connection with intrinsic dimensionality. However, most interpretability work treats sparsity as a largely static background property, rather than as an explanatory variable that changes systematically with task conditions and can therefore explain differences in behavior. In this work, we make progress on these questions by uncovering a robust connection between representational sparsity and task difficulty. Concretely, as the difficulty of a prompted task increases, the model's representations become systematically sparser.
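To make "dominated by a small subset of active units" concrete, the sketch below scores a single hidden-state vector with the Hoyer sparsity measure. This is an illustrative choice only: the text above does not state which sparsity metric the authors use, and the function name `hoyer_sparsity` and the toy vectors are assumptions made for this example.

```python
import math


def hoyer_sparsity(vec):
    """Hoyer sparsity of a vector, in [0, 1].

    Returns 0 when all entries have equal magnitude (fully dense) and 1 when
    exactly one entry is non-zero (maximally sparse).
    NOTE: an illustrative metric, not necessarily the one used in the paper.
    """
    n = len(vec)
    if n < 2:
        raise ValueError("need at least two dimensions")
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if l2 == 0.0:
        raise ValueError("sparsity is undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Toy stand-ins for last hidden states (hypothetical values).
dense_state = [1.0, -1.0, 1.0, -1.0]   # every unit equally active
sparse_state = [0.0, 0.0, 5.0, 0.0]    # a single dominant unit

print(hoyer_sparsity(dense_state), hoyer_sparsity(sparse_state))
```

Under this measure, the claim in the text would correspond to the score of the last hidden state rising as the prompted task becomes harder.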
We identify a fundamental link between representation density and data familiarity. Our analysis suggests that high activation density is a learned attribute: as models master training data, they consolidate representations. Importantly, this trend already emerges during pretraining, without any task-specific fine-tuning, suggesting it is a general property of learned representations rather than a downstream artifact. Conversely, sparsity serves as the intrinsic default state for harder or less familiar inputs. This positions sparsity as a candidate organizing principle for studying how internal computation adapts under increased reasoning demands in the language model.
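If sparsity tracks difficulty, one simple way to use it is as a proxy for ordering few-shot demonstrations. The sketch below sorts demonstrations from least to most sparse, an easy-to-hard curriculum. It is only a guess at the general idea behind SG-ICL: the text does not describe the actual scheduling rule, and `order_demonstrations`, its parameters, and the example scores are hypothetical.

```python
def order_demonstrations(demos, sparsity_scores, hardest_last=True):
    """Order few-shot demonstrations by a per-demonstration sparsity score.

    ASSUMPTION: higher sparsity is treated as a proxy for higher difficulty,
    and an easy-to-hard ordering is used by default. The scheduling rule
    actually used by SG-ICL is not specified in the text.
    """
    if len(demos) != len(sparsity_scores):
        raise ValueError("demos and sparsity_scores must have the same length")
    # Sort indices by score so that demonstrations need not be comparable.
    ranked = sorted(range(len(demos)), key=lambda i: sparsity_scores[i])
    if not hardest_last:
        ranked.reverse()
    return [demos[i] for i in ranked]


# Hypothetical demonstrations with made-up sparsity scores.
demos = ["hard example", "easy example", "medium example"]
scores = [0.9, 0.2, 0.5]

curriculum = order_demonstrations(demos, scores)
print(curriculum)
```

The scores would in practice come from a sparsity measure applied to the model's last hidden state for each demonstration; here they are supplied by hand to keep the example self-contained.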
In this work, we have established a fundamental connection between the internal representation geometry of Large Language Models and the difficulty of the tasks they face. Through a rigorous analysis across diverse models, benchmarks, and OOD settings, we validated the phenomenon that "the farther the shift, the sparser the representation." Our findings reveal that this sparsification is not a random artifact but a consistent, adaptive mechanism localized primarily in the final transformer layers, acting as a selective filter to stabilize reasoning under uncertainty. Ultimately, our study bridges the gap between mechanistic interpretability and the reasoning domain, offering a new perspective on how LLMs internalize complexity.
• Dynamic Allocation of Training Compute. Valuable tokens can be difficult to learn in a generalizable manner by training on them directly, as exemplified in Figure 1. Thinking augmentation breaks down complex tokens into smaller, more explainable steps, thereby effectively allocating more training compute to them. This is analogous to test-time scaling, but applied during training instead of inference. Empirical evidence in Section 4 shows that thinking trajectories tend to be longer for high-value domains and documents, which functions as a natural up-sampling mechanism.