Can representation sparsity order few-shot demonstrations effectively?
Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.
Once representational sparsity tracks task difficulty for a given model, sparsity itself becomes a usable signal for curriculum design. The paper "Farther the Shift, Sparser the Representation" operationalizes this as Sparsity-Guided Curriculum In-Context Learning (SG-ICL), which uses the sparsity of last-layer activations to schedule few-shot demonstrations in the prompt.
The mechanism: measure how sparse the model's last hidden states are when it processes each candidate few-shot example, then order the demonstrations so they progress from sparse (high difficulty for this model) to dense (low difficulty), or the reverse, depending on what the curriculum is meant to achieve. The paper reports considerable performance gains over random or naive ordering.
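The ordering step can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the sparsity measure (fraction of activations with magnitude below a threshold `eps`), the function names `activation_sparsity` and `order_by_sparsity`, and the toy hidden-state vectors are all assumptions made for the example. In practice the vectors would come from the model's last hidden layer.

```python
from typing import List, Sequence


def activation_sparsity(hidden: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of activations whose magnitude is below eps.

    Assumed sparsity measure for illustration; the paper may use a different one.
    """
    if not hidden:
        raise ValueError("hidden state vector is empty")
    near_zero = sum(1 for h in hidden if abs(h) < eps)
    return near_zero / len(hidden)


def order_by_sparsity(
    demos: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    sparse_first: bool = True,
) -> List[str]:
    """Return demos sorted by the sparsity of their hidden states.

    sparse_first=True puts the sparsest (hardest, on the note's reading) first;
    sparse_first=False gives the reverse, dense-to-sparse schedule.
    """
    if len(demos) != len(hidden_states):
        raise ValueError("need exactly one hidden-state vector per demonstration")
    scores = [activation_sparsity(h) for h in hidden_states]
    ranked = sorted(range(len(demos)), key=lambda i: scores[i], reverse=sparse_first)
    return [demos[i] for i in ranked]


# Toy data: hypothetical last-layer activations for three candidate demonstrations.
demos = ["easy", "medium", "hard"]
hidden_states = [
    [0.5, -0.2, 0.1, 0.3],  # no near-zero entries  -> sparsity 0.0
    [0.5, 0.0, 0.1, 0.0],   # two near-zero entries -> sparsity 0.5
    [0.0, 0.0, 0.9, 0.0],   # three near-zero       -> sparsity 0.75
]
sparse_to_dense = order_by_sparsity(demos, hidden_states)
dense_to_sparse = order_by_sparsity(demos, hidden_states, sparse_first=False)
```

Which direction works better is an empirical question for the task at hand, so the sketch exposes it as a flag rather than fixing one schedule.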
This is a model-internal curriculum signal. Most curriculum learning approaches require external difficulty labels — annotator effort, heuristics about problem features, or proxy measures like solution length. Sparsity sidesteps this entirely. The model itself reveals which examples are hard for it through how its representations respond. The curriculum can be tailored to the specific model being used rather than to some external notion of universal difficulty.
The technique generalizes across the in-context learning landscape. Anywhere few-shot prompting is used (classification, reasoning, agentic deployments), sparsity-derived ordering is available. The marginal cost is negligible: hidden states are computed in the forward pass regardless, and reading off their sparsity is a nearly free byproduct. The only requirement is access to the activations, which any white-box deployment provides.
For builders of LLM pipelines, this argues for instrumentation that exposes activation-sparsity statistics. The signal supports curriculum ordering, hard-example mining, confidence calibration, and likely other applications not yet identified. Sparsity is becoming a richer interpretability primitive than the static-property framing has suggested.
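As a rough illustration of what such instrumentation might look like, the sketch below logs a sparsity score per example and surfaces the sparsest ones as hard-example candidates. The class name `SparsityLog`, its methods, and the threshold-based sparsity measure are hypothetical choices for this example, not an existing library API or the paper's method.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Sequence


@dataclass
class SparsityLog:
    """Hypothetical per-example log of activation-sparsity statistics."""

    eps: float = 1e-3  # magnitude threshold below which an activation counts as zero
    scores: Dict[str, float] = field(default_factory=dict)

    def record(self, example_id: str, hidden: Sequence[float]) -> float:
        """Store and return the fraction of near-zero activations for one example."""
        if not hidden:
            raise ValueError("hidden state vector is empty")
        near_zero = sum(1 for h in hidden if abs(h) < self.eps)
        score = near_zero / len(hidden)
        self.scores[example_id] = score
        return score

    def hardest(self, k: int) -> List[str]:
        """IDs of the k sparsest examples (treated as hardest, per the note)."""
        return sorted(self.scores, key=self.scores.get, reverse=True)[:k]

    def mean_sparsity(self) -> float:
        """Average sparsity over everything recorded so far."""
        return mean(self.scores.values())


# Usage with toy activations standing in for real last-layer hidden states.
log = SparsityLog()
log.record("a", [0.5, 0.2, 0.1, 0.3])
log.record("b", [0.0, 0.0, 0.9, 0.0])
log.record("c", [0.5, 0.0, 0.1, 0.0])
hard_candidates = log.hardest(2)
```

The same log could feed curriculum ordering or a calibration heuristic; keeping the raw scores rather than only a ranking leaves those options open.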
The deeper template is that adaptive internal phenomena — sparsity here, attention concentration elsewhere, gradient magnitudes during training — can be operationalized as signals for system behavior once they are recognized as informative rather than incidental.
Related concepts in this collection
- Do language models sparsify their activations under difficult tasks?
  When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.
  same paper, the underlying phenomenon this method exploits
- Is representational sparsity learned or intrinsic to neural networks?
  Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.
  same paper, the developmental story
- Why do trajectories matter more than individual examples for in-context learning?
  Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
  adjacent: another structural requirement for effective ICL
Original note title
sparsity-guided curriculum in-context learning uses representation sparsity as a scheduling signal for few-shot demonstrations