
What reasoning features does each difficulty level reinforce?

When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.

Note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

Reward curves and advantage magnitudes tell you whether training is improving accuracy, but they are silent about what kind of reasoning is being reinforced. Reading RLVR (reinforcement learning with verifiable rewards) through a Temporal Sparse Autoencoder (T-SAE), which extracts sparse reasoning features from activations along the reasoning trajectory, exposes structure that the scalar signals hide. Difficulty does not just change how much the model learns; it changes which internal features get strengthened and which get suppressed.
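
To make "sparse features along the reasoning trajectory" concrete, here is a minimal sketch of a generic top-k sparse autoencoder encoder applied at each step of a trajectory and then averaged. It is not the T-SAE from the cited paper, whose temporal details this note does not describe. The names `encode_step` and `trajectory_features`, the weights, and the activations are all hypothetical toy values chosen for illustration.

```python
from typing import List

Vector = List[float]


def encode_step(activation: Vector, weights: List[Vector], bias: Vector, top_k: int) -> Vector:
    """Encode one activation vector as sparse features: ReLU(W x + b), then keep the top_k."""
    pre = [sum(w * a for w, a in zip(row, activation)) + b for row, b in zip(weights, bias)]
    relu = [max(0.0, p) for p in pre]
    # Keep the indices of the top_k largest features and zero out the rest.
    keep = set(sorted(range(len(relu)), key=lambda i: relu[i], reverse=True)[:top_k])
    return [v if i in keep else 0.0 for i, v in enumerate(relu)]


def trajectory_features(trajectory: List[Vector], weights: List[Vector], bias: Vector, top_k: int = 1) -> Vector:
    """Average the sparse feature activations over every step of one reasoning trajectory."""
    codes = [encode_step(a, weights, bias, top_k) for a in trajectory]
    return [sum(code[j] for code in codes) / len(codes) for j in range(len(bias))]


# Toy dictionary with three features over a two-dimensional activation space (assumed values).
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_bias = [0.0, 0.0, 0.0]
# Two steps of a made-up reasoning trajectory.
toy_trajectory = [[2.0, 1.0], [0.0, 3.0]]

profile = trajectory_features(toy_trajectory, toy_weights, toy_bias, top_k=1)
print(profile)  # [1.0, 1.5, 0.0]
```

The resulting per-trajectory profile is the kind of object that can be compared before and after training, which a scalar reward cannot be.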

The breakdown: easy problems mainly reinforce direct-answer and basic-computation features while actively suppressing deliberative-reasoning features. The model learns to shortcut because shortcutting works. Hard problems activate reasoning-related features, but those features pay off only on the rare successful trajectory, so most hard-sample updates do not consolidate them. Medium-difficulty problems provide a balanced signal, strengthening computation and multi-step reasoning features at once. The same accuracy gain can therefore correspond to opposite internal changes, depending on the difficulty of the data that produced it.
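
One way to picture this kind of audit is to bucket samples by difficulty and average the change in each feature's activation across training. The sketch below does that under stated assumptions: difficulty is approximated by a pre-training pass rate with hypothetical thresholds (0.8 and 0.2), the feature names are illustrative labels, and the numbers are synthetic values chosen to mirror the pattern described above, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, float]
Sample = Tuple[float, Features, Features]  # (pass_rate, features_before, features_after)


def difficulty_bucket(pass_rate: float, easy_min: float = 0.8, hard_max: float = 0.2) -> str:
    """Map a pass rate to a difficulty bucket; the thresholds are assumptions."""
    if pass_rate >= easy_min:
        return "easy"
    if pass_rate <= hard_max:
        return "hard"
    return "medium"


def feature_shift_by_difficulty(samples: List[Sample]) -> Dict[str, Features]:
    """Mean (after - before) activation per feature, grouped by difficulty bucket."""
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    counts: Dict[str, int] = defaultdict(int)
    for pass_rate, before, after in samples:
        bucket = difficulty_bucket(pass_rate)
        counts[bucket] += 1
        for name in before:
            sums[bucket][name] += after[name] - before[name]
    return {
        bucket: {name: total / counts[bucket] for name, total in feats.items()}
        for bucket, feats in sums.items()
    }


# Synthetic samples: positive shift = reinforced, negative shift = suppressed.
toy_samples: List[Sample] = [
    (0.9, {"direct_answer": 1.0, "deliberation": 2.0}, {"direct_answer": 2.0, "deliberation": 1.0}),
    (0.5, {"direct_answer": 1.0, "deliberation": 1.0}, {"direct_answer": 1.5, "deliberation": 1.5}),
    (0.1, {"direct_answer": 1.0, "deliberation": 1.0}, {"direct_answer": 1.0, "deliberation": 1.0}),
]

shifts = feature_shift_by_difficulty(toy_samples)
for bucket in ("easy", "medium", "hard"):
    print(bucket, shifts[bucket])
```

In this toy data the easy bucket raises `direct_answer` and lowers `deliberation`, the medium bucket raises both, and the hard bucket leaves both unchanged, even though nothing in an aggregate accuracy number would distinguish the three.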

Why it matters: benchmark improvement is an ambiguous summary statistic. Two RLVR runs can post similar accuracy gains while one has built multi-step reasoning machinery and the other has sharpened answer-shortcutting and let deliberation atrophy. The feature-level view is what distinguishes them, and it is the basis for difficulty-adaptive interventions that target feature consolidation directly (e.g., feature-guided training signals). The connection to interpretability work is direct: this is the same SAE-feature lens that lets you steer or read reasoning, now used to audit what a training regime is silently rewarding. The limitation is that T-SAE features are themselves a learned, imperfect decomposition, so the "reasoning feature" labels are interpretive, not ground truth.


— "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs", https://arxiv.org/abs/2605.28388

Original note title

different difficulty levels selectively reinforce or suppress distinct reasoning features invisible from advantage signals alone