Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Paper · arXiv 2605.28388
RL with Verifiable Rewards (RLVR) · Mechanistic Interpretability · Reinforcement Learning · Reasoning Architectures

Reinforcement Learning with Verifiable Reward (RLVR) has been shown empirically to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond the model's observable responses, we further analyze its internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

Large language models (LLMs) have achieved strong results across a variety of tasks, including mathematics, programming, and scientific reasoning. Much of this progress has been further amplified by post-training, which adapts pretrained models to elicit stronger reasoning and problem-solving behaviors. Among post-training techniques, Reinforcement Learning with Verifiable Reward (RLVR) has emerged as the dominant paradigm, spawning algorithmic innovations such as GRPO, DAPO, and VAPO, as well as extensions to diverse application areas and surprising empirical findings. However, most RLVR methods share a fundamental design choice: they treat samples uniformly regardless of difficulty. Under group-relative advantage normalization, only samples that induce mixed rollout outcomes provide meaningful relative-advantage signals. Extremely hard and extremely easy samples, whose rollouts are all incorrect or all correct, therefore contribute little direct learning signal.
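The point about mixed rollout outcomes can be illustrated with a minimal sketch of GRPO-style group-relative normalization. The function name `group_relative_advantages`, the group size of 8, the binary 0/1 rewards, and the `eps` stabilizer are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group.

    Hypothetical helper: subtracts the group mean and divides by the
    group (population) standard deviation, plus eps to avoid 0/0.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Eight rollouts per prompt, binary verifiable reward (assumed setup).
easy = group_relative_advantages([1, 1, 1, 1, 1, 1, 1, 1])   # all correct
hard = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0])   # all wrong
mixed = group_relative_advantages([1, 0, 1, 0, 0, 1, 1, 0])  # half correct

print("easy :", easy)   # every advantage is zero
print("hard :", hard)   # every advantage is zero
print("mixed:", mixed)  # nonzero, roughly +1 for correct and -1 for wrong
```

In this sketch the uniformly correct and uniformly incorrect groups produce all-zero advantages, so they contribute no gradient through the advantage term, while the mixed group does.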

This paper studies how sample difficulty shapes RLVR through one-sample dynamics. Specifically, we begin with controlled subset training and one-sample amplification experiments to characterize how samples of different difficulty levels affect reward dynamics, optimization behavior, and downstream reasoning performance. Our results reveal a non-monotonic effect of difficulty: medium-difficulty samples produce the strongest and most stable gains, whereas very easy and overly hard samples provide weak relative-advantage signals. Hard samples can be especially damaging: trajectories that are rewarded by accident, through shortcuts or incomplete reasoning, may be amplified by group-relative normalization, leading to biased updates that reinforce flawed reasoning patterns. This asymmetry is also reflected in the model's internal feature dynamics. To probe these dynamics, we introduce a Temporal Sparse Autoencoder (T-SAE) to extract sparse reasoning features from activations along reasoning trajectories.
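One way to see why a rare rewarded trajectory is amplified: assuming binary rewards and within-group standardization with the population standard deviation (no stabilizing constant), a group of size G with k correct rollouts has pass rate p = k/G, and each rewarded rollout receives advantage sqrt((1 - p) / p). The sketch below evaluates that closed form; the helper name `success_advantage` and the group size of 8 are hypothetical choices for illustration, not the paper's setup.

```python
import math

def success_advantage(num_correct, group_size):
    """Advantage given to each rewarded rollout in one group.

    Assumes binary rewards standardized by the group's population
    mean and standard deviation. Returns 0.0 for degenerate groups
    (none or all correct), where every advantage vanishes.
    """
    if not 0 < num_correct < group_size:
        return 0.0
    p = num_correct / group_size
    return math.sqrt((1 - p) / p)

# Fewer successes in the group means a larger push on each success.
for k in (1, 2, 4, 7):
    print(k, round(success_advantage(k, 8), 3))
# 1 2.646
# 2 1.732
# 4 1.0
# 7 0.378
```

Under these assumptions, a single success among eight rollouts on a hard problem is weighted about 2.6 times more heavily than a success in a half-correct group. The normalization cannot tell whether that lone success came from sound reasoning or from a shortcut, which is consistent with the biased-update concern described above.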

In this work, we presented a mechanistic study of how samples across the full difficulty spectrum shape RLVR training for LLMs. Through controlled experiments and one-sample amplification, we demonstrated that hard samples produce sparse, low-quality learning signals that destabilize optimization and degrade performance, while medium-difficulty samples yield the strongest data efficiency. Critically, we showed that sample informativeness is not a static property but depends on the dynamic interaction between task difficulty and the model's evolving capability. Moving beyond reward-level analysis, we employed Temporal Sparse Autoencoders (T-SAEs) to track the evolution of sparse feature activations throughout RL training, revealing that samples of different difficulty levels selectively reinforce or suppress distinct reasoning feature dynamics that are invisible from advantage signals alone. Guided by these mechanistic insights, we proposed two complementary interventions: a backward reasoning reformulation that converts hard samples into learnable inverse problems, and a T-SAE feature-based training signal that supports both token-level weighting and direct hard-sample rewards.