Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Paper · arXiv 2508.02150 · Published August 4, 2025
Reinforcement Learning

Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased cost and restricted accessibility. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision.

However, reasoning models exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Fig. 1 illustrates this phenomenon. Current approaches show a clear bias: instruction-tuned models excel at instruction following but lag in reasoning capabilities (Team, 2024), whereas reasoning models prioritize reasoning performance yet underperform on complex instruction-following tasks (Guo et al., 2025). This trade-off poses challenges for reasoning models in real-world applications that require both capabilities.

To address these challenges, we propose an efficient self-supervised RL framework that improves reasoning models' instruction-following capabilities without external supervision. First, to address sparse learning signals from challenging multi-constraint instructions (Yu et al., 2025), we decompose each multi-constraint instruction into a sequence of simpler instructions with incrementally increasing numbers of constraints, as sketched below. Second, to address soft constraints that require semantic understanding (Ren et al., 2025), we establish reward signals for soft constraints without any external supervision. Third, we design a constraint-wise binary classification approach that scores each constraint individually before aggregating the results, achieving computational efficiency without sacrificing effectiveness.
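To make the decomposition step concrete, the following is a minimal sketch of the incremental construction; the instruction template and the `base_instruction` / `constraints` names are our own illustration, not the paper's implementation.

```python
# Illustrative sketch of incremental constraint decomposition
# (assumed data layout, not the paper's exact implementation).

def decompose(base_instruction: str, constraints: list[str]) -> list[str]:
    """Build a curriculum of instructions with 1, 2, ..., k constraints.

    Instruction k contains the first k constraints, so the model receives
    dense learning signal instead of one sparse reward for the full
    multi-constraint instruction.
    """
    curriculum = []
    for k in range(1, len(constraints) + 1):
        joined = " ".join(constraints[:k])
        curriculum.append(f"{base_instruction} {joined}")
    return curriculum

# Example: a 3-constraint instruction yields 3 incrementally harder variants.
instructions = decompose(
    "Write a product description.",
    [
        "Use fewer than 100 words.",       # hard constraint (rule-verifiable)
        "Include the keyword 'durable'.",  # hard constraint (rule-verifiable)
        "Maintain an enthusiastic tone.",  # soft constraint (semantic)
    ],
)
for inst in instructions:
    print(inst)
```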

Hard Constraint Modeling. For hard constraints that can be directly verified using explicit rules (Pyatkin et al., 2025), we adopt programmatic verification. For an input example (o, c), we define a binary constraint-level reward function:

$$
R_h(o, c) =
\begin{cases}
1, & \text{if } o \text{ satisfies constraint } c \\
0, & \text{otherwise}
\end{cases}
$$
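As a concrete illustration, rule-based checks of this kind can be implemented directly in code; the constraint schema (a type plus an argument) and the specific constraint types below are hypothetical examples, not an exhaustive list from the paper.

```python
import re

# Minimal sketch of programmatic verification for hard constraints.
# The dict-based constraint schema is an assumption for illustration.

def hard_constraint_reward(output: str, constraint: dict) -> int:
    """Return R_h(o, c): 1 if the output satisfies the constraint, else 0."""
    ctype = constraint["type"]
    if ctype == "max_words":
        return int(len(output.split()) <= constraint["limit"])
    if ctype == "contains_keyword":
        return int(constraint["keyword"].lower() in output.lower())
    if ctype == "ends_with_punct":
        return int(bool(re.search(r"[.!?]\s*$", output)))
    raise ValueError(f"Unknown hard constraint type: {ctype}")

# Example usage
print(hard_constraint_reward(
    "This durable bag lasts for years.",
    {"type": "contains_keyword", "keyword": "durable"},
))  # -> 1
print(hard_constraint_reward(
    "word " * 200,
    {"type": "max_words", "limit": 100},
))  # -> 0
```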

Soft Constraint Modeling. To model soft constraints that cannot be verified through rules, while avoiding external supervision and preserving efficiency, we train a binary classification reward model on self-supervised data from §3.1, without external labels. During constraint decomposition, a natural relationship emerges: for constraint $c_k$, the response $o_k$ (generated for the instruction with constraint $c_k$) is likely to satisfy it, while $o_{k-1}$ (generated for the instruction without $c_k$) is not. This allows us to construct training samples: (1) positive sample $(o_k, c_k, \text{label} = 1)$: the response satisfies the constraint; (2) negative sample $(o_{k-1}, c_k, \text{label} = 0)$: the response does not satisfy the constraint.
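The pairing logic can be sketched as follows; this is a minimal illustration assuming `responses[k]` is the model's response to the instruction containing the first k constraints (with `responses[0]` answering the constraint-free instruction), an indexing convention we adopt for exposition.

```python
# Sketch of self-supervised training-pair construction for the
# soft-constraint reward model (assumed field names, for illustration).

def build_training_pairs(responses: list[str], constraints: list[str]) -> list[dict]:
    pairs = []
    for k, c_k in enumerate(constraints, start=1):
        # o_k was generated under c_k, so it likely satisfies the constraint.
        pairs.append({"response": responses[k], "constraint": c_k, "label": 1})
        # o_{k-1} was generated without c_k, so it likely violates it.
        pairs.append({"response": responses[k - 1], "constraint": c_k, "label": 0})
    return pairs

# Example with a single soft constraint
pairs = build_training_pairs(
    ["A plain description.", "An upbeat, enthusiastic description!"],
    ["Maintain an enthusiastic tone."],
)
for p in pairs:
    print(p["label"], "-", p["response"])
```

Because both members of each pair come from the model's own generations during decomposition, no external annotator or stronger model is needed to label the data.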