INQUIRING LINE

Are RLVR models worse than non-reasoning models for subjective annotation?

This explores whether the same RLVR (reinforcement learning from verifiable rewards) training that sharpens models on math and code makes them *worse* at subjective tasks like annotation, where there's no single right answer — and the corpus suggests yes, for a specific and instructive reason.


Do RLVR-trained reasoning models lose ground to plain non-reasoning models on subjective annotation, where legitimate human disagreement is the signal rather than noise? The corpus gives a fairly direct answer: yes, and the mechanism is the very thing that makes RLVR good at math. RLVR optimizes toward a single verifiable correct answer, and that pressure erodes a model's ability to represent multiple valid interpretations. One note finds that RLVR-trained models degrade significantly at predicting how humans *disagree*, especially when disagreement is high: the optimization for deterministic correctness suppresses sensitivity to legitimate variance (see "Why do reasoning models fail at predicting disagreement?"). So the deficit isn't incidental; it's the predictable cost of training a model to collapse a distribution into one answer.
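
One way to make "predicting how humans disagree" concrete is to compare a model's predicted label distribution against the empirical distribution of annotator votes. The sketch below is a minimal illustration, not the cited note's evaluation protocol: the choice of Jensen-Shannon divergence, the function names, and the example votes are all assumptions made for this example.

```python
import math
from collections import Counter

def label_distribution(votes, labels):
    """Empirical distribution of annotator votes over a fixed label order."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]

def entropy(p):
    """Shannon entropy in bits; higher means more annotator disagreement."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical item on which four annotators split evenly.
labels = ["toxic", "ok"]
human = label_distribution(["toxic", "toxic", "ok", "ok"], labels)

collapsed = [1.0, 0.0]    # a model that commits to one answer
pluralistic = [0.6, 0.4]  # a model that keeps some spread

print("human entropy (bits):", entropy(human))
print("collapsed   JS:", js_divergence(human, collapsed))
print("pluralistic JS:", js_divergence(human, pluralistic))
```

On a high-disagreement item like this one, the single-answer prediction sits further from the human distribution than the spread-out prediction does, which is the kind of gap the note describes.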

What makes this more than a one-paper observation is *why* RLVR collapses that way. A cluster of notes argues that RLVR doesn't expand a model's reasoning so much as narrow its sampling toward answers the base model already favored: pass@k analysis shows base models beating RLVR models at high k (see "Does RLVR actually expand what models can reason about?"). The same on-policy pressure produces 'capability boundary collapse,' where exploitation crowds out exploration (see "Why does RLVR training narrow a model's problem solving ability?"), and RLVR is better read as activating pretrained strategies than as teaching anything new (see "What does reward learning actually do to model reasoning?"). For an objective task with one answer, narrowing is a feature. For subjective annotation, where the goal is to *preserve* a spread of valid human views, narrowing is exactly the wrong move: the model becomes confidently single-minded where it should stay pluralistic.
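
The pass@k comparison can be made concrete with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below is illustrative only: the per-problem counts are invented to show how a crossover between a base model and an RLVR model could arise, and are not results from the cited note.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100  # samples drawn per problem (hypothetical)

# Hypothetical counts of correct samples on two problems.
base_correct = [2, 1]   # rarely right, but occasionally right on both
rlvr_correct = [60, 0]  # often right on one, never right on the other

for k in (1, 64):
    base = mean_pass_at_k(base_correct, N, k)
    rlvr = mean_pass_at_k(rlvr_correct, N, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the RLVR model leads at k=1 and the base model leads at k=64, because the base model still assigns a little probability to the second problem's solution while the narrowed model assigns none.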

There's a deeper reason subjective annotation resists this kind of optimization at all: annotations aren't one kind of thing. Behavioral-science work in the corpus decomposes annotation responses into three distinct signals (genuine preferences, non-attitudes, and constructed preferences) that are distinguishable only by whether they're consistent across measurement conditions (see "Do all annotation responses measure the same underlying thing?"). Treating them uniformly contaminates reward-model training. An RLVR model trained to find 'the' answer has no machinery for telling a stable genuine preference apart from a preference a person constructed on the spot, so it flattens precisely the structure subjective annotation is supposed to capture.
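
The idea of separating the three signals by consistency across measurement conditions can be sketched as a toy rule. The corpus note does not specify a procedure, so everything here is an assumption for illustration: the decision rule, the 0.8 threshold, the condition names, and the function names are all hypothetical.

```python
from collections import Counter

def modal_share(responses):
    """Fraction of repeated responses that match the most common one."""
    return Counter(responses).most_common(1)[0][1] / len(responses)

def classify_signal(by_condition, threshold=0.8):
    """Toy classifier over one annotator's repeated responses to one item.

    by_condition maps a measurement condition (e.g. a framing) to the
    list of responses given under it. Assumed rule, not from the source:
      - unstable even within a condition   -> non-attitude
      - stable within, same answer across  -> genuine preference
      - stable within, answer shifts       -> constructed preference
    """
    within = min(modal_share(r) for r in by_condition.values())
    if within < threshold:
        return "non-attitude"
    modes = {Counter(r).most_common(1)[0][0] for r in by_condition.values()}
    return "genuine preference" if len(modes) == 1 else "constructed preference"

# Hypothetical response patterns under two framings of the same item.
stable = {"neutral": ["A", "A", "A"], "loaded": ["A", "A", "A"]}
shifted = {"neutral": ["A", "A", "A"], "loaded": ["B", "B", "B"]}
noisy = {"neutral": ["A", "B", "A"], "loaded": ["B", "A", "B"]}

for name, pattern in [("stable", stable), ("shifted", shifted), ("noisy", noisy)]:
    print(name, "->", classify_signal(pattern))
```

The point of the sketch is only that the distinction needs repeated measurement under varied conditions; a single label per item, which is all an outcome-verified reward sees, cannot carry it.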

The interesting wrinkle is that 'reasoning model' and 'better at the task' may be separable in both directions. One note shows that RLVR's behavioral activation of reasoning and its benchmark gains are distinct phenomena: the gains can be memorization on contaminated data rather than genuine reasoning (see "Can genuine reasoning activation coexist with contaminated benchmarks?"), and a companion analysis finds that clean benchmarks expose much of the 'improvement' as dataset reconstruction (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). A parallel finding in multimodal models shows that verbose chain-of-thought *degrades* fine-grained perception because it optimizes the wrong bottleneck (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). The pattern across all of these: reasoning-style RL helps when the task has a verifiable target and hurts when the real objective is perceptual fidelity or distributional faithfulness.

If there's a constructive path, the corpus hints that it lies in changing what gets rewarded rather than abandoning RL. The view that RL post-training teaches a model *when* to reason rather than *how* (see "Does RL post-training create reasoning or just deploy it?") suggests that subjective tasks need the model to learn *not* to deploy collapse-to-one-answer behavior. Information-theoretic, annotation-free reward schemes that score a step's contribution rather than only the final answer's correctness (see "Can we reward reasoning steps without human annotation?") point toward reward signals that don't punish legitimate ambiguity. The takeaway you might not have gone looking for: 'better at reasoning' and 'better at representing human disagreement' are not the same axis, and optimizing hard for the first can quietly cost you the second.


Sources (10 notes)

Why do reasoning models fail at predicting disagreement?

RLVR-trained models degrade significantly at predicting human disagreement distributions, especially when variance is high. The optimization signal for deterministic correctness actively erodes the model's ability to represent multiple valid interpretations.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.
