How do Q-value models improve action selection compared to value models?
This explores the reinforcement-learning distinction between Q-value models, which score each candidate action in a state, and value models, which score the state itself regardless of which action is taken, and why action-conditioned estimates tend to sharpen action selection. Worth flagging up front: no note in this collection benchmarks Q-functions head-to-head against value functions, so what follows is a lateral read of the adjacent territory the corpus does cover (credit assignment, signal granularity, and where coarse rewards fail). The deeper principle generalizes well beyond the Q-vs-V labels.
The core reason action-conditioned estimates help is granularity: a value model tells you 'this position looks promising' but not 'this move is the reason,' which leaves the agent guessing about credit. With only a state value, ranking actions requires a transition model or lookahead to see where each action leads; a Q-value ranks them directly. Several notes converge on exactly this failure of coarse signals. The strongest is the finding that purely numerical rewards stall reasoning models on plateaus because the number carries no information about *why* an attempt failed or *how* to fix it, and that swapping in chain-of-thought critiques unblocks them (see "Can natural language feedback overcome numerical reward plateaus?"). That's the same complaint a Q-value model answers structurally: it attaches the signal to the specific action rather than the aggregate outcome.
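To make the granularity point concrete, here is a minimal sketch on a toy deterministic environment. The states, rewards, discount factor, and function names are all hypothetical and chosen only for illustration; none of them come from the notes. It uses the standard relations V(s) = max_a Q(s, a) and Q(s, a) = r + γ·V(s') for a deterministic transition.

```python
# Hypothetical 4-state deterministic environment; every name and number
# below is an illustrative assumption, not something taken from the notes.
GAMMA = 0.9

# TRANSITIONS[state][action] = (next_state, immediate_reward)
TRANSITIONS = {
    "start": {"left": ("hall", 0.0), "right": ("pit", 0.5)},
    "hall": {"forward": ("goal", 1.0)},
    "pit": {},   # terminal: no actions available
    "goal": {},  # terminal: no actions available
}


def value_iteration(transitions, gamma, sweeps=50):
    """Estimate V(s) by repeatedly applying V(s) = max_a [r + gamma * V(s')]."""
    v = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if actions:  # terminal states keep a value of 0
                v[state] = max(r + gamma * v[nxt] for nxt, r in actions.values())
    return v


def q_from_v(transitions, v, gamma):
    """Build Q(s, a) = r + gamma * V(s') for every state-action pair."""
    return {
        state: {a: r + gamma * v[nxt] for a, (nxt, r) in actions.items()}
        for state, actions in transitions.items()
    }


def act_with_q(q, state):
    """A Q-table ranks actions directly: no environment model is needed."""
    return max(q[state], key=q[state].get)


def act_with_v(v, transitions, state, gamma):
    """A V-table alone cannot rank actions; it needs the transition model
    to look one step ahead and see where each action leads."""
    actions = transitions[state]
    return max(actions, key=lambda a: actions[a][1] + gamma * v[actions[a][0]])


if __name__ == "__main__":
    v = value_iteration(TRANSITIONS, GAMMA)
    q = q_from_v(TRANSITIONS, v, GAMMA)
    print("V(start):", v["start"])
    print("Q(start, .):", q["start"])
    print("greedy action from Q:", act_with_q(q, "start"))
```

In this toy setup V("start") says only that the state is worth about 0.9, while Q("start", ·) shows that "left" earns that value and "right" is worth only 0.5, even though "right" pays more immediately. Note that `act_with_v` reaches the same choice but only by consulting `TRANSITIONS`, which a model-free agent would not have.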
The corpus also shows the cost of letting reward signals stay holistic. Binary correctness rewards quietly degrade calibration because they don't distinguish a confident-wrong action from a hedged-wrong one, and the fix is to decompose the objective so the model is scored on more than one axis (see "Does binary reward training hurt model calibration?"). Likewise, breaking instruction-following into per-criterion checklists beats one global quality score, precisely because finer credit assignment stops the model from overfitting to superficial features (see "Can breaking down instructions into checklists improve AI reward signals?"). Both are the same move Q-value models make at the architectural level: replace one blunt scalar with a structure that localizes value to the choice being made.
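Both decompositions can be sketched in a few lines. The functions below are illustrative assumptions rather than the cited methods' actual implementations: `calibrated_reward` subtracts a Brier penalty from a binary correctness term, and `checklist_reward` averages hypothetical per-criterion pass/fail checks. The function names, criterion names, and weights are all invented for the example.

```python
from typing import Dict, Optional


def binary_reward(correct: bool) -> float:
    """Holistic baseline: 1 for a correct answer, 0 otherwise."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    `confidence` is the model's stated probability of being correct.
    This is an illustrative form of a two-term reward, not a verified
    reproduction of any specific paper's objective.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def checklist_reward(
    checks: Dict[str, bool], weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted fraction of checklist criteria that passed.

    Criterion names and weights are hypothetical; unweighted criteria
    default to a weight of 1.0.
    """
    if not checks:
        raise ValueError("checks must contain at least one criterion")
    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in checks)
    passed = sum(weights.get(name, 1.0) for name, ok in checks.items() if ok)
    return passed / total


if __name__ == "__main__":
    # The binary reward cannot tell these two wrong answers apart.
    print(binary_reward(False), binary_reward(False))
    # The calibrated reward penalizes the confident miss more heavily.
    print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
    print(checklist_reward(
        {"answers_question": True, "cites_source": True, "under_100_words": False}
    ))
```

Under these assumptions a wrong answer given at 0.9 confidence scores about -0.81 while the same wrong answer at 0.1 confidence scores about -0.01, a distinction the binary reward collapses to a single 0.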
There's a useful cautionary thread too. One note argues that the exploration-exploitation trade-off many RL systems agonize over is partly a measurement artifact that only appears at the token level, and that looking at hidden-state structure dissolves it (see "Is the exploration-exploitation trade-off actually fundamental?"). The lesson for action selection: *how* and *where* you measure value can manufacture problems that aren't fundamental, a reminder that the Q-vs-V choice is partly about choosing the representation at which selection actually happens. And a sobering boundary marker: reinforcement learning with verifiable rewards (RLVR) often just resamples toward solutions already latent in the base model rather than teaching genuinely new moves (see "Does RLVR actually expand what models can reason about?"), so a better action-scorer sharpens selection within an existing repertoire more than it expands it.
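The RLVR finding rests on pass@k comparisons, so it helps to see how that metric can flip as k grows. The sketch below uses the commonly used unbiased estimator pass@k = 1 - C(n - c, k) / C(n, k) for n samples with c correct; whether the cited analysis used exactly this estimator is an assumption, and the sample counts are invented purely to illustrate the crossover.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of n = 100 on two problems.
# The "base" model solves both problems occasionally; the "tuned" model
# solves the first one reliably and the second one never.
N = 100
BASE_COUNTS = [30, 2]
TUNED_COUNTS = [90, 0]

if __name__ == "__main__":
    for k in (1, 50):
        base = mean_pass_at_k(BASE_COUNTS, N, k)
        tuned = mean_pass_at_k(TUNED_COUNTS, N, k)
        print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these made-up counts the tuned model wins at k = 1 (0.45 versus 0.16) but loses at k = 50, because the base model still occasionally solves the second problem while the tuned model never does. That is the shape of the narrowing effect the note describes, not evidence for it.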
So the takeaway you may not have been looking for: the advantage you'd expect from Q-value models, better action selection through localized credit, shows up across this collection as a recurring design pattern. Whenever a coarse scalar reward is decomposed, made action-specific, or enriched with the reason behind it, selection improves; the Q-function is just one well-known instance of that pattern.
Sources (5 notes)
"Can natural language feedback overcome numerical reward plateaus?" Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
"Does binary reward training hurt model calibration?" Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
"Can breaking down instructions into checklists improve AI reward signals?" RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
"Is the exploration-exploitation trade-off actually fundamental?" Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
"Does RLVR actually expand what models can reason about?" Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.