Reinforcement Learning for LLMs

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This note explores whether AI grading systems actually evaluate relevance to the question asked or merely reward response-level patterns.

Note · 2026-02-22 · sourced from Reward Models

Post angle for Medium — the evaluation infrastructure behind AI alignment has a fundamental flaw

The hook: Your AI's grading system is ignoring the question. When researchers swapped prompts while keeping responses the same, reward model preference scores barely changed. The system that's supposed to ensure AI gives good answers to your questions is actually just evaluating whether the response sounds good — regardless of what was asked.
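A quick way to see this for yourself is to score the same response against its real prompt and against an unrelated one, then compare. Below is a minimal sketch assuming a Hugging Face sequence-classification reward model; the checkpoint is one publicly available example (substitute whatever you use), and the toy prompts are illustrative, not from the source.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# One publicly available reward model checkpoint; swap in your own.
MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Score a (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

prompt_a = "How do I reverse a linked list in Python?"
prompt_b = "What should I cook for dinner tonight?"
response = "Iterate through the list, re-pointing each node's next pointer..."

# A prompt-sensitive evaluator should score the same response very
# differently under a matching vs. an unrelated prompt. If these two
# numbers are close, the model is scoring the response, not the answer.
print(reward(prompt_a, response), reward(prompt_b, response))
```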

The mechanism: As "Do reward models actually consider what the prompt asks?" argues, standard Bradley-Terry training lets reward models learn to distinguish good from bad responses without ever checking whether the response matches the prompt. Responses dominate the reward signal. This means RLHF — the dominant approach to making AI helpful and safe — is optimizing against phantom quality signals.
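For concreteness, the standard Bradley-Terry objective is just -log sigmoid(r_chosen - r_rejected): it compares two scalar rewards per preference pair and never asks how either score was produced. A minimal sketch, with toy numbers for illustration:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise objective: -log sigmoid(r_chosen - r_rejected).
    # The loss only ever sees two scalar rewards per pair. If a prompt-free
    # feature (length, fluency, agreeableness) already separates chosen from
    # rejected, the model can drive this loss down without reading the prompt.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy illustration: rewards computed from responses alone still yield a
# low loss whenever the chosen responses merely "sound better".
r_chosen = torch.tensor([1.8, 0.9, 2.1])
r_rejected = torch.tensor([0.2, -0.5, 1.0])
print(bradley_terry_loss(r_chosen, r_rejected).item())
```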

The four biases it enables: As "Can counterfactual invariance eliminate reward hacking biases?" lays out, prompt-insensitivity creates an opening for four distinct biases — length bias (longer = better), sycophancy (agreement = better), concept shortcuts, and demographic discrimination. All stem from spurious correlations that the model treats as genuine quality signals because it never checks whether the response actually addresses the prompt.
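The length bias in particular is cheap to probe: if reward correlates with raw response length across answers of similar substance, the evaluator is using the shortcut. A minimal sketch; the helper name and toy data are illustrative:

```python
import numpy as np

def length_bias(scores: list[float], responses: list[str]) -> float:
    # Pearson correlation between response length (in characters) and reward.
    # Strongly positive on substance-matched data suggests the
    # "longer = better" shortcut; a prompt-grounded evaluator should sit
    # near zero.
    lengths = np.array([len(r) for r in responses], dtype=float)
    return float(np.corrcoef(lengths, np.array(scores, dtype=float))[0, 1])

# Toy example: the longest answer gets the highest reward for no
# content-related reason.
scores = [0.4, 1.1, 2.0]
texts = ["Short.",
         "A medium-length answer with a bit more detail.",
         "An extremely long answer " * 12]
print(length_bias(scores, texts))  # strongly positive -> length shortcut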

Three converging fixes from independent teams:

  1. Decompose the reward — split into prompt-free and prompt-related components, then prioritize training on samples where the prompt matters
  2. Apply counterfactual invariance — ensure rewards stay constant when irrelevant features change (see the sketch after this list)
  3. Let the evaluator think — as "Can reward models benefit from reasoning before scoring?" documents, three teams independently converged on the same conclusion: reward modeling is a reasoning task, and chain-of-thought before scoring enables adaptive evaluation
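Fix 2 can be expressed as an auxiliary loss term: alongside the usual preference loss, penalize any reward shift between a chosen response and a counterfactual copy in which only an irrelevant feature was changed. This is a sketch under assumed interfaces, not any team's actual implementation; the `reward_model` signature and batch field names are hypothetical.

```python
import torch
import torch.nn.functional as F

def ci_regularized_loss(reward_model, batch, lam: float = 1.0) -> torch.Tensor:
    # Hypothetical interface: reward_model(prompts, responses) -> tensor of
    # scalar rewards. `batch` is assumed to carry a counterfactual copy of
    # each chosen response in which only an irrelevant feature was altered
    # (length padding, agreeable framing, a demographic marker) while the
    # substance is preserved. Field names are illustrative.
    r_c = reward_model(batch["prompt"], batch["chosen"])
    r_r = reward_model(batch["prompt"], batch["rejected"])
    r_cf = reward_model(batch["prompt"], batch["chosen_counterfactual"])

    preference = -F.logsigmoid(r_c - r_r).mean()  # standard Bradley-Terry term
    invariance = (r_c - r_cf).pow(2).mean()       # penalize reward shifts under
                                                  # irrelevant-feature changes
    return preference + lam * invariance
```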

The broader frame: The bottleneck on AI improvement isn't just model capability — it's evaluator capability. The system we use to tell AI what's good has been quietly ignoring half the input. Fixing this requires treating reward modeling not as a classification task but as a reasoning task.


Source: Reward Models

Original note title: The reward model's blind spot — why your AI's grading system ignores the question