Why do reward models fail when they ignore the prompt context?
This explores why reward models, the AI graders that score responses during RLHF training, break down when they ignore the prompt and judge a response on its own. The corpus has a clean diagnosis: standard reward models quietly learn *response-level* habits instead of *prompt-response alignment*. The sharpest evidence is a swap test: keep a response identical but change the question it was supposedly answering, and the reward score barely moves (see "Why do reward models ignore what question was asked?"). That's the tell. The model isn't grading whether the answer fits the question; it's grading whether the answer *looks* good: fluent, confident, well-formatted. So a polished response that's irrelevant to what was asked still earns high marks, and the training signal becomes a phantom: you're optimizing against the appearance of quality rather than actual helpfulness (see "Do reward models actually consider what the prompt asks?").
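The swap test can be sketched in a few lines. This is a minimal illustration, not the procedure from the cited note: both scorers (`toy_prompt_blind_reward`, `toy_prompt_aware_reward`), the helper `swap_sensitivity`, and the example pairs are hypothetical stand-ins, and a real test would call an actual reward model in their place.

```python
def _words(text):
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def toy_prompt_blind_reward(prompt, response):
    """Hypothetical scorer that never looks at `prompt`: rewards length and a clean ending."""
    n_words = len(response.split())
    ends_cleanly = 1.0 if response.rstrip().endswith(".") else 0.0
    return min(n_words, 50) / 50 + ends_cleanly


def toy_prompt_aware_reward(prompt, response):
    """Hypothetical scorer that adds a crude relevance term (word overlap with the prompt)."""
    response_words = _words(response)
    overlap = len(_words(prompt) & response_words) / max(len(response_words), 1)
    return toy_prompt_blind_reward(prompt, response) + overlap


def swap_sensitivity(reward_fn, pairs):
    """Mean absolute score change when each response is re-scored under every other prompt."""
    deltas = []
    for i, (prompt, response) in enumerate(pairs):
        original = reward_fn(prompt, response)
        for j, (other_prompt, _) in enumerate(pairs):
            if i != j:
                deltas.append(abs(original - reward_fn(other_prompt, response)))
    return sum(deltas) / len(deltas)


pairs = [
    ("How do I reverse a list in Python?", "Call reversed(xs) or use the slice xs[::-1]."),
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Why is the sky blue?", "Air scatters short blue wavelengths more than red ones."),
]

blind = swap_sensitivity(toy_prompt_blind_reward, pairs)
aware = swap_sensitivity(toy_prompt_aware_reward, pairs)
print(f"prompt-blind scorer: mean |score change| = {blind:.3f}")
print(f"prompt-aware scorer: mean |score change| = {aware:.3f}")
```

The prompt-blind scorer yields a sensitivity of exactly zero by construction, since the prompt never enters its computation. A real reward model that lands near zero on this kind of check is showing the same symptom.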
The fix that keeps surfacing is decomposition: split the reward into a prompt-free part (how good the response looks in isolation) and a prompt-related part (how well it answers *this* question), so you can see the blind spot and correct it directly (see "Do reward models actually consider what the prompt asks?"). This mirrors a finding from the feedback-signal side of the corpus: a single scalar score is a lossy container. Real feedback carries two separable things, an *evaluative* signal (how well it did) and a *directive* one (what should change). A scalar keeps the first and discards the specifics of the second, which is exactly the kind of context a richer signal preserves (see "Can scalar rewards capture all the information in agent feedback?"). The reward model's prompt-blindness is one instance of that general lossiness.
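One way to make the decomposition concrete, offered as an assumption rather than as the method from the cited note: treat the prompt-free part as the response's average score across unrelated reference prompts, and the prompt-related part as what remains once that average is subtracted from the score under the true prompt. The scorer `toy_reward`, the helper `decompose`, and all example strings below are hypothetical.

```python
def _words(text):
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def toy_reward(prompt, response):
    """Hypothetical scorer: mostly surface quality (length), plus a small relevance term."""
    surface = min(len(response.split()), 40) / 40
    response_words = _words(response)
    relevance = len(_words(prompt) & response_words) / max(len(response_words), 1)
    return 0.8 * surface + 0.2 * relevance


def decompose(reward_fn, prompt, response, reference_prompts):
    """Split a score into (prompt_free, prompt_related).

    Assumption: prompt_free is the mean score over unrelated reference prompts,
    and prompt_related is the remainder under the true prompt.
    """
    total = reward_fn(prompt, response)
    prompt_free = sum(reward_fn(p, response) for p in reference_prompts) / len(reference_prompts)
    return prompt_free, total - prompt_free


prompt = "What is the capital of France?"
reference_prompts = [
    "How do I reverse a list in Python?",
    "Why is the sky blue?",
    "How long should I boil an egg?",
]
on_topic = "The capital of France is Paris."
off_topic = (
    "Our award-winning platform delivers seamless, scalable synergy "
    "across every vertical and every market."
)

for name, response in [("on-topic", on_topic), ("off-topic", off_topic)]:
    free, related = decompose(toy_reward, prompt, response, reference_prompts)
    print(f"{name}: prompt_free={free:.3f} prompt_related={related:.3f}")
```

In this toy setup the off-topic response collects its entire score from the prompt-free part, which is the blind spot the decomposition is meant to expose.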
What's worth noticing is that this isn't just a reward-model quirk. It's the same failure shape that shows up when *any* language model ignores its context. Research in the corpus shows that models generate outputs contradicting their own context because strong parametric priors from training override the information sitting right in front of them, and that plain textual prompting can't fix it: you need to intervene in the model's internal representations (see "Why do language models ignore information in their context?"). A reward model is a language model wearing a judge's robe, so it inherits the same disease: trained-in habits about what "good" looks like drown out the specific question being asked.
The corpus also points toward a more interesting cure than patching biases: make the grader *think* before it scores. Three independent teams (RRM, RM-R1, DeepSeek-GRM) found that adding a chain-of-thought reasoning trace before the reward judgment raises the ceiling of what reward models can evaluate, because reasoning forces the model to engage with the prompt-response relationship rather than pattern-match on surface quality (see "Can reward models benefit from reasoning before scoring?"). A related thread asks whether you need an external grader at all: models can be trained to internalize self-evaluation, computing their own reward in the unused space after their answer (see "Can models learn to evaluate their own work during training?").
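The reason-then-score pattern can be sketched as a judge prompt that asks for an explanation first and a parseable score last. This is an illustrative assumption, not the format used by RRM, RM-R1, or DeepSeek-GRM: the template wording, the `Score:` convention, and the names `judge_with_reasoning`, `parse_score`, and `fake_generate` are all hypothetical, and `fake_generate` returns canned text in place of a real model call.

```python
import re

# Hypothetical judge template: reasoning about the question comes before the score.
JUDGE_TEMPLATE = (
    "You are grading a response.\n\n"
    "Question:\n{prompt}\n\n"
    "Response:\n{response}\n\n"
    "First restate what the question asks, then explain whether the response "
    "answers it. Finish with a line of the form 'Score: <integer 1-10>'."
)


def parse_score(judge_output):
    """Return the last 'Score: N' value in the text, or None if absent or out of range."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def judge_with_reasoning(generate, prompt, response):
    """Ask the judge to reason first, then extract the numeric score from its output."""
    judge_output = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    return judge_output, parse_score(judge_output)


def fake_generate(judge_prompt):
    """Canned stand-in for a model call, so the sketch runs without any model."""
    return (
        "The question asks for the capital of France. The response describes a "
        "software platform and never names a city, so it does not answer it.\n"
        "Score: 2"
    )


trace, score = judge_with_reasoning(
    fake_generate,
    "What is the capital of France?",
    "Our award-winning platform delivers seamless, scalable synergy.",
)
print(trace)
print("parsed score:", score)
```

Taking the last match rather than the first is a deliberate choice in this sketch: a reasoning trace may mention candidate scores along the way, and only the final line is meant as the verdict.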
The thing you might not have known you wanted to know: a prompt-blind reward model doesn't fail randomly. It fails *systematically*, rewarding length, fluency, and format in predictable ways. That's why a model trained against it learns to write answers that are beautiful and beside the point, and why "the grader never read the question" turns out to be one of the quiet root causes of AI sycophancy and verbosity.
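Because the failure is systematic, it can be measured. A minimal sketch of one such check, under stated assumptions: score several responses of different lengths against a single prompt and compute the Pearson correlation between word count and score. The scorer `toy_length_reward`, the helper `length_bias`, and the example responses are hypothetical; with a real reward model you would want many prompts and responses that vary in length but not in correctness.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def toy_length_reward(prompt, response):
    """Hypothetical prompt-blind scorer: longer responses score higher, capped at 50 words."""
    return min(len(response.split()), 50) / 50


def length_bias(reward_fn, prompt, responses):
    """Correlation between response word count and reward for a fixed prompt."""
    lengths = [len(r.split()) for r in responses]
    scores = [reward_fn(prompt, r) for r in responses]
    return pearson(lengths, scores)


prompt = "What is the capital of France?"
responses = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris, which, as many people around the world "
    "already know, is a large and famous city.",
]

bias = length_bias(toy_length_reward, prompt, responses)
print(f"length-reward correlation: {bias:.3f}")
```

All three responses give the same correct answer, so a correlation near 1 here means the scorer is paying for padding rather than for content.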
Sources (6 notes)
When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.
Standard reward models learn response-level biases instead of prompt-response alignment, causing them to reward responses that are well-written but irrelevant. Decomposing reward into prompt-free and prompt-related components reveals this failure and enables targeted fixes.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.