Do reward models actually consider what the prompt asks?
Exploring whether standard reward models evaluate responses based on prompt context or on response quality alone. This matters because if models ignore prompts, they will fail to align with what users actually want.
Standard reward model training (Bradley-Terry MLE) does not force the model to consider prompts. Since different training samples typically contain distinct response pairs, the model can learn to distinguish chosen from rejected responses based on response features alone, effectively ignoring the prompt. Empirically, the reward gap between chosen and rejected responses stays centered around the same values even when the prompt is replaced with an unrelated one.
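The failure is easy to probe. The sketch below is a minimal illustration, not the paper's code: `reward_model` is a hypothetical callable mapping batches of prompt and response strings to a tensor of scalar rewards. The Bradley-Terry loss depends only on the within-pair score gap, and a prompt-shuffling probe checks whether that gap actually depends on the prompt:

```python
import random

import torch
import torch.nn.functional as F


def bradley_terry_loss(reward_model, prompts, chosen, rejected):
    """Standard pairwise MLE loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    Each pair is compared only against itself, so nothing here penalizes
    a model that scores responses while ignoring the prompt.
    """
    gap = reward_model(prompts, chosen) - reward_model(prompts, rejected)
    return -F.logsigmoid(gap).mean()


def prompt_replacement_probe(reward_model, prompts, chosen, rejected):
    """Diagnostic: recompute the chosen-rejected gap under mismatched prompts.

    If the mean gap barely moves after shuffling, the model is separating
    chosen from rejected on response features alone.
    """
    shuffled = list(prompts)
    random.shuffle(shuffled)
    gap_true = reward_model(prompts, chosen) - reward_model(prompts, rejected)
    gap_shuffled = (reward_model(shuffled, chosen)
                    - reward_model(shuffled, rejected))
    return gap_true.mean().item(), gap_shuffled.mean().item()
```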
This is not a minor calibration issue — it is a structural failure. When the reward model only learns response-level biases (e.g., "longer is better," "confident tone is better"), it cannot generalize to novel prompt-response pairs. A response that happens to be well-written will receive high reward regardless of whether it actually answers the question. This makes RLHF optimize against phantom quality signals — response biases masquerading as prompt alignment.
The Information-Theoretic Reward Decomposition approach (Li et al., 2025) splits the reward into two components without requiring extra models: prompt-free reward (determined solely by response features) and prompt-related reward (derived from the prompt-response interaction). The prompt-free component exposes the model's bias — it reflects preference that has nothing to do with the prompt. The prompt-related component captures genuine alignment between prompt and response.
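One way to picture the decomposition, as a minimal sketch rather than the paper's estimator (which is derived information-theoretically): assume the prompt-free component can be approximated by averaging the reward a response receives under unrelated reference prompts, reusing the hypothetical `reward_model` above.

```python
import torch


def decompose_reward(reward_model, prompt, response, reference_prompts):
    """Split r(x, y) into r_free(y) + r_rel(x, y).

    r_free is approximated as the response's average reward under prompts
    drawn independently of it; the remainder is attributed to the
    prompt-response interaction.
    """
    r_total = reward_model([prompt], [response])[0]
    r_free = torch.stack(
        [reward_model([p], [response])[0] for p in reference_prompts]
    ).mean()
    return r_free, r_total - r_free
```

Under this reading, a pair where the chosen response wins mostly on r_free is decided by response-level bias; a pair where it wins on r_rel is decided by prompt alignment.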
The practical fix: prioritize training samples where the prompt-related reward gap is large relative to the prompt-free reward gap. This focuses learning on samples where the prompt actually matters for the preference, rather than samples where response quality alone determines the outcome.
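A sketch of what that prioritization could look like in a training loop (one plausible weighting scheme, not necessarily the paper's exact criterion; it reuses the hypothetical `decompose_reward` and `reward_model` from the sketches above):

```python
import torch
import torch.nn.functional as F


def prioritized_bt_loss(reward_model, prompts, chosen, rejected,
                        reference_prompts, temperature=1.0):
    """Bradley-Terry loss reweighted toward pairs whose preference is
    driven by the prompt-related gap rather than the prompt-free gap."""
    def components(responses):
        parts = [decompose_reward(reward_model, x, y, reference_prompts)
                 for x, y in zip(prompts, responses)]
        free, rel = zip(*parts)
        return torch.stack(free), torch.stack(rel)

    free_c, rel_c = components(chosen)
    free_r, rel_r = components(rejected)
    # Score each pair by how much of its preference is prompt-related;
    # detach so the weights guide emphasis without adding a gradient path.
    score = (rel_c - rel_r) - (free_c - free_r)
    weights = torch.softmax(score.detach() / temperature, dim=0)
    gap = reward_model(prompts, chosen) - reward_model(prompts, rejected)
    return -(weights * F.logsigmoid(gap)).sum()
```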
The finding connects to a broader pattern. Prompt-insensitivity is a specific mechanism underlying why judge evaluations fail (see "Can LLM judges be fooled by fake credentials and formatting?"). It also parallels "Can model explanations help humans predict what models actually do?": in both cases, what looks like quality evaluation is actually decoupled from the semantically relevant signal.
Source: Reward Models — Information-Theoretic Reward Decomposition for Generalizable RLHF (arXiv 2504.06020)
Related concepts in this collection
- Can LLM judges be fooled by fake credentials and formatting? Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge. Relation: prompt-insensitivity is a mechanism underlying exploitable judge biases.
- Can model explanations help humans predict what models actually do? Asks whether explanations that sound plausible to humans actually help them forecast model behavior on new cases, a gap that matters because RLHF optimizes for plausible explanations, not predictive ones. Relation: a parallel decoupling, where an explanation's apparent quality is decoupled from its predictive value just as reward is decoupled from actual prompt relevance.
- Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training, and whether architectural bias precedes and enables RLHF effects. Relation: response-level bias may compound with attention-level bias.
Original note title: reward models ignore prompt context when evaluating responses — decomposing into prompt-free and prompt-related components reveals and corrects the generalization failure