Tags: Reinforcement Learning for LLMs · Language Understanding and Pragmatics · LLM Reasoning and Architecture

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.

Note · 2026-02-22 · sourced from Reward Models

Standard reward model training (Bradley-Terry MLE) does not force the model to condition on the prompt. Because different training samples typically contain distinct response pairs, the model can learn to separate chosen from rejected responses from response features alone, effectively ignoring the prompt: the reward gap between chosen and rejected responses stays centered around the same values even when the prompts are replaced with unrelated ones.
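A minimal sketch of why the objective permits this. The Bradley-Terry loss depends only on the reward *gap* between chosen and rejected; the toy length-bias reward below is a hypothetical stand-in for a prompt-blind reward model, used purely for illustration:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A degenerate "reward model" that scores the response alone and ignores
# the prompt entirely (here, a pure length bias; purely illustrative).
def response_only_reward(prompt: str, response: str) -> float:
    return 0.1 * len(response.split())

prompt = "What is the capital of France?"
chosen = "The capital of France is Paris, a city on the Seine."
rejected = "Paris."

loss_original = bradley_terry_loss(response_only_reward(prompt, chosen),
                                   response_only_reward(prompt, rejected))

# Swap in a completely unrelated prompt.
swapped = "Explain how quicksort works."
loss_swapped = bradley_terry_loss(response_only_reward(swapped, chosen),
                                  response_only_reward(swapped, rejected))

# The loss is unchanged by the prompt swap: nothing in the objective
# penalizes a reward model for ignoring the prompt.
assert loss_original == loss_swapped
```

Since the loss is already minimized by response-only features whenever those features happen to correlate with the preference labels, gradient descent has no pressure to learn prompt-response interactions.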

This is not a minor calibration issue — it is a structural failure. When the reward model only learns response-level biases (e.g., "longer is better," "confident tone is better"), it cannot generalize to novel prompt-response pairs. A response that happens to be well-written will receive high reward regardless of whether it actually answers the question. This makes RLHF optimize against phantom quality signals — response biases masquerading as prompt alignment.

The Information-Theoretic Reward Decomposition approach (Li et al., 2025) splits the reward into two components without requiring extra models: prompt-free reward (determined solely by response features) and prompt-related reward (derived from the prompt-response interaction). The prompt-free component exposes the model's bias — it reflects preference that has nothing to do with the prompt. The prompt-related component captures genuine alignment between prompt and response.
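As a rough intuition for the decomposition (not the paper's exact construction, which is information-theoretic): the prompt-free component can be sketched by averaging the reward a response receives across a pool of unrelated prompts, with the remainder attributed to the prompt-response interaction. The `toy_reward` below is a hypothetical stand-in for a learned reward model:

```python
from typing import Callable, Sequence

def decompose_reward(
    reward_fn: Callable[[str, str], float],
    prompt: str,
    response: str,
    prompt_pool: Sequence[str],
) -> tuple[float, float]:
    """Split reward_fn(prompt, response) into a prompt-free part and a
    prompt-related remainder.

    Crude estimator (an assumption for illustration): the prompt-free
    reward is the average reward of the response under a pool of
    unrelated prompts; whatever is left over is attributed to the
    prompt-response interaction.
    """
    prompt_free = sum(reward_fn(p, response) for p in prompt_pool) / len(prompt_pool)
    prompt_related = reward_fn(prompt, response) - prompt_free
    return prompt_free, prompt_related

# Hypothetical reward: a response-length bias plus a bonus for keyword
# overlap with the prompt, so both components are nonzero.
def toy_reward(prompt: str, response: str) -> float:
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return 0.1 * len(response.split()) + 0.5 * overlap

pool = ["Explain quicksort.", "Summarize this email.", "Say hello in French."]
free, related = decompose_reward(toy_reward, "What is the capital of France?",
                                 "The capital of France is Paris.", pool)
# `free` captures the length bias; `related` captures the keyword overlap
# that only the matching prompt produces.
```

In this sketch, a response that is merely long inflates `free`, while a response that actually engages with the prompt shows up in `related`.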

The practical fix: prioritize training samples where the prompt-related reward gap is large relative to the prompt-free reward gap. This focuses learning on samples where the prompt actually matters for the preference, rather than samples where response quality alone determines the outcome.
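The prioritization step can be sketched as a per-sample score; the ratio rule below is one hypothetical scoring criterion (the paper's exact criterion may differ), and `toy_reward` again stands in for a learned reward model:

```python
def toy_reward(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a learned reward model: length bias plus
    # a bonus for keyword overlap with the prompt.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return 0.1 * len(response.split()) + 0.5 * overlap

def priority(sample, prompt_pool, reward_fn=toy_reward):
    """Score a (prompt, chosen, rejected) sample by how much the prompt
    matters for the preference: the prompt-related reward gap relative to
    the prompt-free gap. Higher scores mean the prompt drives the label."""
    prompt, chosen, rejected = sample

    def split(response):
        free = sum(reward_fn(p, response) for p in prompt_pool) / len(prompt_pool)
        return free, reward_fn(prompt, response) - free

    free_c, rel_c = split(chosen)
    free_r, rel_r = split(rejected)
    related_gap = rel_c - rel_r
    free_gap = abs(free_c - free_r)
    return related_gap / (free_gap + 1e-8)  # epsilon avoids division by zero

pool = ["Explain quicksort.", "Summarize this email.", "Say hello in French."]
# Preference driven by actually answering the question: high priority.
prompt_driven = ("What is the capital of France?",
                 "The capital of France is Paris.", "Paris is nice.")
# Preference driven by length alone: low priority.
length_driven = ("What is the capital of France?",
                 "A very long rambling answer with many many words here indeed.",
                 "Short.")
```

Training would then upweight or select samples in descending order of this score, e.g. `samples.sort(key=lambda s: priority(s, pool), reverse=True)`, so that gradient updates come from pairs where the prompt genuinely decides the preference.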

The finding connects to a broader pattern: Can LLM judges be fooled by fake credentials and formatting? — prompt-insensitivity is a specific mechanism underlying why judge evaluations fail. It also parallels Can model explanations help humans predict what models actually do? — in both cases, what looks like quality evaluation is actually decoupled from the semantically relevant signal.


Source: Reward Models — Information-Theoretic Reward Decomposition for Generalizable RLHF (arxiv 2504.06020)

Original note title: reward models ignore prompt context when evaluating responses — decomposing into prompt-free and prompt-related components reveals and corrects the generalization failure