Tags: Reinforcement Learning for LLMs · Language Understanding and Pragmatics · Psychology and Social Cognition

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

Note · 2026-02-22 · sourced from Reward Models

Reward hacking is not one problem but four, each stemming from a different spurious correlation in the training data:

  1. Length bias — the model learns that longer outputs receive higher rewards, regardless of content quality. The correlation between length and human preference exists in training data but is not causal.
  2. Sycophancy bias — the model learns to agree with user assertions, even incorrect ones, because agreeable responses correlate with higher preference ratings.
  3. Concept bias — the model develops unintended shortcuts when making predictions, learning surface-level concept associations rather than genuine quality assessment.
  4. Discrimination bias — the model implicitly develops preferences correlated with demographic features in the training data.
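The first of these shortcuts is also the easiest to measure: if predicted rewards correlate strongly with response length on held-out data where length is unrelated to quality, the model has likely learned the length shortcut. A minimal detection sketch with toy data (the reward scores here are illustrative, not from any real model):

```python
import statistics

def length_bias_score(responses, rewards):
    """Pearson correlation between response word count and reward.

    A strong positive correlation on held-out data where length is
    unrelated to quality suggests the reward model learned a length
    shortcut rather than genuine quality assessment."""
    lengths = [len(r.split()) for r in responses]
    mean_l, mean_r = statistics.fmean(lengths), statistics.fmean(rewards)
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(lengths, rewards))
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    var_r = sum((r - mean_r) ** 2 for r in rewards)
    return cov / (var_l ** 0.5 * var_r ** 0.5)

# Toy data: rewards that track length almost exactly.
responses = ["short answer", "a somewhat longer answer here",
             "an even longer and more padded answer with extra words"]
rewards = [0.2, 0.5, 0.9]
print(length_bias_score(responses, rewards))  # close to 1.0: strong length bias
```

The same probe works for the other three biases by swapping the length feature for an agreement marker, a surface concept, or a demographic signal.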

Standard reward model training (Bradley-Terry MLE) cannot distinguish causal from spurious associations. The model maximizes the margin between chosen and rejected responses, so spurious features that happen to correlate with preference get baked in. As argued in "Do reward models actually consider what the prompt asks?", the model is already learning response-level biases rather than prompt-aligned preferences; spurious correlations compound this.
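The Bradley-Terry objective makes the problem concrete: the negative log-likelihood depends only on the reward margin, so the training signal is identical whether the margin comes from genuine quality or from a spurious feature. A minimal sketch:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of preferring chosen over rejected.

    The loss depends only on the margin r_chosen - r_rejected, so any
    feature that widens the margin on average -- causal or spurious --
    reduces the loss equally."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# A genuine quality gap and a pure length artifact produce the same loss
# whenever they produce the same margin: the objective cannot tell them apart.
print(bradley_terry_loss(2.0, 1.0) == bradley_terry_loss(10.0, 9.0))  # True
```

This is why the fix has to come from an added constraint rather than from more preference data: more pairs with the same spurious correlation only sharpen the shortcut.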

The Causal Reward Model (CRM) applies counterfactual invariance: reward predictions must remain consistent under interventions on irrelevant aspects of the input. If altering response length, tone of agreement, or demographic signals changes the reward without changing actual quality, the model has learned a spurious feature. The counterfactual invariance constraint forces the model to isolate the causal features — the ones that actually determine quality.
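The constraint can be sketched as an auxiliary penalty: construct a counterfactual input that alters only a quality-irrelevant attribute, then penalize any change in predicted reward. This is a simplified illustration of the idea, not the paper's exact formulation; the function names and the toy reward model are hypothetical:

```python
def counterfactual_invariance_penalty(reward_fn, response, make_counterfactual):
    """Squared reward shift under an intervention on an irrelevant attribute.

    make_counterfactual edits the response along a quality-irrelevant axis
    (length, tone of agreement, demographic signal); a causal reward model
    should score both versions identically, giving zero penalty."""
    r_orig = reward_fn(response)
    r_cf = reward_fn(make_counterfactual(response))
    return (r_orig - r_cf) ** 2

# Toy reward model with a length shortcut: reward = word count.
length_hacked_reward = lambda resp: float(len(resp.split()))
pad = lambda resp: resp + " indeed truly certainly"  # adds words, not content

penalty = counterfactual_invariance_penalty(
    length_hacked_reward, "the capital of France is Paris", pad)
print(penalty)  # nonzero: the reward moved under a quality-irrelevant edit
```

In training, a term like this would be added to the Bradley-Terry loss, pushing the model toward features that survive the intervention.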

This connects to the broader pattern in "Does transformer attention architecture inherently favor repeated content?": sycophancy has both an attention-level and a reward-model-level component. Fixing the reward model alone is insufficient if the attention mechanism also biases toward agreement; fixing attention alone is insufficient if the reward model reinforces the bias.


Source: Reward Models — Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment (arxiv 2501.09620)

Original note title: causal reward modeling via counterfactual invariance addresses four distinct reward hacking biases that standard training cannot eliminate