Reinforcement Learning for LLMs

Can reasoning during evaluation reduce judgment bias in LLM judges?

Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

J1 applies the DeepSeek-R1 RL approach — training models to reason via GRPO with verifiable rewards — to the evaluation problem rather than the generation problem. The insight: judgment is a reasoning task that benefits from the same extended thinking that improves math and coding.

The challenge is that most evaluation tasks are not naturally verifiable. Math problems have correct answers; judging whether response A is better than response B does not. J1 solves this by constructing synthetic data: for each prompt (verifiable or not), generate a high-quality and a low-quality response pair. The pairwise judgment then has a verifiable correct answer — which response is better — enabling RL training with outcome-based rewards.
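The pair-construction trick can be sketched in a few lines. This is a minimal illustration under stated assumptions (the names `JudgePair`, `make_pair`, and `outcome_reward` are mine, not from the J1 paper): pair a high-quality response with a deliberately degraded one, randomize order, and score the judge against the known label.

```python
# Sketch of J1-style synthetic preference pairs (illustrative names):
# each pairwise judgment has a verifiable correct answer by construction.
import random
from dataclasses import dataclass

@dataclass
class JudgePair:
    prompt: str
    response_a: str
    response_b: str
    better: str  # "A" or "B" -- the verifiable label

def make_pair(prompt: str, good: str, bad: str, rng: random.Random) -> JudgePair:
    # Randomize position so the label is not confounded with order
    # (guards against position bias leaking into the training signal).
    if rng.random() < 0.5:
        return JudgePair(prompt, good, bad, better="A")
    return JudgePair(prompt, bad, good, better="B")

def outcome_reward(pair: JudgePair, judge_verdict: str) -> float:
    # Verifiable outcome reward: 1 if the judge picked the known-better
    # response, 0 otherwise. This is what makes RL training possible on
    # otherwise non-verifiable evaluation tasks.
    return 1.0 if judge_verdict == pair.better else 0.0

rng = random.Random(0)
pair = make_pair("Explain recursion.", good="A clear answer...",
                 bad="An evasive answer...", rng=rng)
print(outcome_reward(pair, pair.better))  # 1.0
```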

GRPO with a seed prompt designed to encourage thinking produces judges that reason about their evaluations rather than pattern-matching on surface features. This directly addresses Can LLM judges be fooled by fake credentials and formatting?: if judges can be manipulated via authority bias, verbosity bias, position bias, and beauty bias, then training them to think through their judgments — explicitly evaluating content rather than surface features — should mitigate those biases.
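The outcome rewards above plug into GRPO's group-relative advantage. A minimal sketch of that normalization step (simplified: no KL penalty, no policy update, purely illustrative): sample several judgments for one prompt, score each 0/1 against the known label, and normalize within the group, so rollouts that reason their way to the correct verdict get positive advantage.

```python
# Group-relative advantage on 0/1 outcome rewards, as in GRPO's reward
# normalization (simplified sketch, not a full training loop).
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:          # all rollouts tied: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# e.g. 4 sampled judgments for one pair, 3 correct and 1 wrong:
print(group_advantages([1.0, 1.0, 0.0, 1.0]))
```

Note the degenerate case: when every sampled judgment is right (or every one is wrong), the group carries no gradient signal, which is one reason the synthetic pairs need to be hard enough that the judge sometimes errs.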

The generalist judge design is notable: training on both verifiable (math, code) and non-verifiable (WildChat user prompts) tasks produces a judge that transfers across task types. This avoids the domain-specific evaluator trap where each task type requires its own evaluation model.

The connection to Does critiquing errors teach deeper understanding than imitating correct answers? is architectural: both papers find that training on evaluation/critique tasks produces deeper engagement with the material than training on generation. CFT (Critique Fine-Tuning) produces better understanding through critique; J1 produces better evaluation through reasoning about judgment.

Three-way convergence on reward reasoning: J1 is not an isolated finding. Three independent teams converge on the same insight — that reward modeling is a reasoning task benefiting from extended thinking:

  1. RRM (Reward Reasoning Models) — uses RL to self-evolve reward reasoning capabilities without explicit reasoning traces; introduces ELO rating and knockout tournament for multi-response scenarios
  2. RM-R1 — introduces Chain-of-Rubrics (CoR): the model first categorizes inputs as chat vs reasoning, then applies rubric-based evaluation for chat and correctness-first judgment for reasoning — task-type perception shapes evaluation strategy
  3. DeepSeek-GRM — proposes Self-Principled Critique Tuning (SPCT): the model generates principles adaptively and critiques accurately through online RL; uses a meta RM to guide voting for inference-time scaling

All three show that reward models that think before scoring produce substantially better evaluations. The convergence from independent teams strengthens an affirmative answer to Can reward models benefit from reasoning before scoring?.

Self-Taught Evaluators as fully unsupervised variant: Self-Taught Evaluators (Wang et al., 2024) removes even the need for initial synthetic data design. Starting from unlabeled instructions, the method iteratively: (1) generates contrasting response pairs via prompting (one designed to be inferior), (2) samples LLM-as-a-Judge reasoning traces and judgments, (3) filters for correct judgments, (4) trains on the filtered data. Each iteration improves the judge, which produces better training data for the next iteration. This is the self-improvement loop applied specifically to evaluation quality — a complementary approach to Why do self-improvement loops eventually stop improving?.
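The four-step iteration can be sketched as a loop with stubbed model calls. This is a hedged outline of the method's control flow, not the authors' implementation: `generate_pair`, `sample_judgment`, and `finetune` are placeholder callables standing in for LLM calls.

```python
# One Self-Taught Evaluators iteration (Wang et al., 2024), control flow
# only -- the three callables are stubs for the actual LLM interactions.
from typing import Callable, List, Tuple

def self_taught_iteration(
    instructions: List[str],
    generate_pair: Callable[[str], Tuple[str, str]],             # -> (good, inferior)
    sample_judgment: Callable[[str, str, str], Tuple[str, str]], # -> (trace, verdict)
    finetune: Callable[[List[tuple]], None],
    samples_per_pair: int = 4,
) -> int:
    """Build contrasting pairs, sample judge reasoning traces, keep only
    traces whose verdict matches the known label, train on what's kept."""
    kept = []
    for x in instructions:
        good, bad = generate_pair(x)                    # (1) contrasting pair
        for _ in range(samples_per_pair):
            trace, verdict = sample_judgment(x, good, bad)  # (2) judge rollout
            if verdict == "A":                          # (3) filter: "A" = known-good
                kept.append((x, good, bad, trace))
    finetune(kept)                                      # (4) train the next judge
    return len(kept)
```

Each call to `self_taught_iteration` would use the newly finetuned judge inside `sample_judgment`, which is what closes the self-improvement loop.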


Source: Reasoning o1 o3 Search, Reward Models

Original note title: RL trains LLM judges to think during evaluation by converting judgment tasks to verifiable problems with synthetic data