J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias.
The ability to judge model predictions is vital at all stages of development: during training and inference it provides a reward or verification signal, and during final benchmark evaluation it measures performance.
First, we convert the judgment task into a verifiable task for both verifiable prompts (e.g., math problems) and typically subjective, non-verifiable prompts (e.g., user prompts from WildChat (Zhao et al., 2024)). This enables us to train a generalist judge across many types of tasks. To achieve this, we construct synthetic data for both categories of prompts by generating a high-quality and a low-quality response, such that pairwise judgment predictions are verifiable during training. We then train both the reasoning steps and the judgment using Group Relative Policy Optimization (GRPO; Shao et al., 2024), with a seed prompt and reward schemes designed to encourage thinking during judgment, analogous to the approach of DeepSeek-R1 (Guo et al., 2025).
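To make this concrete, the sketch below illustrates one way a verifiable pairwise judgment reward and the group-relative advantages used by GRPO could be computed. The <think> tags, the "Verdict: A/B" format, and the reward weights are illustrative assumptions for this sketch, not the paper's exact prompt format or reward scheme.

```python
import re

def pairwise_judgment_reward(judge_output: str, gold_label: str) -> float:
    """Assign a verifiable reward to a single judge rollout.

    Assumes (for illustration only) that the judge is prompted to reason
    inside <think>...</think> tags and then emit a final verdict line such
    as "Verdict: A" or "Verdict: B".
    """
    reward = 0.0

    # Format reward: the rollout must contain an explicit thinking block,
    # which incentivizes chain-of-thought reasoning during judgment.
    if re.search(r"<think>.*?</think>", judge_output, flags=re.DOTALL):
        reward += 0.1

    # Verdict reward: the pairwise choice is verifiable because the synthetic
    # pair has a known higher-quality response (gold_label in {"A", "B"}).
    match = re.search(r"Verdict:\s*([AB])", judge_output)
    if match and match.group(1) == gold_label:
        reward += 1.0

    return reward


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each rollout's reward
    by the mean and standard deviation of its group of sampled judgments."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Example usage: score a group of sampled judgments for one training prompt
# where response "A" is known to be the higher-quality synthetic response.
group_outputs = [
    "<think>A cites the correct formula; B does not.</think> Verdict: A",
    "Verdict: B",
]
rewards = [pairwise_judgment_reward(o, gold_label="A") for o in group_outputs]
advantages = grpo_advantages(rewards)
```

Because the preferred response in each synthetic pair is known by construction, the reward requires no human labels at training time, and the group-relative normalization supplies the policy-gradient signal without a separate value model.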