Checklists Are Better Than Reward Models For Aligning Language Models
Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item—using both AI judges and specialized verifier programs—then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction-following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
Today’s models are almost universally trained to follow instructions via a two-step process: instruction finetuning, followed by reinforcement learning from human feedback (RLHF). Instruction finetuning, where the model is trained to mimic responses generated by annotators [Raffel et al., 2019], has historically been the primary workhorse for imbuing language models with some amount of instruction-following ability [Wang et al., 2022, Chung et al., 2022, Xu et al., 2024, Lambert et al., 2024a]. Model developers then frequently employ RLHF, where the model is trained to generate responses that look more like labeled “good” responses than “bad” responses, as a refinement step to decrease the likelihood that the model exhibits predefined poor behaviors (typically harmful behaviors) [Ziegler et al., 2019, Bai et al., 2022]. Unlike “verifiable” tasks where reinforcement learning is a workhorse [DeepSeek-AI et al., 2025, Lambert et al., 2024a, Pyatkin et al., 2025], reinforcement learning remains difficult to utilize for ambiguous or “non-verifiable” tasks, such as instruction following. What would it take to make RL a general-purpose solution for eliciting desirable behaviors in subjective or open-ended settings?
We believe the solution must involve producing better reward signals. Recent work on RL for language model alignment has focused on automatically obtaining feedback on model behavior, either by (1) exclusively using instructions with verifiable answers [Dong et al., 2024, Pyatkin et al., 2025], (2) grading responses with specially-trained reward models [Wang et al., 2024a, Eisenstein et al., 2023], or (3) distilling preferences from a larger prompted model [Bai et al., 2022, Tunstall et al., 2023]. Using instructions with verifiable answers restricts the aspects of response quality that can be learned to exact answer correctness or syntax/format adherence (ignoring other qualities, e.g. topicality or style). Specially-trained reward models are powerful, but their notion of rewards can be arbitrary, leading to reward hacking [Eisenstein et al., 2023]. When distilling preferences from a larger prompted LM, that LM is challenged with deciding what aspects to consider when grading a response, reducing the so-called “generator-verifier gap” that enables RL [Swamy et al., 2025]. Even if multiple prompts are written to capture values of interest, this assumes that a fixed set of criteria can be comprehensive [Bai et al., 2022, Glaese et al., 2022a].
In this paper, we ask: “how can we grade responses to instructions in a manner that is automatic (requires no human annotation), flexible (considers all aspects of response quality), intuitive (aligned with perceptible differences in responses), and applicable to any instruction or response, to enable more effective use of RL in language model alignment?” As an answer, we propose extracting dynamic checklists from instructions – an approach we term Reinforcement Learning from Checklist Feedback (RLCF). This orients evaluation around flexible lists of distinct criteria, while also reducing the problem of grading a response to answering a series of specific yes/no questions, each of which can be answered by an AI judge or by executing a verification program.
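To make this concrete, the following is a minimal sketch of how a checklist-based reward could be computed under simple assumptions: each checklist item is a yes/no question with an importance weight, objective items carry an optional verification program, and per-item scores are combined by a weighted average. The helper names (`ask_ai_judge`, the `verifier` callables) are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of checklist-based reward computation (not the exact RLCF implementation).
# Assumed helpers: ask_ai_judge(question, response) -> score in [0, 1],
#                  item.verifier(response)          -> score in [0, 1].
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ChecklistItem:
    question: str                        # yes/no question extracted from the instruction
    weight: float                        # importance weight (0-100)
    verifier: Optional[Callable] = None  # optional verification program for objective items

def checklist_reward(response: str, checklist: list[ChecklistItem],
                     ask_ai_judge: Callable[[str, str], float]) -> float:
    """Score a response by answering each checklist question, then combine via a weighted average."""
    total, weight_sum = 0.0, 0.0
    for item in checklist:
        if item.verifier is not None:
            score = item.verifier(response)                 # programmatic check (e.g., format, length)
        else:
            score = ask_ai_judge(item.question, response)   # AI judge answers the yes/no question
        total += item.weight * score
        weight_sum += item.weight
    return total / weight_sum if weight_sum > 0 else 0.0
```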
Our key contributions are:
We describe a new and improved algorithm for automatically generating checklists at scale.
We construct WildChecklists, a dataset consisting of 130,000 instructions and corresponding checklists (generated synthetically). When applicable, we accompany items in each checklist with a verification program to facilitate automatic evaluation. We plan to release this dataset to the community as an artifact for future study.
We describe a new algorithm for grading responses according to checklists, using language models and code, and we show how to use this algorithm to rank responses for preference tuning.
Extract checklists per instruction. We examine two methods to extract checklists:
• Direct: We simply prompt an LM to extract a checklist from a given instruction [Cook et al., 2024]. This approach is intuitive and simple but risks merely restating the original instruction as individual criteria, which may limit the comprehensiveness and objectivity of the resulting checklist.
• Candidate-based: We view a requirement as any aspect of an instruction that, when unmet, causes a response to fail. We propose a two-stage approach: produce responses of varying quality, then prompt an LM to write a checklist of all their possible failure modes. For each checklist item, we also prompt the model to generate an “importance” weight (from 0 to 100). A sketch of this extraction procedure is given below.
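As a rough illustration of the candidate-based procedure, this sketch samples candidate responses of varying quality and then asks an LM to enumerate failure modes as weighted yes/no items. The prompt wording, model names, `generate` helper, and JSON output format are assumptions made for exposition, not the released extraction pipeline.

```python
import json

# Hypothetical helper: generate(prompt, model=..., n=..., temperature=...) -> list[str] of completions.
# The prompts and model names below are illustrative, not the ones used in the paper.

def candidate_based_checklist(instruction: str, generate) -> list[dict]:
    # 1) Produce candidate responses of varying quality (e.g., weaker vs. stronger model).
    weak = generate(instruction, model="small-instruct-model", n=2, temperature=1.0)
    strong = generate(instruction, model="large-instruct-model", n=2, temperature=0.7)
    candidates = weak + strong

    # 2) Ask an LM to list every way a response to this instruction could fail,
    #    phrased as yes/no checklist items with importance weights from 0 to 100.
    prompt = (
        "Instruction:\n" + instruction + "\n\n"
        "Candidate responses:\n" + "\n---\n".join(candidates) + "\n\n"
        "List every requirement a response must satisfy, as yes/no questions. "
        'Return JSON: [{"question": ..., "weight": 0-100}, ...]'
    )
    checklist_json = generate(prompt, model="checklist-writer-model", n=1, temperature=0.0)[0]
    return json.loads(checklist_json)
```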
4.2 Baselines
To show that RLCF is more effective than existing approaches, we compare against baselines: instruction finetuning, specially-trained reward models (using either a single reward or a mixture of rewards), and prompted AI judges (using either a single evaluation rubric or a mixture of rubrics). Instruction Finetuning: We compare with instruction finetuning to isolate the benefit of additional knowledge from the manner in which it is given (ground-truth targets versus rewards). Here, we distill [Hinton et al., 2015] from a larger model, Qwen2.5-72B-Instruct, with finetuning performed via LlamaFactory [Zheng et al., 2024].
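For concreteness, one way such a distillation set could be assembled is sketched below: teacher completions are collected and stored as supervised targets (sequence-level distillation). The use of vLLM, the sampling parameters, and the record format are assumptions, not the exact configuration used for this baseline.

```python
# Sketch: build an SFT dataset from teacher completions (sequence-level distillation).
# Assumes vLLM is installed; prompts are passed raw for brevity (a real setup would
# apply the model's chat template before generation).
import json
from vllm import LLM, SamplingParams

teacher = LLM(model="Qwen/Qwen2.5-72B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=1024)

def build_sft_records(instructions: list[str]) -> list[dict]:
    outputs = teacher.generate(instructions, params)
    return [
        {"instruction": inst, "output": out.outputs[0].text}  # teacher response used as the SFT target
        for inst, out in zip(instructions, outputs)
    ]

with open("distill_sft.json", "w") as f:
    json.dump(build_sft_records(["Write a haiku about autumn."]), f, indent=2)
```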
Reward Models: We mirror our training approach for learning from checklist feedback, but use state-of-the-art reward models to decide which response should be chosen or rejected. Here, we keep the 40% of prompts and responses with the greatest difference in scalar rewards. We consider the following reward models as baselines: Skywork/Skywork-Reward-Gemma-2-27B [Liu et al., 2024b] and ArmoRM-Llama3-8B-v0.1 [Wang et al., 2024b]. Both are highly rated on RewardBench [Lambert et al., 2024b], and ArmoRM has been very effective for alignment in prior work [Meng et al., 2024].
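As an illustration of this filtering step, the sketch below scores each prompt's candidate responses with a reward model and keeps the 40% of prompts with the largest best-minus-worst reward margin. The scoring code follows the generic Hugging Face sequence-classification pattern; specific reward models (ArmoRM in particular) may require their own loading and inference code, so treat this as an assumption-laden sketch rather than the baseline's exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Generic reward-model scoring; some reward models need custom code paths.
model_name = "Skywork/Skywork-Reward-Gemma-2-27B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
rm = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def reward(prompt: str, response: str) -> float:
    messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
    with torch.no_grad():
        return rm(inputs).logits[0].item()   # scalar reward for this (prompt, response) pair

def filter_by_margin(data: list[dict], keep_frac: float = 0.4) -> list[dict]:
    # data: [{"prompt": ..., "responses": [...]}, ...]
    for ex in data:
        scores = [reward(ex["prompt"], r) for r in ex["responses"]]
        ex["chosen"] = ex["responses"][scores.index(max(scores))]
        ex["rejected"] = ex["responses"][scores.index(min(scores))]
        ex["margin"] = max(scores) - min(scores)
    # Keep the prompts whose chosen/rejected rewards differ most.
    data = sorted(data, key=lambda ex: ex["margin"], reverse=True)
    return data[: int(keep_frac * len(data))]
```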
Prompted AI Judge: We lastly compare against using the same “teacher” model as a judge, without using checklists. We query this teacher in two settings: 1) “Ultrafeedback”, where the judge rates each candidate response from 1-5 [Cui et al., 2023] separately across four quality aspects (instruction following, helpfulness, truthfulness, honesty) and averages these scores; and 2) “AI Judge”, where a prompt nearly identical to the one used in RLCF (§3) is used to similarly sample 25 scores between 0 and 100 from the judge. This uses the AI judge in the same way as RLCF, just without a checklist. In Figure 3, we unify these methods of automatic evaluation to distinguish our method from prior art. In this context, checklist feedback can be viewed as a very large mixture of prompted evaluators.
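To illustrate the AI Judge setting, the sketch below samples 25 scores in [0, 100] from a judge model at nonzero temperature and averages them. The prompt wording and the `chat` helper are stand-ins rather than the exact judge prompt from §3.

```python
import re
import statistics

# Hypothetical helper: chat(prompt, model=..., temperature=...) -> str completion.

JUDGE_PROMPT = (
    "Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Rate how well the response follows the instruction on a scale from 0 to 100. "
    "Answer with a single integer."
)

def judge_score(instruction: str, response: str, chat, n_samples: int = 25) -> float:
    """Average n_samples sampled judge scores; repeated sampling smooths out single-call noise."""
    scores = []
    for _ in range(n_samples):
        reply = chat(JUDGE_PROMPT.format(instruction=instruction, response=response),
                     model="judge-model", temperature=1.0)
        match = re.search(r"\d+", reply)
        if match:
            scores.append(min(100, int(match.group())))
    return statistics.mean(scores) if scores else 0.0
```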
An example 50-point response to the example question above is "Mushrooms, because they can be easily caramelized and browned, they’re universally beloved by sophisticated palates, and they don’t look cute in the slightest while alive." The statement that they’re universally beloved by people with sophisticated palates, while potentially true, is vague and not objective.