Reward Models
Related topics:
- **A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?** “What to scale” refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference. When applying TTS, researchers typically choose a sp…
- **AI Can Learn Scientific Taste**: Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potent…
- **ARGS: Alignment as Reward-Guided Search**: we introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model’s pr…
- **Adapting LLM Agents with Universal Feedback in Communication**: recent works also focus on how to train LLM agents to use linguistic feedback and non-linguistic reward signals. The linguistic feedback is usually processed as instruction data to do Instruction Fin…
- **Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward**: Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to ext…
- **Auditing language models for hidden objectives**: We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pip…
- **Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty**: When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. T…
- **Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment**: Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligni…
- **Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR**: A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-le…
- **Bridging Offline and Online Reinforcement Learning for LLMs**: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and n…
- **Can LLM be a Personalized Judge?** In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized…
- **Can Large Reasoning Models Self-Train?** Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternat…
- **Chain-of-thought Reasoning Is A Policy Improvement Operator**: “A major challenge that has prevented past efforts of self-learning in language models from succeeding, especially in arithmetic, is a phenomenon that we call error avalanching. During self-training, …
- **Checklists Are Better Than Reward Models For Aligning Language Models**: Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmful…
- **Conversational Graph Grounded Policy Learning for Open-Domain Conversation Generation**: To address the challenge of policy learning in open-domain multi-turn conversation, we propose to represent prior information about dialog transitions as a graph and learn a graph grounded dialog poli…
- **Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains**: However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicin…
- **Deep Think with Confidence**: Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishi…
- **Direct Preference Optimization: Your Language Model is Secretly a Reward Model**: “While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised… (see the DPO sketch after this list)
- **Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models**: Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To…
- **Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining**: Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning mod…
- **Efficient Reinforcement Learning via Large Language Model-based Search**: Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is pronounced if there are stochastic transitions. To improve the sample efficiency, reward shapi…
- **Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation**: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the sta…
- **Escaping the Verifier: Learning to Reason via Demonstrations**: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite off…
- **External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling**: we propose an agent influence framework for RL agents to improve the adaptation efficiency of external models in changing environments without any changes to the agent’s rewards. Our formulation is co…
- **Foundations of Large Language Models**: The main part of BERT models is a multi-layer Transformer network. A Transformer layer consists of a self-attention sub-layer and an FFN sub-layer. Both of them follow the post-norm architecture: outp…
- **GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning**: Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face …
- **Generating Query-Relevant Document Summaries via Reinforcement Learning**: E-commerce search engines often rely solely on product titles as input for ranking models with latency constraints. However, this approach can result in suboptimal relevance predictions, as product ti…
- **Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards**: we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step to obtain respons…
- **Inference-Time Scaling for Generalist Reward Modeling**: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that p…
- **Information-Theoretic Reward Decomposition for Generalizable RLHF**: A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models can lack t…
- **Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data**: In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value …
- **J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning**: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought…
- **Jointly Reinforcing Diversity and Quality in Language Model Generations**: Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also shar…
- **KTO: Model Alignment as Prospect Theoretic Optimization**: For LLMs, alignment methods such as RLHF and DPO have consistently proven to be more beneficial than doing supervised finetuning (SFT) alone. However, human feedback is often discussed only in the con…
- **LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities**: The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effec…
- **LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following**: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and en…
- **Language Model Personalization via Reward Factorization**: Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference m…
- **Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries**: As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. Whil…
- **Learning to Reason without External Rewards**: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We ex…
- **Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning**: Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question. However, it is unclear to what degree the model’s final answer is faithful…
- **Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge**: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et a…
- **Natural Emergent Misalignment From Reward Hacking In Production RL**: We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- **Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling**: The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present OMNI-THINKER, a unified r…
- **Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback**: Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration,…
- **Outcome-based Exploration for LLM Reasoning**: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness …
- **Persona Vectors: Monitoring and Controlling Character Traits in Language Models**: Large language models interact with users through a simulated “Assistant” persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals…
- **Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback**: Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds…
- **Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning**: However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF fram…
- **Post-Completion Learning for Language Models**: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We…
- **Post-Training Large Language Models via Reinforcement Learning from Self-Feedback**: Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF…
- **Pre-Trained Policy Discriminators are General Reward Models**: Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level op…
- **RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization**: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent cap…
- **RLHF Workflow: From Reward Modeling to Online RLHF**: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin…
- **RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards**: This paper introduces RLNVR (Reinforcement Learning from Non-Verified Rewards), a framework for training language models using noisy, real-world feedback signals without requiring explicit human verif…
- **RLPR: Extrapolating RLVR to General Domains without Verifiers**: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical an…
- **RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents**: However, the exploration of RLVR for enhancing dialogue capabilities faces several key obstacles, including the lack of a stable, realistic, and scalable environment for multi-turn conversational rollouts…
- **RM-R1: Reward Modeling as Reasoning**: Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) shoul…
- **ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs**: Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model fi…
- **Reinforcement Learning be Enough for Thinking?** In the context of large language models (LLMs), recent work by Guo et al. proposed a unified model whereby System 2 type “thinking” emerged as a consequence of model-free RL applied to solve mathemati…
- **Reinforcement Pre-Training**: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea…
- **Reinforcing General Reasoning without Verifiers**: The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and ma…
- **Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?** The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI’s o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. W…
- **Reward Reasoning Model**: Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to …
- **Reward-Robust RLHF in LLMs**: As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achievin…
- **RewardBench: Evaluating Reward Models for Language Modeling**: To enhance scientific understanding of reward models, we present REWARDBENCH, a benchmark dataset and code-base for evaluation. The REWARDBENCH dataset is a collection of prompt-chosen-rejected trios … (see the pairwise reward-model sketch after this list)
- **Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment**: it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (RL), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences fur…
- **Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains**: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unamb…
- **SERL: Self-Examining Reinforcement Learning on Open-Domain**: Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivit…
- **Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters**: Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this … (see the best-of-N sketch after this list)
- **Self-Rewarding Language Models**: We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human prefer…
- **Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO**: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard rein…
- **SimPO: Simple Preference Optimization with a Reference-Free Reward**: Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance si…
- **Spurious Rewards: Rethinking Training Signals in RLVR**: We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlat…
- **StepWiser: Stepwise Generative Judges for Wiser Reasoning**: As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Proces…
- **TTRL: Test-Time Reinforcement Learning**: This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during i…
- **Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future**: Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improvi…
- **Test-Time Scaling with Reflective Generative Model**: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning traje…
- **TreeRL: LLM Reinforcement Learning with On-Policy Tree Search**: Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards dur…
- **TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning**: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand in…
- **Using Natural Language for Reward Shaping in Reinforcement Learning**: Using arbitrary natural language statements within reinforcement learning presents several challenges. First, a mapping between language and objects/actions must implicitly or explicitly be learned, a…
- **Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards**: Reinforcement learning with verifiable rewards (RLVR) has facilitated significant advances in large language models (LLMs), particularly for reasoning tasks with objective, ground-truth answers, such …
- **rStar2-Agent: Agentic Reasoning Technical Report**: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognit…
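A common thread across many of the entries above (RewardBench’s prompt-chosen-rejected trios, RLHF Workflow, Reward-Robust RLHF): a scalar reward model is usually trained with the pairwise Bradley-Terry objective, which maximizes the log-probability that the chosen response outranks the rejected one. A minimal PyTorch sketch, assuming the rewards come from a hypothetical scalar head over pooled response embeddings (the embedding step is stubbed with random tensors):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    preferred and dispreferred response for the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a linear head over (hypothetical) pooled response embeddings.
emb_dim = 16
reward_head = torch.nn.Linear(emb_dim, 1)
chosen_emb, rejected_emb = torch.randn(4, emb_dim), torch.randn(4, emb_dim)
loss = bradley_terry_loss(reward_head(chosen_emb).squeeze(-1),
                          reward_head(rejected_emb).squeeze(-1))
loss.backward()  # gradients flow only through the relative reward gap
```

Only the gap between the two rewards is trained, which is why raw reward-model scores are comparable within a prompt but not calibrated across prompts.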
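The DPO entry’s “secretly a reward model” claim has a compact form: the policy’s implicit reward is r(x, y) = β·(log π(y|x) − log π_ref(y|x)), and the DPO loss is just the Bradley-Terry loss above applied to that implicit reward, so no separate reward network is needed. A sketch under the same assumptions, with summed token log-probabilities as inputs and β = 0.1 as a typical value:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen: torch.Tensor, pi_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: Bradley-Terry loss on the implicit reward
    r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)).

    Each input is a batch of per-example summed token log-probabilities
    under the trained policy (pi_*) or the frozen reference model (ref_*).
    """
    chosen_reward = beta * (pi_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (pi_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

SimPO (also listed above) goes one step further: it replaces the log-ratio with the length-normalized log-probability of the response and adds a target margin, dropping the reference model entirely.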
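On the inference side (Scaling LLM Test-Time Compute Optimally, GenPRM, Reward Reasoning Model), the simplest way a reward model converts extra compute into quality is best-of-N sampling: draw N candidates and keep the one the reward model scores highest. A minimal sketch; `sample` and `reward` are hypothetical stand-ins for a sampling-enabled LLM call and a trained reward model:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],         # hypothetical: draws one response
              reward: Callable[[str, str], float],  # hypothetical: RM score for (prompt, response)
              n: int = 16) -> str:
    """Best-of-N selection: sample n candidates, return the one the
    reward model scores highest. Real systems batch both steps."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

ARGS (listed above) pushes the same idea inside decoding, scoring partial continuations step by step rather than ranking whole responses after the fact.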