Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards

Paper · arXiv 2506.00103 · Published May 30, 2025
Argumentation · Reward Models · Reinforcement Learning

Reinforcement learning with verifiable rewards (RLVR) has facilitated significant advances in large language models (LLMs), particularly for reasoning tasks with objective, ground-truth answers, such as math and code generation. However, a substantial gap persists for non-verifiable tasks, such as creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive, externally verifiable references. Existing methodologies for these tasks predominantly rely on scalar reward models trained on human preferences, but these models generalize poorly and are vulnerable to reward hacking, including over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that effectively bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a novel pairwise Generative Reward Model (GenRM) grounded in writing principles and a new Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM applies self-principled critique to transform subjective assessments into robust, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparisons by using bootstrapped responses as temporary references during group rollouts in reinforcement learning (RL) training. Our approach enables LLMs to cultivate advanced writing capabilities without requiring supervised fine-tuning. This is demonstrated by Writing-Zero, which exhibits consistent performance improvements and enhanced resilience to reward hacking compared to scalar reward baselines. In addition, our method achieves competitive results on both proprietary and publicly available writing benchmarks. These results suggest the potential for unifying rule-based, reference-based, and reference-free reward modeling within the RLVR framework, thereby advancing the development of a comprehensive and scalable RL training paradigm with broad applicability across various language tasks.
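The abstract describes BRPO only at a high level. As one way to make the idea concrete, here is a minimal, hypothetical Python sketch of how a bootstrapped, reference-free pairwise reward might be computed for a rollout group: one response is sampled from the group as a temporary reference, and each remaining response earns a binary win/loss reward from a pairwise GenRM comparison against it. The helper `genrm_prefers` and the neutral reward assigned to the sampled reference are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a BRPO-style reward computation (not the paper's code).
# Assumption: genrm_prefers(prompt, a, b) is a hypothetical pairwise GenRM
# call that returns True when response `a` is judged better than `b`
# under the writing principles.
import random
from typing import Callable, List

def brpo_rewards(
    prompt: str,
    group: List[str],
    genrm_prefers: Callable[[str, str, str], bool],
) -> List[float]:
    """Score a rollout group with reference-free pairwise comparisons."""
    # Bootstrap a temporary reference from the group itself.
    ref_idx = random.randrange(len(group))
    reference = group[ref_idx]

    rewards: List[float] = []
    for i, response in enumerate(group):
        if i == ref_idx:
            # Assumed convention: the reference itself gets a neutral reward.
            rewards.append(0.0)
            continue
        # Pairwise GenRM judgment yields a binary, verifiable-style signal.
        win = genrm_prefers(prompt, response, reference)
        rewards.append(1.0 if win else -1.0)
    return rewards
```

In a GRPO-style setup, such win/loss scores would then be normalized into group-relative advantages for the policy update; the actual BRPO algorithm may differ in how references are sampled and rewards are aggregated.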