Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

Paper · arXiv 2506.13351

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers.

Since chain-of-thought (CoT) reasoning acts as a latent prefix that conditions the final answer, we measure the model's token-level certainty of a reference answer under this prefix, capturing how likely the generated CoT reasoning is to yield the desired answer. We observe that the most informative tokens in reference answers, often not lexically distinctive, are those whose self-certainty varies substantially with the CoT, whereas the majority are largely unaffected by reasoning. We refer to these tokens as reasoning-reflective tokens. In a uniform average of self-certainties, however, their contribution is diluted by the low-variance majority. R3 therefore identifies and up-weights these reasoning-reflective tokens by leveraging their cross-rollout variance, producing a more focused reward signal that sharpens the contrast across rollouts in a group.
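The up-weighting described above can be illustrated with a minimal sketch. It is not the paper's exact formulation: the function name `r3_rewards`, the use of per-token probabilities as the certainty measure, and the normalization of weights by total variance are all assumptions made for illustration, and the numbers are invented.

```python
import numpy as np

def r3_rewards(certainty, eps=1e-8):
    """Variance-weighted reward per rollout (illustrative assumption).

    certainty: array of shape (G, T); certainty[i, t] is the model's
    probability of reference token t given rollout i's CoT prefix.
    Returns (rewards of shape (G,), token weights of shape (T,)).
    """
    certainty = np.asarray(certainty, dtype=float)
    var = certainty.var(axis=0)        # cross-rollout variance per reference token
    weights = var / (var.sum() + eps)  # assumed normalization; weights sum to ~1
    return certainty @ weights, weights

# Toy group: 3 rollouts, 4 reference tokens. Only token 2 reacts to the CoT.
certainty = np.array([
    [0.99, 0.98, 0.20, 0.97],
    [0.99, 0.97, 0.80, 0.98],
    [0.98, 0.98, 0.50, 0.97],
])

uniform = certainty.mean(axis=1)            # diluted by the low-variance tokens
weighted, weights = r3_rewards(certainty)   # dominated by the reflective token

print("uniform :", uniform.round(3))
print("weighted:", weighted.round(3))
```

In this toy group the uniform average separates the best and worst rollouts by 0.15, while the variance-weighted score separates them by roughly 0.6, because nearly all of the weight falls on the single token whose certainty depends on the reasoning.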

Token-level dense rewards alone, however, remain vulnerable to reward hacking: without lexical or semantic supervision, an actor model may produce a rollout group whose answers are uniformly low-quality yet still exhibit relative differences under the token-level dense metric, misleading the gradient update. Rubrics provide precisely this supervision, but directly converting rubric judgments into dense rewards is brittle. We therefore use rubrics for gating: a rollout group is accepted or rejected based on essential task criteria, rather than having its rubric judgments converted into dense rewards. While R3 primarily targets the quality of CoT reasoning, rubric-gating provides complementary supervision that ensures final answers satisfy fundamental task constraints.
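A rubric gate of this kind might be sketched as follows. The text does not specify how per-answer judgments are aggregated into a group-level decision, so the `min_pass_fraction` threshold, the function names, and the two example checks are hypothetical and chosen only to show the hard accept/reject shape of the mechanism.

```python
from typing import Callable, Sequence

# A rubric check is a hard yes/no test on a final answer (hypothetical type alias).
RubricCheck = Callable[[str], bool]

def answer_passes(answer: str, checks: Sequence[RubricCheck]) -> bool:
    """An answer passes only if it satisfies every rubric check."""
    return all(check(answer) for check in checks)

def gate_group(answers: Sequence[str],
               checks: Sequence[RubricCheck],
               min_pass_fraction: float = 0.5) -> bool:
    """Accept or reject a whole rollout group.

    Assumption: the group is accepted when at least `min_pass_fraction`
    of its final answers pass all checks. The gate returns a boolean and
    never contributes a reward value.
    """
    if not answers:
        return False
    passed = sum(answer_passes(a, checks) for a in answers)
    return passed / len(answers) >= min_pass_fraction

# Hypothetical rubric: the answer is non-empty and stays under 50 words.
checks = [
    lambda a: len(a.strip()) > 0,
    lambda a: len(a.split()) <= 50,
]

good_group = ["A concise summary of the findings.", "Another short answer."]
bad_group = ["", "   "]

print(gate_group(good_group, checks))
print(gate_group(bad_group, checks))
```

Because the gate only decides whether a group is used, a uniformly poor group is dropped outright instead of producing misleading relative differences under the dense metric.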

We introduce DRO, a constrained RL recipe for unverifiable tasks with reference answers that cleanly separates optimization from feasibility. DRO leverages cross-rollout variance of reference tokens under different reasoning prefixes, both as R3 that up-weights reasoning-reflective tokens and as a query filter that discards groups with insufficient signal; rubrics, in turn, serve as gates on the final answer rather than as rewards. Empirically, across four datasets, DRO outperforms strong baselines, learns up to 2–3× faster, improves rubric compliance without rubric-derived rewards, and trains more stably and sample-efficiently.
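The separation of optimization from feasibility can be made concrete with a short sketch of how one rollout group might be processed. Everything here is an assumption for illustration: the `min_var` threshold, the use of the maximum per-token variance as the filter statistic, and the group-normalized advantage (a GRPO-style choice swapped in because the text does not state which policy-gradient estimator is used).

```python
import numpy as np

def has_enough_signal(certainty, min_var=1e-3):
    """Assumed query filter: keep a query only if at least one reference
    token's certainty varies enough across the rollouts in its group."""
    var = np.asarray(certainty, dtype=float).var(axis=0)
    return float(var.max()) >= min_var

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (GRPO-style; an assumption)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dro_step(certainty, group_passes_rubric, min_var=1e-3):
    """Return per-rollout advantages, or None when the group is skipped.

    certainty: (G, T) probabilities of reference tokens under each CoT prefix.
    group_passes_rubric: boolean outcome of a rubric gate on the final answers.
    """
    certainty = np.asarray(certainty, dtype=float)
    if not has_enough_signal(certainty, min_var):
        return None                    # filtered: nothing to compare across rollouts
    if not group_passes_rubric:
        return None                    # infeasible: rejected by the rubric gate
    var = certainty.var(axis=0)
    weights = var / var.sum()          # var.sum() > 0 is guaranteed by the filter
    rewards = certainty @ weights      # variance-weighted dense reward per rollout
    return group_advantages(rewards)

informative = np.array([[0.99, 0.20], [0.99, 0.80], [0.98, 0.50]])
flat = np.full((3, 2), 0.9)

print(dro_step(informative, group_passes_rubric=True))
print(dro_step(flat, group_passes_rubric=True))
print(dro_step(informative, group_passes_rubric=False))
```

Note that the rubric outcome enters only as a boolean that decides whether the group is used; the advantages themselves are computed from the variance-weighted reward alone.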