Reinforcement Learning for LLMs

Can agents learn to reason better without just chasing rewards?

Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

Outcome-only RL (e.g., GRPO) for agentic tasks reinforces any successful trajectory, including those built on flawed, redundant, or illogical reasoning. Empirically, this shows up as a 31.2% repetitive-action rate on hard tasks, agents persistently re-attempting actions at locations they have already reached, and policies that mirror the training action distribution rather than reason about task requirements. The agent achieves but does not understand.
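As a concrete illustration of that failure mode, here is one way a repetitive-action rate could be computed over an episode. The exact definition behind the 31.2% figure is not given in the note, so the "repeat = identical action string seen earlier" rule below is an assumption:

```python
def repetitive_action_rate(trajectory: list[str]) -> float:
    """Fraction of steps that exactly repeat an earlier action in the episode.

    Assumed definition: an action is repetitive if the identical action
    string was already issued at any prior step. The source reports 31.2%
    on hard tasks under outcome-only RL; its precise metric may differ.
    """
    if not trajectory:
        return 0.0
    seen: set[str] = set()
    repeats = 0
    for action in trajectory:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(trajectory)

# Example: an agent that keeps returning to a location it already reached.
print(repetitive_action_rate(
    ["go to desk 1", "open drawer 1", "go to desk 1", "go to desk 1"]
))  # 0.5
```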

RLVMR (Reinforcement Learning with Verifiable Meta-Reasoning Rewards) addresses this by operationalizing metacognitive theory as verifiable process rewards. Four meta-reasoning tags (planning, exploration, reflection, monitoring) are introduced as structured cognitive labels, and each receives a programmatic reward tied to observable outcomes rather than to a learned judge; a sketch of what such checks might look like follows.
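The note does not spell out the per-tag rules, so the following is a minimal sketch of plausible verifiable checks; the `Step` record, the concrete rules, and the reward magnitudes are all illustrative assumptions, not the paper's exact design:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """Hypothetical record of one tagged agent turn (field names assumed)."""
    tag: str                      # "planning" | "exploration" | "reflection" | "monitoring"
    next_action: str
    prev_action: str = ""
    prev_action_failed: bool = False
    resulting_state: str = ""
    plan_actions: tuple[str, ...] = ()
    visited_states: frozenset = frozenset()
    state_action_history: frozenset = frozenset()  # of (state, action) pairs

def meta_reasoning_reward(step: Step) -> float:
    """Programmatic per-tag reward: each rule checks the tagged reasoning
    against observable episode state, with no learned judge involved."""
    if step.tag == "planning":
        # A plan is rewarded only if the next action actually follows it.
        return 0.1 if step.next_action in step.plan_actions else 0.0
    if step.tag == "exploration":
        # Exploration is rewarded only if a genuinely new state is reached.
        return 0.1 if step.resulting_state not in step.visited_states else 0.0
    if step.tag == "reflection":
        # Reflection counts only after a failure, and only if the agent
        # then tries something different from the failed action.
        ok = step.prev_action_failed and step.next_action != step.prev_action
        return 0.1 if ok else 0.0
    if step.tag == "monitoring":
        # Monitoring penalizes re-issuing a (state, action) pair.
        key = (step.resulting_state, step.next_action)
        return -0.1 if key in step.state_action_history else 0.0
    return 0.0

# Example: reflection after a failed action that the agent then changes.
s = Step(tag="reflection", next_action="open drawer 2",
         prev_action="open drawer 1", prev_action_failed=True)
print(meta_reasoning_reward(s))  # 0.1
```

The point of this shape is that every reward is decided by a cheap, deterministic comparison against episode state, so reasoning quality can be graded densely without a learned reward model.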

The cold start requires only 200 SFT trajectories annotated by a teacher model with the tag syntax. After that, the agent trains entirely through environmental interaction with dense process rewards combined with sparse outcome rewards.
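A plausible way to combine the two signals is a weighted sum over the episode; the weights below are assumptions, since the note does not specify the actual balance:

```python
def trajectory_return(process_rewards: list[float], task_succeeded: bool,
                      outcome_weight: float = 1.0,
                      process_weight: float = 0.3) -> float:
    """Sparse outcome reward plus dense per-step process rewards.

    Weights are illustrative assumptions. The outcome term fires once per
    episode; process terms accrue at every tagged step.
    """
    outcome = outcome_weight * (1.0 if task_succeeded else 0.0)
    process = process_weight * sum(process_rewards)
    return outcome + process
```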

Relative to the question "Can AI systems improve their own learning strategies?", RLVMR provides a partial answer: the metacognitive categories are still human-designed, but the specific behaviors within each category are learned through RL interaction. The framework sits between fixed metacognitive scaffolds and fully autonomous self-monitoring.

A related metacognitive capability emerges from proactive critical thinking training. As with the question "Can models learn to ask clarifying questions instead of guessing?", both RLVMR and proactive critical thinking operationalize metacognition as trainable RL objectives. RLVMR's "monitoring" and "reflection" tags teach the agent to track its own reasoning quality during task execution; proactive critical thinking teaches the model to detect when a problem is ill-posed before attempting to solve it. Both address the gap between achieving outcomes and demonstrating genuine reasoning awareness, and both show near-zero capability at baseline that RL training dramatically improves.

The SFT/GRPO contrast is instructive: SFT creates efficient but brittle policies (success drops from 63.3% to 37.5% on unseen tasks), while GRPO achieves better generalization (52.3% on hard unseen) but with severely inefficient reasoning. RLVMR targets the gap — maintaining GRPO's generalization while reducing the reasoning inefficiency.


Source: RLVR


meta-reasoning rewards for agentic rl operationalize metacognition as verifiable process supervision — separating reasoning quality from outcome success