Reinforcement Learning for LLMs

Can agents learn to reason better without just chasing rewards?

Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

Outcome-only RL (e.g., GRPO) for agentic tasks reinforces any successful trajectory, including those built on flawed, redundant, or illogical reasoning. Empirically, this shows up as a 31.2% repetitive-action rate on hard tasks, agents persistently re-attempting actions at locations they have already reached, and policies that mirror the training action distribution rather than reason about task requirements. The agent achieves but does not understand.
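As a concrete illustration of that failure mode, here is one way a repetitive-action rate could be computed over an episode. The exact definition behind the 31.2% figure is not given in the note, so the "repeat = identical action string seen earlier" rule below is an assumption:

```python
def repetitive_action_rate(trajectory: list[str]) -> float:
    """Fraction of steps that exactly repeat an earlier action in the episode.

    Assumed definition: an action is repetitive if the identical action
    string was already issued at any prior step. The source reports 31.2%
    on hard tasks under outcome-only RL; its precise metric may differ.
    """
    if not trajectory:
        return 0.0
    seen: set[str] = set()
    repeats = 0
    for action in trajectory:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(trajectory)

# Example: an agent that keeps returning to a location it already reached.
print(repetitive_action_rate(
    ["go to desk 1", "open drawer 1", "go to desk 1", "go to desk 1"]
))  # 0.5
```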

RLVMR (Reinforcement Learning with Verifiable Meta-Reasoning Rewards) addresses this by operationalizing metacognitive theory as verifiable process rewards. Four meta-reasoning tags (planning, exploration, reflection, monitoring) are introduced as structured cognitive labels, and each receives a programmatic reward tied to observable outcomes rather than to a learned judge; a sketch of what such checks might look like follows.
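The note does not spell out the per-tag rules, so the following is a minimal sketch of plausible verifiable checks; the `Step` record, the concrete rules, and the reward magnitudes are all illustrative assumptions, not the paper's exact design:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """Hypothetical record of one tagged agent turn (field names assumed)."""
    tag: str                      # "planning" | "exploration" | "reflection" | "monitoring"
    next_action: str
    prev_action: str = ""
    prev_action_failed: bool = False
    resulting_state: str = ""
    plan_actions: tuple[str, ...] = ()
    visited_states: frozenset = frozenset()
    state_action_history: frozenset = frozenset()  # of (state, action) pairs

def meta_reasoning_reward(step: Step) -> float:
    """Programmatic per-tag reward: each rule checks the tagged reasoning
    against observable episode state, with no learned judge involved."""
    if step.tag == "planning":
        # A plan is rewarded only if the next action actually follows it.
        return 0.1 if step.next_action in step.plan_actions else 0.0
    if step.tag == "exploration":
        # Exploration is rewarded only if a genuinely new state is reached.
        return 0.1 if step.resulting_state not in step.visited_states else 0.0
    if step.tag == "reflection":
        # Reflection counts only after a failure, and only if the agent
        # then tries something different from the failed action.
        ok = step.prev_action_failed and step.next_action != step.prev_action
        return 0.1 if ok else 0.0
    if step.tag == "monitoring":
        # Monitoring penalizes re-issuing a (state, action) pair.
        key = (step.resulting_state, step.next_action)
        return -0.1 if key in step.state_action_history else 0.0
    return 0.0

# Example: reflection after a failed action that the agent then changes.
s = Step(tag="reflection", next_action="open drawer 2",
         prev_action="open drawer 1", prev_action_failed=True)
print(meta_reasoning_reward(s))  # 0.1
```

The point of this shape is that every reward is decided by a cheap, deterministic comparison against episode state, so reasoning quality can be graded densely without a learned reward model.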

The cold start requires only 200 SFT trajectories annotated by a teacher model with the tag syntax. After that, the agent trains entirely through environmental interaction with dense process rewards combined with sparse outcome rewards.
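A plausible way to combine the two signals is a weighted sum over the episode; the weights below are assumptions, since the note does not specify the actual balance:

```python
def trajectory_return(process_rewards: list[float], task_succeeded: bool,
                      outcome_weight: float = 1.0,
                      process_weight: float = 0.3) -> float:
    """Sparse outcome reward plus dense per-step process rewards.

    Weights are illustrative assumptions. The outcome term fires once per
    episode; process terms accrue at every tagged step.
    """
    outcome = outcome_weight * (1.0 if task_succeeded else 0.0)
    process = process_weight * sum(process_rewards)
    return outcome + process
```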

Relative to the question "Can AI systems improve their own learning strategies?", RLVMR provides a partial answer: the metacognitive categories are still human-designed, but the specific behaviors within each category are learned through RL interaction. The framework sits between fixed metacognitive scaffolds and fully autonomous self-monitoring.

A related metacognitive capability emerges from proactive critical thinking training. As with the question "Can models learn to ask clarifying questions instead of guessing?", both RLVMR and proactive critical thinking operationalize metacognition as trainable RL objectives. RLVMR's "monitoring" and "reflection" tags teach the agent to track its own reasoning quality during task execution; proactive critical thinking teaches the model to detect when a problem is ill-posed before attempting to solve it. Both address the gap between achieving outcomes and demonstrating genuine reasoning awareness, and both show near-zero capability at baseline that RL training dramatically improves.

The SFT/GRPO contrast is instructive: SFT creates efficient but brittle policies (success drops from 63.3% to 37.5% on unseen tasks), while GRPO achieves better generalization (52.3% on hard unseen) but with severely inefficient reasoning. RLVMR targets the gap — maintaining GRPO's generalization while reducing the reasoning inefficiency.


Source: RLVR


meta-reasoning rewards for agentic rl operationalize metacognition as verifiable process supervision — separating reasoning quality from outcome success