RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Paper · arXiv 2507.22844 · Published July 30, 2025

The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps—such as planning, exploration, and reflection—and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method.

The quest to build autonomous agents capable of solving complex, long-horizon tasks has gained significant momentum with the rise of Large Language Models (LLMs) (Zeng et al., 2024; Wang et al., 2022; Bai et al., 2024). However, dominant training paradigms face a fundamental trade-off. On one hand, Supervised Fine-Tuning (SFT) on expert trajectories can teach agents efficient behaviors, but these policies are often brittle and fail to generalize to novel situations (Chu et al., 2025). On the other hand, Reinforcement Learning (RL) from environmental feedback encourages exploration and can lead to better generalization, but it typically optimizes for a single, sparse reward signal: final task success.

This reliance on outcome-only rewards raises a critical, yet underexplored question: Are agents learning to reason coherently, or are they just finding brittle shortcuts to success? Our work investigates a pervasive issue we term inefficient exploration, where agents are rewarded for successful outcomes even when their path to success is built on flawed, illogical, or redundant reasoning. As illustrated in Figure 1, this leads to agents that exhibit high rates of repetitive actions and struggle to adapt to unseen tasks, because their underlying problem-solving process is unsound. Standard RL inadvertently reinforces any successful trajectory, failing to distinguish between robust and flawed reasoning processes. This deficiency undermines agent reliability, interpretability, and generalization, especially as tasks grow in complexity.

We argue that to build truly robust and generalizable agents, we must move beyond rewarding only the final outcome and begin to supervise the reasoning process itself. Drawing inspiration from metacognitive theory (Martinez, 2006), which posits that effective problem-solving depends on “thinking about thinking”, we propose to directly reward beneficial cognitive behaviors. Our key insight is that high-level skills like planning, monitoring progress, exploring alternatives, and reflecting on errors can be operationalized as distinct, verifiable steps within an agent’s reasoning process.

To this end, we introduce Reinforcement Learning with Verifiable Meta-Reasoning Rewards (RLVMR), a novel framework that integrates dense, process-level supervision into end-to-end RL. As illustrated in Figure 2, RLVMR contrasts with standard RL by rewarding not only the final outcome but also the intermediate reasoning steps. Our framework defines a set of core meta-reasoning behaviors — planning, exploration, and reflection/monitoring — and enables the agent to articulate its cognitive state through special tags. During online interaction, we use lightweight, programmatic rules to grant verifiable rewards for these behaviors. For example, an ‘exploration’ tag is rewarded when the agent discovers a new state, while a ‘reflection’ tag is rewarded when it leads to the correction of a prior mistake. These process-centric rewards are combined with the global outcome reward and optimized using a policy gradient method. After a brief “cold-start” supervised fine-tuning (SFT) phase on only 200 trajectories to learn the tag syntax, the agent is trained entirely through environmental interaction.
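To make the reward composition concrete, the following is a minimal sketch (in Python) of how step-level meta-reasoning rewards might be mixed with the sparse outcome reward and fed into a critic-free update. The function names, the mixing coefficient, and the REINFORCE-style return-to-go are illustrative assumptions standing in for the paper's actual critic-free policy-gradient method, not its implementation.

# Illustrative sketch (not the authors' code): mix per-step meta-reasoning
# rewards with the sparse outcome reward, then apply a critic-free,
# REINFORCE-style update. The mixing weight is an assumed hyperparameter.

def combine_rewards(step_meta_rewards, outcome_reward, meta_weight=0.5):
    """step_meta_rewards: one r^MR_t per step; outcome_reward: R(tau), granted at episode end."""
    rewards = [meta_weight * r for r in step_meta_rewards]
    rewards[-1] += outcome_reward  # credit the final step with task success
    return rewards

def policy_gradient_loss(log_probs, rewards):
    """Weight each step's log-probability by its return-to-go; no value network is used."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    returns.reverse()
    return -sum(lp * ret for lp, ret in zip(log_probs, returns))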

We observe that, although various methods (e.g., GRPO) can improve an agent’s success rate on specific tasks, such improvement is often due to the reinforcement of state-action mappings associated with correct reasoning, while the agent’s self-reflection and understanding of its own reasoning process are frequently overlooked. Consider a trajectory segment of a vanilla GRPO-trained agent performing a novel task, “put two keychains in the safe”, in ALFWorld (corresponding to our L2 split). By step 7, the agent has already arrived at dresser 1, yet in the subsequent steps it falls into a sequence of inefficient decisions: its next intention is to find the second keychain, but it persistently tries to go to dresser 1 for several steps, disregarding the fact that it is already there. This indicates that its policy mainly reflects the action distribution present in the training data rather than letting the reasoning process itself regulate decision-making. Although the agent can form relatively effective action strategies for completing tasks, its capacity to critically evaluate its own behavior and understand the underlying task requirements remains limited. This suggests that the agent has not truly acquired the reasoning patterns necessary for robust task-solving.

2.3 The Problem of Inefficient Exploration

We now present large-scale empirical results that corroborate the anecdotal evidence above. Figure 3 compares SFT and GRPO across success, invalid-action, and repetitive-action metrics.

SFT creates efficient but brittle policies that fail to generalize. As seen, SFT significantly boosts performance on seen tasks (L0) compared to the ReAct baseline. For instance, the 7B model’s success rate jumps from 23.1% to 63.3%. This approach also yields highly efficient policies with low invalid action rates (e.g., 6.2% on L0 for the 7B model). However, this efficiency is brittle. On the most challenging unseen split (L2), the 7B model’s success rate plummets to 37.5%. Furthermore, its repetitive action rate nearly doubles from 13.9% on L0 to 24.5% on L1, revealing a critical flaw: when faced with novel situations not covered by expert data, the agent falls back on non-productive loops. This demonstrates that SFT teaches agents to mimic actions without instilling a robust, generalizable reasoning process.

RL with outcome-only rewards (GRPO) improves generalization but fosters inefficient and flawed reasoning. In contrast, GRPO achieves substantially better generalization. The 7B GRPO model attains success rates of 77.3% on L1 and 52.3% on L2, significantly outperforming SFT. This success, however, validates our core hypothesis about the inefficient exploration problem. The agent’s performance is undermined by severe inefficiency, as evidenced by high invalid and repetitive action rates across all splits. For example, the 7B model’s repetitive action rate on the most difficult L2 tasks is a staggering 31.2%. By optimizing solely for final task success, GRPO reinforces any path that leads to a positive outcome, even those built on illogical steps, redundant actions, and inefficient exploration.

3.1 Task Formulation as a Markov Decision Process

We formalize the interaction between an agent and its environment in long-horizon tasks as a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, O, F, R), where S is the set of environment states, A is the action space, O is the observation space, F : S × A → S is the state transition function, and R : S × A → ℝ is the reward function. In our setting, which is tailored for LLM agents, the state, action, and observation spaces (S, A, O) are all represented as natural language sequences over a finite token vocabulary.

At each timestep t, the agent’s policy π_θ generates a thought process th_t and an action a_t based on the current state s_t: (th_t, a_t) ∼ π_θ(· | s_t). The agent’s interaction with the environment produces a trajectory τ = {(o_1, th_1, a_1), (o_2, th_2, a_2), . . . , (o_n, th_n, a_n)}. In many long-horizon tasks, reward signals are sparse, typically provided only as a final outcome reward R(τ) at the end of an episode. This sparsity poses significant challenges for credit assignment. The agent’s objective is to learn an optimal policy π_θ that maximizes the expected cumulative reward, J(θ) = E_{τ∼π_θ}[R(τ)].
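As a concrete illustration of this formulation, the sketch below shows the interaction loop that yields such a trajectory and its sparse terminal reward. The env and policy objects, and interfaces such as env.step returning a done flag and env.final_reward(), are hypothetical placeholders rather than an actual ALFWorld API.

def rollout(env, policy, max_steps=50):
    """Collect tau = [(o_1, th_1, a_1), ..., (o_n, th_n, a_n)] plus the sparse R(tau)."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_steps):
        thought, action = policy(observation)   # (th_t, a_t) ~ pi_theta(. | s_t)
        trajectory.append((observation, thought, action))
        observation, done = env.step(action)
        if done:
            break
    return trajectory, env.final_reward()       # reward arrives only at episode end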

3.2 Operationalizing Meta-Reasoning in LLM Agents

Our approach is grounded in metacognitive theory (Martinez, 2006; Lai, 2011), which emphasizes “thinking about thinking”. Metacognition comprises two key components: metacognitive knowledge (an agent’s self-awareness of its own reasoning strategies) and metacognitive regulation (the active control of these processes, including planning, monitoring, and adaptive revision). This theoretical lens suggests that for LLM agents to solve complex tasks, they require not just domain knowledge but also the capacity for dynamic planning, self-monitoring, and creative exploration.

To operationalize these principles, we extend the ReAct framework. While ReAct interleaves reasoning and actions (e.g., “Think: ..., Act: ...”), it treats reasoning as a monolithic process. We refine this by introducing a structured set of meta-reasoning tags to explicitly represent distinct cognitive functions. This decouples reasoning from actions and enables fine-grained analysis and supervision. Specifically, we define four meta-reasoning tags, each enclosed in XML-style markup (e.g., <planning>), while all actions are contained within the <action> tag; a short parsing sketch follows the list below.

• Planning (<planning>): Decomposes the task into high-level steps to formulate an overall strategy. Used at the start of a task or when replanning is needed.

• Exploration (<explore>): Generates hypotheses or options to navigate uncertainty or bottlenecks, encouraging creative problem-solving.

• Reflection (<reflection>): Reviews history to analyze errors and formulate corrective actions. Typically triggered after unsuccessful attempts.

• Monitoring (<monitor>): Tracks task progress against the overall plan, ensuring actions remain aligned with subgoals. Applied during routine execution.
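The sketch below shows one way the tagged output could be parsed during rollouts; the tag names follow the definitions above, while the parsing code itself is an illustrative assumption.

import re

META_TAGS = ("planning", "explore", "reflection", "monitor")

def parse_step(response: str):
    """Split one agent turn into (meta_tag, reasoning_text, action_text)."""
    action_match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    action = action_match.group(1).strip() if action_match else None
    for tag in META_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            return tag, match.group(1).strip(), action
    return None, None, action  # malformed turn: no meta-reasoning tag emitted

# Example:
# parse_step("<planning>Locate both keychains, then open the safe.</planning>"
#            "<action>go to dresser 1</action>")
# -> ("planning", "Locate both keychains, then open the safe.", "go to dresser 1")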

3.3 Cold Start: Initial Meta-Reasoning Acquisition via SFT

To equip the base LLM with the foundational ability to generate structured meta-reasoning, we begin with a supervised fine-tuning phase. This step is crucial, as the reasoning patterns learned during subsequent reinforcement learning are heavily influenced by the base model’s capabilities. The SFT data is constructed as follows (a small pipeline sketch follows the list):

  1. We collect a dataset of successful task trajectories containing only observation-action pairs.

  2. We employ a more powerful teacher model (e.g., GPT-4) to annotate these trajectories with our meta-reasoning tags, inferring the most likely cognitive step preceding each action. This process creates synthetic, reasoning-rich expert demonstrations.

  3. The target LLM is fine-tuned on these annotated trajectories, learning to imitate the expert’s meta-reasoning and action generation patterns.
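A minimal sketch of this annotation pipeline is given below. The teacher_annotate helper stands in for a query to the stronger teacher model, and the (prompt, completion) record format is an assumption; only the tag syntax comes from the paper.

def teacher_annotate(observation, action):
    # Placeholder for a call to the teacher model (e.g., GPT-4) that infers the most
    # likely cognitive step; a fixed tag is returned here so the sketch runs offline.
    return "monitor", f"Continue the current plan by taking '{action}'."

def build_sft_examples(trajectory):
    """trajectory: list of (observation, action) pairs from one successful episode."""
    examples = []
    for observation, action in trajectory:
        tag, reasoning = teacher_annotate(observation, action)
        completion = f"<{tag}>{reasoning}</{tag}>\n<action>{action}</action>"
        examples.append({"prompt": observation, "completion": completion})
    return examples  # standard supervised fine-tuning pairs with meta-reasoning targets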

3.4.1 Meta-Reasoning-Aware Reward Shaping

During reinforcement learning, we guide the agent with a composite reward signal that combines task completion with the quality of the reasoning process. This signal comprises a sparse outcome reward and a dense, process-based meta-reasoning reward.

Meta-Reasoning Reward (r^MR_t): A dense reward assigned at each step t to incentivize locally beneficial behaviors. A rule-based sketch of these checks follows the list below.

• Planning Reward (r_planning): Awarded for a <planning> step if the trajectory ultimately succeeds.

• Exploration Reward (r_explore): Awarded if the current action targets a new object or location, discouraging redundancy.

• Reflection Reward (r_reflection): Awarded if a <reflection> step is followed by a corrective action after a sequence of failures.
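A rule-based sketch of these checks is shown below. The reward magnitudes and the bookkeeping passed in (visited targets, failure counts, whether a correction occurred) are illustrative assumptions; only the three rules themselves follow the descriptions above.

def meta_reasoning_reward(tag, action_target, visited, recent_failures, corrected, success):
    """Return r^MR_t for a single step (magnitudes are placeholder values)."""
    if tag == "planning" and success:
        return 1.0   # r_planning: credited only when the trajectory ultimately succeeds
    if tag == "explore" and action_target not in visited:
        return 0.5   # r_explore: the action reaches a new object or location
    if tag == "reflection" and recent_failures > 0 and corrected:
        return 0.5   # r_reflection: reflection is followed by a corrective action
    return 0.0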

5 Related Work

LLM Reinforcement Learning. Reinforcement learning (RL) has been instrumental in aligning large language models (LLMs) with human preferences. Prominent examples include Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) and Direct Preference Optimization (DPO) (Rafailov et al., 2023). Beyond alignment, recent work has also leveraged RL to enhance other crucial LLM capabilities, such as reasoning (Hu et al., 2025; Muennighoff et al., 2025) and emotional intelligence (Wang et al., 2025a). Recently, group-based RL algorithms have emerged as a promising alternative, with methods like GRPO (Feng et al., 2025a), Dr.GRPO (Liu et al., 2025), and DAPO (Yu et al., 2025) estimating advantages from batches of samples generated for the same prompt. In contrast to actor-critic methods like PPO (Schulman et al., 2017), this form of advantage estimation does not require an additional critic model, making large-scale RL training for LLMs more computationally efficient and practical. These approaches have demonstrated significant effectiveness in tasks such as mathematical reasoning, search, and tool use (Yu et al., 2025; Hu et al., 2025). However, applying these RL methods to multi-turn, long-horizon tasks remains a significant challenge, primarily due to sparse and delayed rewards (Wang et al., 2025b), which is the focus of our work.
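For readers unfamiliar with group-based advantage estimation, the sketch below illustrates the general idea behind GRPO-style methods: returns from several rollouts of the same prompt are normalized against the group's own statistics, removing the need for a critic network. Exact details vary across GRPO, Dr.GRPO, and DAPO; this shows only the common core.

import statistics

def group_relative_advantages(returns, eps=1e-6):
    """returns: episode returns from a batch of rollouts sharing one prompt."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

# Example: group_relative_advantages([1.0, 0.0, 0.0, 1.0]) -> roughly [1.0, -1.0, -1.0, 1.0]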