OpenClaw-RL: Train Any Agent Simply by Talking

Paper · arXiv 2603.10165 · Published March 10, 2026

Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal/GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards by a process reward model (PRM) judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards.

Waste 1 — Evaluative signals. The next-state signal implicitly scores the preceding action: a user re-query signals dissatisfaction, a passing test signals success, and an error trace signals failure. This forms a natural process reward and requires no separate annotation pipeline, yet PRMs have been studied almost exclusively in mathematical reasoning with verifiable ground truth (Cui et al., 2025b; Lightman et al., 2023; Wang et al., 2024). In personal agents, it captures user satisfaction turn by turn. In general agents, it provides the dense per-step credit assignment that long-horizon tasks require (Wang et al., 2026). Existing systems either ignore this signal or exploit it only in offline, pre-collected form, relying on fixed datasets or terminal outcome rewards.
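To make the evaluative pathway concrete, the sketch below shows one way an (action, next-state) pair could be mapped to a binary process reward with a PRM judge. The judge interface, prompt wording, and function names are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable

# Illustrative judge prompt: the PRM reads the action and the next-state
# signal that followed it (user reply, tool output, test verdict, error
# trace) and emits a one-word verdict.
JUDGE_PROMPT = """You are a process reward judge.
Given an agent action and the next state that followed it, answer with a
single word: GOOD if the next state indicates the action succeeded or
satisfied the user, BAD otherwise.

Action:
{action}

Next state:
{next_state}

Verdict:"""


def next_state_process_reward(
    action: str,
    next_state: str,
    judge: Callable[[str], str],  # any text-in / text-out PRM judge
) -> float:
    """Map an (action, next-state) pair to a binary process reward."""
    verdict = judge(JUDGE_PROMPT.format(action=action, next_state=next_state))
    return 1.0 if "GOOD" in verdict.upper() else 0.0


if __name__ == "__main__":
    # Stand-in judge: a real system would call the PRM model here.
    dummy_judge = lambda prompt: "BAD"  # e.g. the user re-queried in frustration
    print(next_state_process_reward(
        action="Here is the Python script you asked for.",
        next_state="That's not what I meant, I wanted a shell script.",
        judge=dummy_judge,
    ))  # -> 0.0
```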

Waste 2 — Directive signals. Beyond scoring, next-state signals often carry directive information: a user who says “you should have checked the file first” specifies not only that the response was wrong, but also how it should change at the token level. Likewise, a detailed SWE error trace often implies a concrete correction direction. Current methods based on reinforcement learning with verifiable rewards (RLVR) use scalar rewards and thus cannot convert such information into a directional policy gradient (Guo et al., 2025; Hu et al., 2025; Shao et al., 2024; Yu et al., 2025a), while distillation methods (Hübotter et al., 2026; Shenfeld et al., 2026) rely on pre-curated feedback-response pairs rather than live signals. Hindsight relabeling (Hübotter et al., 2026; Zhang et al., 2023) and context-enriched distillation (Yang et al., 2024b, 2025c) show that adding structured correction information to the context can substantially improve outputs, but these methods all operate on fixed datasets. In concurrent work, Buening et al. (2026) improve the online policy by directly prompting with next-state information, but the corrective hints remain implicit.
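The directive pathway can be pictured as follows: if the teacher sees an enhanced context containing the hindsight hint extracted from the next state while the student does not, a per-token distillation loss over the response tokens turns the textual hint into a directional gradient. The tensor shapes, function names, and choice of reverse KL below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def hindsight_opd_loss(
    student_logits: torch.Tensor,   # [batch, seq, vocab], context without the hint
    teacher_logits: torch.Tensor,   # [batch, seq, vocab], context + hindsight hint
    response_mask: torch.Tensor,    # [batch, seq], 1 on response tokens
) -> torch.Tensor:
    """Per-token distillation restricted to response tokens.

    The teacher scores the student's own response while also seeing the
    textual hint recovered from the next state, so every response token
    receives a directional training signal rather than one scalar reward.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1).detach()
    # Reverse KL(student || teacher), summed over the vocabulary per token.
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    return (per_token_kl * response_mask).sum() / response_mask.sum().clamp(min=1)


if __name__ == "__main__":
    b, t, v = 2, 16, 128
    loss = hindsight_opd_loss(
        torch.randn(b, t, v, requires_grad=True),
        torch.randn(b, t, v),
        torch.ones(b, t),
    )
    loss.backward()
    print(float(loss))
```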

OpenClaw-RL. We present OpenClaw-RL, a unified framework that recovers both forms of next-state signal waste for personal and general-purpose agents across diverse settings, including personal conversations with OpenClaw (OpenClaw, 2026) and terminal, GUI, SWE, and tool-call environments. OpenClaw-RL is a fully decoupled asynchronous architecture built on slime (Zhu et al., 2025), where policy serving, rollout collection, PRM judging, and policy training run as four independent loops with no blocking dependencies. In the personal-agent setting, the model can therefore be optimized automatically through normal usage. This extends existing RL infrastructure, which typically assumes batch data collection rather than continuous learning from live deployment. We provide two optimization options. First, binary RL uses a PRM to convert conversational next-state signals into scalar process rewards. Second, our Hindsight-Guided On-Policy Distillation (OPD) extracts textual hints from the next state, constructs an enhanced teacher context, and distills token-level directional supervision back into the student, providing training signals unavailable from scalar rewards alone. In simulation experiments, we find that combining the two methods with a weighted loss yields significant gains. Our framework also extends to RL training for general agents, including terminal, GUI, SWE, and tool-call settings. We integrate PRM judging with verifiable outcomes to provide supervision that is both dense and reliable (Wang et al., 2026; Zou et al., 2025). We further improve scalability by allowing environments to be hosted at scale on cloud services.
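As a rough picture of the decoupled design, the four loops can be thought of as independent coroutines that communicate only through queues, so serving never blocks on judging or training. The skeleton below is an illustrative asyncio schematic with made-up queue and function names, not the slime-based implementation described in the paper.

```python
import asyncio


async def serve(turns_q: asyncio.Queue):
    """Policy serving: handle live requests and emit each completed turn."""
    turn = 0
    while True:
        await asyncio.sleep(0.1)  # stand-in for generating a response to a live request
        await turns_q.put({"turn": turn, "action": f"a{turn}", "next_state": f"s{turn + 1}"})
        turn += 1


async def collect_rollouts(turns_q, rollout_q):
    """Rollout collection: group turns into trajectories (session-aware in a real system)."""
    while True:
        await rollout_q.put([await turns_q.get()])


async def prm_judge(rollout_q, judged_q):
    """PRM judging: score trajectories as they arrive, independently of serving and training."""
    while True:
        traj = await rollout_q.get()
        for step in traj:
            step["reward"] = 1.0  # stand-in for a PRM judge call
        await judged_q.put(traj)


async def train(judged_q):
    """Policy training: update on judged trajectories; weights are synced back gracefully."""
    batch = []
    while True:
        batch.append(await judged_q.get())
        if len(batch) >= 4:
            print("policy update on turns", [t[0]["turn"] for t in batch])
            batch.clear()


async def main():
    turns_q, rollout_q, judged_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        serve(turns_q),
        collect_rollouts(turns_q, rollout_q),
        prm_judge(rollout_q, judged_q),
        train(judged_q),
    )


if __name__ == "__main__":
    asyncio.run(main())  # runs until interrupted
```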

Contributions.

• Next-state signal as a live, online learning source. We identify that next-state signals, whether user replies, execution results, test verdicts, or GUI transitions, encode both evaluative and directive information about the preceding action. We recover these signals as a live, online training source across heterogeneous interaction types.

• OpenClaw-RL Infrastructure. The first system to unify multiple concurrent interaction streams, including personal conversations, terminal, GUI, SWE, and tool-call agentic settings. It is designed for zero interruption to serving, with session-aware multi-turn tracking, graceful weight updates, flexible PRM support, and large-scale environment parallelization.

• Two complementary next-state signal recovery methods. Binary RL via PRM converts evaluative next-state signals into dense scalar process rewards, while our Hindsight-Guided OPD converts directive signals into token-level advantage supervision by extracting textual hints from the next state and constructing an enhanced teacher context, where rich textual feedback provides directional guidance for improvement.

• Empirical validation across personal and general agents. We validate OpenClaw-RL in experiments on both personal-agent personalization and agentic RL across terminal, GUI, SWE, and tool-call settings. We provide evidence that binary RL and Hindsight-Guided OPD are complementary, and that their weighted combination (sketched after this list) yields significant gains for personal agents. We also validate the effectiveness of integrating process and outcome rewards in the general-agent RL setting.
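One plausible way to combine the two signals is a weighted sum of a PRM-reward policy-gradient term and the hindsight-guided distillation term. The mixing weight and the exact policy-gradient form below are assumptions, sketched only to make the combination concrete; the hindsight_opd_loss helper refers to the earlier directive-signal sketch.

```python
import torch


def combined_loss(
    logp_actions: torch.Tensor,   # [batch, seq] log-probs of the sampled response tokens
    advantages: torch.Tensor,     # [batch] advantages derived from PRM process rewards
    response_mask: torch.Tensor,  # [batch, seq], 1 on response tokens
    opd_loss: torch.Tensor,       # scalar, e.g. hindsight_opd_loss(...) from above
    lam: float = 0.5,             # assumed mixing weight between the two signals
) -> torch.Tensor:
    """Weighted combination of binary-RL and hindsight-guided OPD objectives."""
    # REINFORCE-style term: push up tokens of responses with positive advantage.
    per_token_pg = -(logp_actions * advantages.unsqueeze(-1)) * response_mask
    pg_loss = per_token_pg.sum() / response_mask.sum().clamp(min=1)
    return (1.0 - lam) * pg_loss + lam * opd_loss
```

In this form, lam = 0 reduces to PRM-reward RL alone and lam = 1 to pure hindsight-guided distillation, which is one way to read the complementarity claim.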