What infrastructure decouples generation from training in asynchronous agent loops?

This explores the systems-level question of how a learning agent can keep acting and generating data while its model is being trained at the same time — rather than freezing one to do the other.

The corpus has a direct answer and a set of adjacent ideas that reframe what 'decoupling' even buys you. The clearest piece of infrastructure is AReaL's fully asynchronous RL design (see "Can RL training run while generation continues without waiting?"): generation workers keep producing rollouts continuously while a separate trainer updates weights, and a modified PPO absorbs the fact that samples now arrive 'stale', that is, generated by a slightly older model version than the one currently training. The payoff is high GPU utilization and, importantly, practical multi-turn RL, where a single long episode would otherwise stall the whole pipeline waiting on the slowest trajectory.
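The generation/training split described above can be sketched as a producer/consumer loop in which each rollout is tagged with the policy version that produced it. This is a toy illustration, not AReaL's code: the names `Rollout`, `VersionedPolicy`, `generator`, `trainer`, and `run`, and the `max_staleness` cutoff, are all assumptions made for the example, and the "gradient step" is just a version bump. Dropping over-stale samples is one simple policy; the modified PPO mentioned above is a different mechanism that corrects for staleness in the loss rather than discarding data.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this sample
    payload: str         # stand-in for a full trajectory


class VersionedPolicy:
    """Shared handle: the trainer bumps the version, generators read it."""

    def __init__(self):
        self._version = 0
        self._lock = threading.Lock()

    def version(self):
        with self._lock:
            return self._version

    def bump(self):
        with self._lock:
            self._version += 1
            return self._version


def generator(policy, buf, stop):
    """Keep producing rollouts until told to stop; never waits on training."""
    i = 0
    while not stop.is_set():
        sample = Rollout(policy.version(), f"traj-{i}")
        try:
            buf.put(sample, timeout=0.01)
            i += 1
        except queue.Full:
            pass  # trainer is behind; retry with a freshly tagged sample


def trainer(policy, buf, steps, batch_size, max_staleness):
    """Consume rollouts, skipping any that lag the current version too far."""
    used, dropped = [], 0
    for _ in range(steps):
        batch = []
        while len(batch) < batch_size:
            r = buf.get()
            staleness = policy.version() - r.policy_version
            if staleness <= max_staleness:
                batch.append((r, staleness))
            else:
                dropped += 1
        used.extend(batch)
        policy.bump()  # stand-in for one weight update
    return used, dropped


def run(steps=5, batch_size=4, max_staleness=2, n_generators=2):
    policy = VersionedPolicy()
    buf = queue.Queue(maxsize=16)
    stop = threading.Event()
    workers = [
        threading.Thread(target=generator, args=(policy, buf, stop), daemon=True)
        for _ in range(n_generators)
    ]
    for w in workers:
        w.start()
    used, dropped = trainer(policy, buf, steps, batch_size, max_staleness)
    stop.set()
    for w in workers:
        w.join()
    return policy.version(), used, dropped
```

Calling `run()` returns the final policy version, the list of `(rollout, staleness)` pairs that were trained on, and a count of dropped samples. The bounded queue is the only coupling point: generators slow down only when the buffer is full, and the trainer waits only when it is empty.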

What makes this matter shows up in the papers about *why* on-policy interaction is worth the engineering trouble. Agents trained only on static expert demonstrations are capped by the imagination of whoever built the dataset: they never see their own failures or anything outside the demonstrated scenarios (see "Can agents learn beyond what their training data shows?"). Asynchronous loops exist precisely so an agent can learn from its own live experience without the throughput penalty. Pushing that logic further, one line of work argues the training signal doesn't even need to be curated: every action an agent takes produces a next-state signal (a user reply, a tool output, an error, a changed screen) that can feed the policy directly, unifying all agent training under one continuous loop (see "Can agent deployment itself generate training signals automatically?").

The most surprising adjacent move is to decouple generation from training by removing weight updates from the critical path entirely. AgentFly reframes learning as memory operations over a Memory-augmented MDP: case, subtask, and tool memories carry credit assignment and policy improvement while the model's parameters stay frozen (see "Can agents learn continuously from experience without updating weights?"). Here 'generation' and 'learning' aren't two synchronized processes to interleave; learning lives in an external store the agent reads and writes during normal operation. SkillOS shows a related split on the skill-library side: a trainable curator evolves the repository while the executor stays frozen, so the thing that improves and the thing that runs are different components on different update clocks (see "Can a separate trained curator improve skill libraries better than frozen agents?").
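A minimal sketch of the "learning lives in an external store" idea, covering only a case memory. It is not AgentFly's implementation: `Case`, `CaseMemory`, `jaccard`, and `frozen_policy` are hypothetical names, word-overlap retrieval is an assumption chosen for simplicity, and the frozen model is replaced by a plain function whose code never changes.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    task: str
    action: str
    reward: float


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a learned retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


@dataclass
class CaseMemory:
    cases: list = field(default_factory=list)

    def write(self, task, action, reward):
        # Learning step: record what was tried and how it went.
        self.cases.append(Case(task, action, reward))

    def read(self, task, k=1):
        # Retrieval step: return the k most similar past cases.
        ranked = sorted(self.cases, key=lambda c: jaccard(task, c.task), reverse=True)
        return ranked[:k]


def frozen_policy(task, retrieved):
    """Stand-in for a frozen LLM: behaviour changes only via what is retrieved."""
    good = [c for c in retrieved if c.reward > 0]
    if good:
        return max(good, key=lambda c: c.reward).action
    return "default_action"


memory = CaseMemory()
task = "parse a csv report"
before = frozen_policy(task, memory.read(task))   # nothing learned yet
memory.write("parse csv file", "use_csv_tool", 1.0)
memory.write("send email", "use_smtp", 0.0)
after = frozen_policy(task, memory.read(task))    # same policy, new memory
```

The policy function is identical before and after; only the memory contents differ, which is why no training job has to be scheduled around generation in this style of system.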

One more enabler is worth knowing about, because asynchronous RL is bottlenecked by reward latency, not just rollout latency. If every reward requires actually executing code, the trainer waits. Execution-free verification, meaning structured reasoning that reaches ~93% accuracy judging whether two code patches are equivalent, crosses the reliability bar to serve as an RL reward signal without running anything (see "Can structured reasoning replace code execution for RL rewards?"). That's infrastructure too: it decouples the reward from the runtime, the same way async training decouples the update from the rollout.

The thread running through all of these: 'decoupling generation from training' isn't one trick but a design axis. You can stagger the two in time (async RL with stale-sample-tolerant PPO), move learning out of the weights into memory or a curated library, or cut the reward's dependence on execution — and the right choice depends on which synchronization point is actually stalling your loop.

Sources 6 notes

Can RL training run while generation continues without waiting?

AReaL enables continuous generation across workers while training runs on mixed model versions using modified PPO. The system achieves high GPU utilization and handles stale samples effectively, making multi-turn RL practical.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agent deployment itself generate training signals automatically?

Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
