Can curriculum approaches teach agents when to stop exploring?

This explores whether curriculum learning — staging tasks from easy to hard, or stretching how long an agent gets to act — can teach agents the judgment of when enough exploration is enough, rather than just how to explore more.

The corpus has no paper that names this exact problem, but it has the pieces that frame it, and those pieces pull in two opposing directions worth seeing side by side.

The strongest direct evidence that curriculum shapes exploration comes from test-time interaction scaling. Training agents with a curriculum that grows the number of environment steps they are allowed to take produces state-of-the-art web agents, because longer rollouts let them explore, backtrack, and replan in ways that per-step reasoning cannot (see "Does agent interaction time scale separately from reasoning depth?"). Here the curriculum teaches the agent how to *spend* an exploration budget and, implicitly, where to stop spending it. VOYAGER shows a second flavor: an *automatic* curriculum that keeps proposing new goals drives continual exploration, while a skill library lets the agent bank what it learns instead of re-deriving it (see "Can agents learn new skills without forgetting old ones?"). In both cases the curriculum is what regulates the explore/exploit rhythm.
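A rollout-length curriculum of the kind described above can be sketched as a schedule that raises the step budget in stages. This is only an illustration: the class name `HorizonCurriculum`, the multiplicative growth rule, and every default value are assumptions, since the cited note does not specify how the horizon is increased.

```python
from dataclasses import dataclass


@dataclass
class HorizonCurriculum:
    """Hypothetical schedule that grows the allowed rollout length in stages."""

    start_steps: int = 5        # assumed initial step budget per episode
    max_steps: int = 40         # assumed cap on the step budget
    growth: float = 2.0         # assumed multiplicative growth per stage
    iters_per_stage: int = 100  # assumed training iterations spent at each stage

    def horizon(self, iteration: int) -> int:
        """Return the maximum number of environment steps at this iteration."""
        stage = iteration // self.iters_per_stage
        return min(self.max_steps, int(self.start_steps * self.growth ** stage))


curriculum = HorizonCurriculum()
for it in (0, 100, 200, 300, 400):
    print(it, curriculum.horizon(it))
```

A training loop would call `horizon(iteration)` before each batch of rollouts and truncate episodes at that length, so early training rewards short, decisive trajectories and later training leaves room for backtracking.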

But here is the twist the corpus surfaces: optimizing an agent with reinforcement learning can *erode* its sense of when to keep exploring. RL training compresses behavioral diversity in search agents through entropy collapse: policies converge onto a few narrow reward-maximizing strategies and stop probing alternatives (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So a naive curriculum that just rewards success can teach an agent to stop exploring *too early*, locking it into a comfortable strategy. The fix that note points to, supervised fine-tuning on diverse demonstrations to preserve exploration breadth, is itself a kind of curriculum design choice.
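One rough way to watch for this kind of collapse is to track the entropy of the actions an agent takes across a batch of rollouts. The sketch below is an assumption-laden illustration, not the method of the cited note: the function names, the action labels, and the `floor` threshold are all hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in nats) of the empirical action distribution."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def looks_collapsed(actions, num_action_types, floor=0.5):
    """Flag a batch whose entropy is below `floor` times the uniform maximum.

    `floor` is an arbitrary illustrative threshold, not a published value.
    """
    max_entropy = math.log(num_action_types)
    return action_entropy(actions) < floor * max_entropy


# Hypothetical batches: early training is varied, late training is repetitive.
early = ["search", "click", "back", "scroll"] * 5
late = ["search"] * 19 + ["click"]
print(looks_collapsed(early, num_action_types=4))
print(looks_collapsed(late, num_action_types=4))
```

A curriculum could use a signal like this to decide when to mix diverse demonstrations back into training, though where to set the threshold would have to be found empirically.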

The opposite failure is just as real: exploring forever along one path. Abstractions that enforce breadth-first exploration outperform pouring all your compute into deeper and deeper single chains, precisely because depth-only reasoning runs into an 'underthinking' failure where the agent keeps going without ever stepping back (see "Can abstractions guide exploration better than depth alone?"). Read against the entropy-collapse note, you get the real shape of 'when to stop': it is a balance between collapsing too soon and drilling too long, and the structure you train against (abstractions, interaction budgets, diversity-preserving data) is what tunes that balance.
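The two-sided balance can be made concrete with a toy stopping rule that has one guard for each failure. Everything here is hypothetical and comes from none of the cited notes: the per-step progress scores, `min_steps`, and `patience` are illustrative stand-ins for whatever signal a real agent would learn.

```python
def should_stop(scores, min_steps=3, patience=2):
    """Decide whether to abandon the current line of exploration.

    scores: hypothetical progress score after each step so far (higher is better).
    min_steps: guard against stopping too early (collapsing too soon).
    patience: guard against continuing without progress (drilling too long).
    """
    if len(scores) < min_steps:
        return False  # not enough evidence yet to justify quitting
    best_index = max(range(len(scores)), key=scores.__getitem__)
    steps_since_best = len(scores) - 1 - best_index
    return steps_since_best >= patience


print(should_stop([0.1, 0.2]))            # too few steps to judge
print(should_stop([0.1, 0.2, 0.3]))       # still improving
print(should_stop([0.1, 0.3, 0.2, 0.2]))  # no new best for two steps
```

A hand-written rule like this fixes the thresholds in advance; the point of the curriculum approaches discussed here is that the agent would learn where those thresholds sit from interaction rather than have them set for it.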

One boundary worth knowing: curriculum can only teach this if the agent actually gets to act and fail. Agents trained purely on static expert demonstrations never interact with an environment, so their competence, including any sense of when to quit exploring, is capped by whatever scenarios the curators imagined rather than learned from experience (see "Can agents learn beyond what their training data shows?"). So the honest answer is yes, curriculum approaches can shape *when* an agent stops exploring, but only the interactive, reward-shaped kind, and only if they are explicitly designed to fight entropy collapse on one side and runaway depth on the other.

Sources 5 notes

Does agent interaction time scale separately from reasoning depth?

Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
