How do task stream groupings provide long-horizon learning signals for curation decisions?

This explores how organizing an agent's experience into task streams — rather than treating each task in isolation — gives a learning system the longer-range feedback it needs to decide what's worth keeping in a skill library.

The clearest answer in the corpus comes from SkillOS (see "Can a separate trained curator improve skill libraries better than frozen agents?"), which splits the agent in two: a frozen executor that does the work, and a separately trained curator that edits the skill repository. The trick is that the curator isn't rewarded on a single task's success; it's optimized across grouped task streams, so it learns which library edits pay off later, not just now. That's why its repositories drift away from generic verbose additions toward actionable execution logic and cross-task meta-strategies: long-horizon grouping rewards skills that transfer, and punishes one-off bloat that looks useful in the moment but never gets reused.
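To make the "pays off later, not just now" idea concrete, here is a minimal sketch of how a stream-level signal could be computed. It is an illustration under stated assumptions, not SkillOS's actual reward: the function name `reward_to_go`, the discount `gamma`, and the convention that the curator edits the library once after each task are all hypothetical.

```python
from typing import List


def reward_to_go(successes: List[float], gamma: float = 1.0) -> List[float]:
    """Credit each curation step with the outcomes of LATER tasks in the stream.

    successes[t] is the executor's outcome on task t (1.0 = solved, 0.0 = failed).
    The edit made after task t is scored only by tasks t+1, t+2, ..., optionally
    discounted by gamma. (Hypothetical scheme, for illustration only.)
    """
    returns = [0.0] * len(successes)
    running = 0.0
    # Walk backwards so each step accumulates what happens after it.
    for t in range(len(successes) - 1, -1, -1):
        returns[t] = running
        running = successes[t] + gamma * running
    return returns


# One stream of four related tasks: solved, failed, solved, solved.
stream = [1.0, 0.0, 1.0, 1.0]
print(reward_to_go(stream))             # [2.0, 2.0, 1.0, 0.0]
print(reward_to_go(stream, gamma=0.5))  # [0.75, 1.5, 1.0, 0.0]
```

Under this scheme the edit made after the failed second task still earns credit, because the tasks that follow it succeed. A per-task reward would have scored that edit at zero, which is the gap that stream grouping is meant to close.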

What makes this work is a separation that recurs all over the collection: the thing that acts and the thing that learns-to-curate are decoupled. You see the same architecture in agent memory systems that mine past trajectories for reusable sub-task routines (see "Can agents learn reusable sub-task routines from past experience?"). Notably, the gains there grow *larger* as the gap between training and test widens, which is exactly the long-horizon payoff a curator is trying to capture. The lesson is consistent: routines abstracted at finer-than-whole-task granularity and compounded over time beat memorizing whole solutions.

There's also a question of *what signal* the curator should listen to. Outcome-only rewards (did the final answer come out right?) turn out to be a weak teacher. Process-level supervision, which scores the intermediate steps, substantially outperforms it (see "Does supervising retrieval steps outperform final answer rewards?"), especially when you contrast good and bad chains directly rather than rewarding success alone. Task-stream grouping is a way of manufacturing that richer signal at a longer timescale: instead of grading one retrieval step, you're grading whether a curated skill earned its place across a whole family of tasks.
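The contrastive idea can be carried over to curation as a small sketch: given candidate library edits that have each been scored over a task stream, pair the better edit against the worse one to form preference data. This is an assumed illustration, not the method of the cited note or of SkillOS. The function `build_preference_pairs`, the `margin` parameter, and the example edit names are hypothetical, and the sketch only builds the pairs; it does not train anything.

```python
from itertools import combinations
from typing import List, Tuple


def build_preference_pairs(
    scored_edits: List[Tuple[str, float]], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Turn stream-scored edits into (chosen, rejected) pairs.

    scored_edits holds (edit_name, stream_score) tuples, where stream_score is
    any long-horizon score for the edit (hypothetical). A pair is emitted only
    when the score gap exceeds `margin`, so ties carry no preference.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_edits, 2):
        if score_a - score_b > margin:
            pairs.append((a, b))
        elif score_b - score_a > margin:
            pairs.append((b, a))
    return pairs


# Hypothetical edits with made-up stream scores, for illustration only.
edits = [
    ("compact_login_routine", 3.0),
    ("verbose_one_off_note", 0.0),
    ("retry_meta_strategy", 3.0),
]
print(build_preference_pairs(edits))
# [('compact_login_routine', 'verbose_one_off_note'),
#  ('retry_meta_strategy', 'verbose_one_off_note')]
```

Pairs of this shape are what a preference-based objective such as DPO consumes. The design choice illustrated is that the bad edit is used as an explicit negative instead of simply being ignored.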

Why does any of this generalize rather than just overfit the streams it saw? The corpus offers a deeper reason in the analysis of pretraining data (see "Does procedural knowledge drive reasoning more than factual retrieval?"): reasoning rides on broad, transferable *procedural* knowledge, while factual recall depends on narrow memorization. A curator optimized over task streams is, in effect, being pushed toward the procedural end, capturing the how-to that travels, not the this-exact-case that doesn't. That also explains the failure mode it's avoiding: chain-of-thought that imitates the *form* of reasoning collapses outside its training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"), so a curator rewarded only on near-term, in-distribution wins would happily hoard skills that look right and break later.

The thing worth carrying away: the value of a curated skill isn't visible inside the task that produced it — it only shows up later, across other tasks. Task-stream grouping is the mechanism that makes that delayed value measurable, and decoupling the curator from the executor is what lets a system act on it.

Sources (5 notes)

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
