INQUIRING LINE

How do composite rewards attribute curation outcomes to specific skill library changes?

This explores a credit-assignment problem: when a system rewards the curation of a skill library and the curated repository performs better, how does a multi-part reward signal figure out which specific edit to the library deserves the credit?


The corpus has no single paper that names this exact mechanism, but it holds the two halves you'd need to build it, sitting in different neighborhoods under different vocabulary.

The curation half is SkillOS, which shows that separating a *trainable* curator from a *frozen* executor lets the curator learn to evolve a skill repository, shifting it away from generic, verbose additions toward actionable execution logic and reusable meta-strategies [Can a separate trained curator improve skill libraries better than frozen agents?]. The decoupling matters for attribution: because the executor is frozen, a change in outcome on the same tasks can be traced to a curator action rather than being confounded by a policy that drifts at the same time. That's the structural precondition for crediting a library edit at all.
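A minimal sketch of why the frozen executor helps, assuming a hypothetical `executor(library, task)` scoring function and a toy library of skill names (neither comes from SkillOS): with the executor and the task set held fixed, the score difference before and after one edit can be read as that edit's contribution.

```python
from typing import Callable, FrozenSet, Sequence

Library = FrozenSet[str]
# Hypothetical interface: a frozen executor maps (library, task) to a score.
Executor = Callable[[Library, str], float]


def edit_delta(executor: Executor, before: Library, after: Library,
               tasks: Sequence[str]) -> float:
    """Mean per-task score change between two library versions.

    The executor and the task set are identical on both sides, so the only
    thing that differs between the two evaluations is the library edit.
    """
    if not tasks:
        raise ValueError("need at least one task")
    deltas = [executor(after, t) - executor(before, t) for t in tasks]
    return sum(deltas) / len(deltas)


def toy_executor(library: Library, task: str) -> float:
    """Toy stand-in (an assumption): a task succeeds iff a same-named skill exists."""
    return 1.0 if task in library else 0.0


before = frozenset({"parse_csv"})
after = before | {"retry_http"}          # the single edit being credited
tasks = ["parse_csv", "retry_http", "render_pdf"]

delta = edit_delta(toy_executor, before, after, tasks)
print(f"score change attributed to the edit: {delta:.3f}")
```

With a stochastic executor the same comparison would need repeated runs or fixed seeds before the delta means much; the sketch ignores that.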

The reward half is really a credit-assignment literature wearing the 'process supervision' label. A single outcome reward at the end of a long trajectory can't tell you which step earned it, so several methods manufacture step-level signal from the structure of the rollout itself. Tree-search rollouts compare sibling subtrees to turn a trajectory reward into per-step preferences [Can tree structure alone convert outcome rewards into process supervision?], and a broader family exploits tree topology, expert-aligned actions, or tool-call positions to do the same without a separately trained process reward model [Can trajectory structure replace hand-annotated process rewards?]. In agentic RAG, fine-grained feedback on intermediate retrieval steps significantly outperforms final-answer-only rewards [Does supervising retrieval steps outperform final answer rewards?]. Read against the skill-library question, a curated repository can be treated as a trajectory of edits, and the same trick (compare variants that differ in one edit, attribute the delta to that edit) is what would localize credit to a specific library change.
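The sibling-comparison idea can be sketched for library edits as follows. This is an illustration in the spirit of sibling-subtree comparison, not Tree-GRPO's actual objective; the `EditNode` type, the edit names, and the rewards are all made up, and edit names are assumed to be unique within the tree.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class EditNode:
    """One library edit; leaves carry the outcome reward of the final library."""
    edit: str
    children: List["EditNode"] = field(default_factory=list)
    reward: Optional[float] = None  # set only on leaves


def subtree_value(node: EditNode) -> float:
    """Value of a node: its leaf reward, or the mean value of its children."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.edit!r} has no reward")
        return node.reward
    return mean(subtree_value(c) for c in node.children)


def sibling_advantages(root: EditNode) -> Dict[str, float]:
    """Credit each edit with its subtree value minus the mean over its siblings."""
    credit: Dict[str, float] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            values = [subtree_value(c) for c in node.children]
            baseline = mean(values)
            for child, value in zip(node.children, values):
                credit[child.edit] = value - baseline
            stack.extend(node.children)
    return credit


root = EditNode("start", [
    EditNode("add_retry_skill", [
        EditNode("merge_duplicates", reward=0.9),
        EditNode("add_verbose_notes", reward=0.5),
    ]),
    EditNode("add_generic_tips", [
        EditNode("rename_skill", reward=0.25),
    ]),
])

credit = sibling_advantages(root)
for name, adv in sorted(credit.items()):
    print(f"{name}: {adv:+.3f}")
```

Siblings share every earlier edit and differ only in the one being compared, which is what lets a trajectory-level reward be read as a per-edit signal. An edit with no siblings gets zero credit, since there is nothing to compare it against.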

The 'composite' part is where it gets interesting, because a curation reward is rarely one number. You're usually balancing things like generality, executability, and non-redundancy at once. Two notes warn about how that composition behaves. DVAO argues you shouldn't hand-tune fixed weights for competing objectives: weight each by its empirical within-group variance instead, which automatically amplifies the high-signal objective and mutes noise [How should multiple reward objectives be weighted during training?]. And DRO makes a sharper point about the *kind* of signal: some criteria work better as gates that accept or reject rollout groups than as dense rewards you optimize against, because converting a categorical rubric into a scalar invites reward hacking [Can rubrics and dense rewards work together without hacking?]. So a well-built composite reward for curation isn't a plain sum. It's a mix of gates (is this edit even valid?) and variance-weighted dense terms (how much did it help?).
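A rough sketch of that gate-plus-weighted-terms shape, under loud assumptions: the weights here are simply each objective's population variance within the group, normalized to sum to one, which simplifies the idea and is not DVAO's actual formula; the gate is applied per candidate edit rather than per rollout group as in DRO; and the candidate fields (`parses`, `generality`, `executability`) are hypothetical.

```python
from statistics import pvariance
from typing import Callable, Dict, List, Optional, Sequence

Candidate = Dict[str, object]


def variance_weights(scores: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Weight each objective by its share of within-group variance (a simplification)."""
    variances = {name: pvariance(vals) for name, vals in scores.items()}
    total = sum(variances.values())
    if total == 0:
        # No objective separates the candidates: fall back to uniform weights.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: v / total for name, v in variances.items()}


def composite_rewards(
    candidates: List[Candidate],
    gates: List[Callable[[Candidate], bool]],
    objectives: Dict[str, Callable[[Candidate], float]],
) -> Dict[str, Optional[float]]:
    """Gates reject outright (reward None); survivors get a variance-weighted score."""
    valid = [c for c in candidates if all(g(c) for g in gates)]
    rewards: Dict[str, Optional[float]] = {str(c["name"]): None for c in candidates}
    if not valid:
        return rewards
    scores = {name: [f(c) for c in valid] for name, f in objectives.items()}
    weights = variance_weights(scores)
    for i, c in enumerate(valid):
        rewards[str(c["name"])] = sum(weights[n] * scores[n][i] for n in objectives)
    return rewards


candidates = [
    {"name": "A", "parses": True, "generality": 0.5, "executability": 0.2},
    {"name": "B", "parses": True, "generality": 0.5, "executability": 0.8},
    {"name": "C", "parses": False, "generality": 0.9, "executability": 0.9},
]
gates = [lambda c: bool(c["parses"])]
objectives = {
    "generality": lambda c: float(c["generality"]),
    "executability": lambda c: float(c["executability"]),
}

rewards = composite_rewards(candidates, gates, objectives)
print(rewards)
```

In this toy group, generality is identical across the valid candidates, so it carries no signal and receives zero weight; executability decides the ranking. Candidate C never reaches the weighted sum, however high its dense scores are, because the gate rejects it first.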

The note that reframes the whole question is the one arguing that a scalar reward is an incomplete container: agent feedback decomposes into an *evaluative* signal (how well the edit did) and a *directive* one (how it should change), and a single number throws the directive part away [Can scalar rewards capture all the information in agent feedback?]. That's the thing you didn't know you wanted to know. 'Attributing an outcome to a skill change' may be asking the reward to carry information it structurally can't, and the richer move is to keep the directive signal alongside the scalar instead of collapsing everything into one attributable number.
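To make the evaluative/directive split concrete, here is a small illustrative data structure. The `EditFeedback` type, its fields, and the example strings are hypothetical, not drawn from the cited note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditFeedback:
    """Feedback on one library edit, keeping both kinds of signal."""
    edit: str
    score: float     # evaluative: how well the edit did
    directive: str   # directive: how the edit should change


def to_scalar(feedback: EditFeedback) -> float:
    """Collapse feedback to a reward; the directive text is discarded."""
    return feedback.score


a = EditFeedback("add_retry_skill", 0.4, "state the backoff schedule explicitly")
b = EditFeedback("add_retry_skill", 0.4, "split into separate HTTP and database skills")

# Same scalar reward, different advice: the scalar cannot tell these apart.
print(to_scalar(a) == to_scalar(b), a.directive == b.directive)
```

Two pieces of feedback that would push the curator in different directions become indistinguishable once only the score is kept, which is the information loss described above.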


Sources (7 notes)

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

How should multiple reward objectives be weighted during training?

DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
