INQUIRING LINE

How do evaluative versus directive signals differ in next-state training?

This explores the distinction made by [[agent-next-state-signals-decompose-into-evaluative-and-directive-information-tha]] — that feedback used to train what a model does next splits into 'how well did that go' (evaluative) versus 'here's how to change it' (directive) — and what the rest of the corpus reveals about training on each kind.


This question reads as: when you train a model on what to do next, signals come in two flavors — evaluative (a score telling you how good an action was) and directive (an instruction telling you how the action should change) — and these aren't interchangeable. The core insight from Can scalar rewards capture all the information in agent feedback? is that a scalar reward captures the evaluative part but throws away the directive part: a number can say 'that was a 3/10' but not 'you forgot to check the file path first.' The two are orthogonal and complementary, and token-level distillation can recover the directional detail that a reward number flattens away.
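To make the split concrete, here is a minimal sketch of the two signals acting on one token choice under a softmax policy. It is a toy under stated assumptions, not the method from the note: the three-token vocabulary, the reward of -1, and the one-hot teacher target are all hypothetical, and the directive signal is modeled as a plain cross-entropy pull toward a teacher distribution, one simple form of token-level distillation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def evaluative_grad(logits, sampled, reward):
    """Gradient of reward * log p(sampled) w.r.t. the logits (REINFORCE-style)."""
    p = softmax(logits)
    return [reward * ((1.0 if i == sampled else 0.0) - p[i]) for i in range(len(p))]

def directive_grad(logits, target):
    """Gradient of sum_i target[i] * log p[i] w.r.t. the logits: target - p."""
    p = softmax(logits)
    return [t - q for t, q in zip(target, p)]

# Hypothetical vocabulary for a single decision point.
tokens = ["skip_check", "check_path", "retry"]
logits = [0.0, 0.0, 0.0]  # uniform policy, chosen for readability

# Evaluative: the model sampled "skip_check" and was told "that was bad" (-1).
ev = evaluative_grad(logits, sampled=0, reward=-1.0)

# Directive: a teacher says the token here should have been "check_path".
di = directive_grad(logits, target=[0.0, 1.0, 0.0])
```

The evaluative gradient pushes `skip_check` down and lifts `check_path` and `retry` by the same amount, because the number never said which alternative was right. The directive gradient lifts `check_path` alone and pushes the other two down.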

Once you see that split, a lot of the corpus rearranges itself around it. Pure evaluative training turns out to be lopsided in an interesting way: Does negative reinforcement alone outperform full reinforcement learning? finds that training only on 'that was wrong' signals consistently improves Pass@k and often matches full PPO and GRPO, because suppressing incorrect trajectories preserves diversity while positive-only reinforcement concentrates probability mass and degrades performance at higher k. That's evaluative feedback at its most minimal, just a thumbs-down, and it still works, which hints that the full evaluative channel carries less information than we assume: most of what it teaches is already in the thumbs-down.
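The diversity argument can be seen in a single-sample toy. The sketch below takes one policy-gradient step on a softmax over four hypothetical answers, once rewarding a sampled correct answer and once penalizing a sampled wrong one. The logits, learning rate, and answer labels are assumptions made for illustration, and this is not a reproduction of the experiments in the note.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update(logits, sampled, reward, lr=1.0):
    """One step on a single sample: logits += lr * reward * (onehot(sampled) - p)."""
    p = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == sampled else 0.0) - p[i])
        for i, x in enumerate(logits)
    ]

# Hypothetical answers: index 0 (A) and 1 (B) are correct, 2 (C) and 3 (D) are wrong.
logits = [0.5, 0.0, 1.0, 0.0]
before = softmax(logits)

# Positive-only: the model sampled A and is rewarded for it.
after_positive = softmax(update(logits, sampled=0, reward=+1.0))

# Negative-only: the model sampled C and is penalized for it.
after_negative = softmax(update(logits, sampled=2, reward=-1.0))
```

Rewarding A takes probability away from every other answer, including B, the correct answer that happened not to be sampled. Penalizing C hands its mass back to all the other answers, so both A and B gain. The unpenalized wrong answer D gains too, which is why the thumbs-down has to keep arriving for the other wrong answers.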

The limits of scalar evaluation show up most sharply where the thing being judged is subjective. Can breaking down instructions into checklists improve AI reward signals? breaks a vague 'how good was this answer' into a checklist of concrete, verifiable sub-criteria — which is really a way of smuggling directive structure into an evaluative signal, so the model learns *what specifically* to fix rather than just *how much* it missed by. And Does preference optimization harm conversational understanding? is the cautionary tale: when you optimize purely on a preference score (RLHF rewarding confident single-turn answers), the model learns to look good on the metric while quietly losing the grounding behaviors — asking clarifying questions, checking understanding — that the scalar never measured. Evaluative-only training optimizes what it can score and erodes what it can't.
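As a rough sketch of the checklist idea, the snippet below scores a response against a list of verifiable sub-criteria and returns both the score and the items that failed. Everything in it is hypothetical: the `Criterion` type, the `checklist_reward` function, and the string-matching checks are stand-ins for whatever judges or verifiers a real checklist method would use, and it does not reproduce RLCF or RaR.

```python
from typing import Callable, List, NamedTuple, Tuple

class Criterion(NamedTuple):
    description: str
    check: Callable[[str], bool]  # returns True if the response satisfies the item

def checklist_reward(response: str, criteria: List[Criterion]) -> Tuple[float, List[str]]:
    """Return (fraction of criteria passed, descriptions of the failed ones)."""
    failed = [c.description for c in criteria if not c.check(response)]
    score = 1.0 - len(failed) / len(criteria)
    return score, failed

# Hypothetical criteria for an instruction such as
# "Explain briefly how to load the config, and mention the file path."
criteria = [
    Criterion("mentions the file path", lambda r: "path" in r.lower()),
    Criterion("is at most 20 words", lambda r: len(r.split()) <= 20),
    Criterion("ends with a full stop", lambda r: r.strip().endswith(".")),
]

score, failed = checklist_reward("Open the file and parse it.", criteria)
```

The score is still a single number, but the list of failed items is the directive part: it names what to fix, which a holistic 'that was a 3/10' cannot do.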

The most striking move in the corpus is models generating their own feedback signal. Can models learn to evaluate their own work during training? trains a model to write its own self-assessment in the unused space after its output, internalizing the evaluator so it doesn't depend on an external reward model, which collapses the evaluative/directive boundary by making the model both judge and instructee. Two-phase RL adds a temporal twist: Does RL training follow a predictable two-phase learning sequence? shows training first consolidates execution correctness (where evaluative 'right/wrong' is the right signal) and only later shifts the bottleneck to strategic planning (where directive 'do it this way instead' matters most), so which signal helps you depends on which phase you're in.

If you walk away with one thing, let it be this: a scalar reward isn't a smaller version of feedback, it's a *different kind* of information, and the directive part you discarded is often the part that would have taught the model something new.


Sources (6 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
