Can evaluation trajectories and interaction histories replace single-answer scoring?
This explores whether richer evaluation signals — reasoning traces, multi-turn feedback, decomposed attributes, interaction histories — can replace the single scalar score we usually use to judge an answer; the corpus says they substantially can, and shows several places where the single number actively hides what's going on.
The starting point is that a single final-answer score is dangerously lossy. The clearest demonstration: supervised fine-tuning can raise benchmark accuracy while *cutting* the quality of the reasoning steps (measured as Information Gain) by 38.9%, because models learn to reach the right answer through post-hoc rationalization rather than genuine inference, and a metric that only checks the final answer never notices (Does supervised fine-tuning improve reasoning or just answers?). The same blind spot lets imitation models fool human evaluators by copying a confident, fluent style while closing no real capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?). Single-answer scoring measures the surface and misses the substance.
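To make the blind spot concrete, here is a minimal sketch in which two trajectories earn the same final-answer score but very different step-level scores. The `Trajectory` type and the per-step numbers are hypothetical stand-ins invented for illustration; they are not the Information Gain metric from the note.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical per-step quality scores in [0, 1]; illustrative only.
    steps: list[float]
    final_correct: bool


def final_answer_score(t: Trajectory) -> float:
    """Single-answer scoring: looks only at the final answer."""
    return 1.0 if t.final_correct else 0.0


def step_quality(t: Trajectory) -> float:
    """Trajectory scoring: mean quality over the reasoning steps."""
    return sum(t.steps) / len(t.steps) if t.steps else 0.0


# Both reach the right answer, but only one reasons its way there.
genuine = Trajectory(steps=[0.9, 0.8, 0.9], final_correct=True)
rationalized = Trajectory(steps=[0.2, 0.3, 0.1], final_correct=True)

for name, t in [("genuine", genuine), ("rationalized", rationalized)]:
    print(name, final_answer_score(t), round(step_quality(t), 2))
```

The final-answer column is identical for both rows, so any comparison built on it alone cannot separate the two cases.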
The most direct replacement is to make the *evaluator itself* reason before it scores. Three independent teams (RRM, RM-R1, DeepSeek-GRM) converged on adding chain-of-thought traces ahead of reward scoring, which lets evaluation scale with test-time compute and pushes the capability ceiling of reward models past what outcome-only scoring can reach (Can reward models benefit from reasoning before scoring?). The reason this matters is captured by work showing that numerical rewards simply lack the information about *why* a solution failed: when models stuck on a performance plateau are handed natural-language critiques instead of a scalar, they start producing correct solutions again (Can natural language feedback overcome numerical reward plateaus?). A trajectory of critique carries signal a single number structurally cannot.
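A rough sketch of the reason-then-score pattern follows. It assumes a judge that returns a reasoning trace alongside its score, and it spends extra test-time compute in one simple way: sampling several judgments and averaging them. The `toy_judge` function is a deterministic placeholder for a real reward model, and the averaging rule is an assumption, not the method of any of the cited systems.

```python
from statistics import mean
from typing import Callable

# A judgment pairs a reasoning trace with the score it supports.
Judgment = tuple[str, float]


def score_with_reasoning(
    judge: Callable[[str, str, int], Judgment],
    prompt: str,
    answer: str,
    n_samples: int = 4,
) -> tuple[float, list[str]]:
    """Sample n_samples reasoned judgments and average their scores.

    Raising n_samples is the knob that trades compute for a steadier score.
    The traces are returned too, so the "why" is not thrown away.
    """
    judgments = [judge(prompt, answer, seed) for seed in range(n_samples)]
    traces = [trace for trace, _ in judgments]
    return mean(score for _, score in judgments), traces


def toy_judge(prompt: str, answer: str, seed: int) -> Judgment:
    """Hypothetical stand-in for a reasoning reward model."""
    correct = answer.strip() == "4"
    trace = f"[sample {seed}] Recomputed 2 + 2 and compared with {answer!r}."
    return trace, 1.0 if correct else 0.0


score, traces = score_with_reasoning(toy_judge, "What is 2 + 2?", "4")
print(score, len(traces))
```

Returning the traces next to the score is the design point: a caller can feed them back as critique instead of collapsing everything to the scalar.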
Another thread argues the score should be *decomposed* rather than collapsed. Training models to ask good clarifying questions works far better when 'quality' is broken into theory-grounded attributes (clarity, relevance, specificity) than when optimized against one combined score, especially in high-stakes domains like clinical reasoning (Can models learn to ask genuinely useful clarifying questions?). Multi-agent evaluation extends this laterally: instead of one judge emitting one number, stakeholder personas extracted from real documents debate across structured phases, producing reproducible judgments that transfer across tasks (Can personas extracted from documents generalize across evaluation tasks?). Evaluation becomes a process with structure, not a point estimate.
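The difference between a combined score and decomposed attributes can be sketched as follows. The attribute names come from the text; the numeric scores, the equal-weight average, and the helper names are assumptions made for illustration and do not reproduce the ALFA training pipeline.

```python
ATTRIBUTES = ("clarity", "relevance", "specificity")


def combined_score(scores: dict[str, float]) -> float:
    """Collapse all attributes into one number (equal weights assumed)."""
    return sum(scores[a] for a in ATTRIBUTES) / len(ATTRIBUTES)


def attribute_preferences(a: dict[str, float], b: dict[str, float]) -> dict[str, str]:
    """Per attribute, report which candidate is preferred: 'a', 'b', or 'tie'."""
    prefs = {}
    for attr in ATTRIBUTES:
        if a[attr] > b[attr]:
            prefs[attr] = "a"
        elif a[attr] < b[attr]:
            prefs[attr] = "b"
        else:
            prefs[attr] = "tie"
    return prefs


# Two hypothetical clarifying questions with made-up attribute scores.
question_a = {"clarity": 0.9, "relevance": 0.3, "specificity": 0.6}
question_b = {"clarity": 0.3, "relevance": 0.9, "specificity": 0.6}

print(combined_score(question_a), combined_score(question_b))
print(attribute_preferences(question_a, question_b))
```

The combined scores tie, so a single-score comparison yields no preference signal, while the per-attribute view still produces one usable pair for clarity and one for relevance.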
On the 'interaction histories' half of the question, the corpus offers a useful complication. You might assume more history is always better, but for personalization the opposite holds: abstract preference summaries (semantic memory) consistently beat replaying specific past interactions (episodic memory) (Does abstract preference knowledge outperform specific interaction recall?). So the win isn't raw trajectory data; it's the *distilled* signal a trajectory lets you compute. Relatedly, models can be trained to internalize self-evaluation, learning during training to compute their own reward rather than deferring to an external scorer, with no added inference cost (Can models learn to evaluate their own work during training?).
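The episodic versus semantic distinction can be illustrated with a toy interaction log. Everything here is hypothetical: the log fields, the genre-count summary, and the function names are invented for the sketch and are not taken from the PRIME framework.

```python
from collections import Counter

# Hypothetical interaction log; field names are illustrative only.
history = [
    {"item": "slow-burn thriller", "genre": "thriller", "liked": True},
    {"item": "slapstick comedy", "genre": "comedy", "liked": False},
    {"item": "heist thriller", "genre": "thriller", "liked": True},
    {"item": "nature documentary", "genre": "documentary", "liked": True},
]


def episodic_context(log: list[dict], k: int = 2) -> list[dict]:
    """Episodic memory: replay the k most recent raw interactions."""
    return log[-k:]


def semantic_summary(log: list[dict]) -> dict[str, list[str]]:
    """Semantic memory: distill the log into an abstract preference summary."""
    liked = Counter(e["genre"] for e in log if e["liked"])
    disliked = Counter(e["genre"] for e in log if not e["liked"])
    return {
        "prefers": [genre for genre, _ in liked.most_common()],
        "avoids": [genre for genre, _ in disliked.most_common()],
    }


recent = episodic_context(history)
summary = semantic_summary(history)
print(recent)
print(summary)
```

The episodic view hands the model two raw events and drops the disliked comedy entirely, whereas the summary keeps the preference ordering and the avoidance in a few tokens.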
The synthesis: 'replace' is the wrong frame, but 'subsume' is right. Single-answer scoring survives as a cheap final check, yet across reward modeling, RL feedback, question-asking, and judging, the trajectory-and-process approaches don't just add accuracy. They recover information the scalar deletes, and they catch failures (rationalization, style-mimicry, plateau-stalling) that a single number is structurally unable to see. The less obvious lesson: sometimes the richest signal isn't keeping the whole history at all, but knowing which part of it to abstract away.
Sources (8 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.