Why do explicit quality criteria outperform learning quality from examples alone?
This explores why telling a model the explicit rules of what 'good' looks like (frameworks, checklists, named attributes) tends to beat showing it a pile of labeled good/bad examples and hoping it infers the rules — and what the corpus says about why example-only learning quietly fails.
The short version the corpus keeps circling back to: when a model learns from examples, it grabs the easiest-to-detect surface pattern, not the underlying principle. Fine-tuning a model to assess argument quality from labeled data, for instance, teaches it to recognize the look of strong arguments in the training set, but that skill fails to transfer to new argument types; explicit instruction with theoretical frameworks like RATIO or QOAM is what carries the criteria across to unfamiliar cases (see "Can models learn argument quality from labeled examples alone?"). A deeper version of the same problem shows up in chain-of-thought work: logically invalid reasoning chains score nearly as well as valid ones, because the model is imitating the form of reasoning, not the inference (see "Does logical validity actually drive chain-of-thought gains?"). Examples teach shape; criteria teach substance.
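To make the contrast concrete, here is a minimal sketch of the two ways of asking a model to judge an argument: one prompt that only shows labeled examples, and one that states the criteria by name. Everything in it is an assumption for illustration. The `Criterion` class, both function names, and the three criteria are hypothetical placeholders, and they are not the actual contents of RATIO or QOAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the judge should check, stated explicitly


# Hypothetical placeholder criteria; NOT the actual contents of RATIO or QOAM.
CRITERIA = [
    Criterion("relevance", "Do the premises bear directly on the conclusion?"),
    Criterion("sufficiency", "Do the premises give enough support for the conclusion?"),
    Criterion("acceptability", "Would a reasonable reader accept each premise?"),
]


def example_only_prompt(labeled: list[tuple[str, str]], target: str) -> str:
    """Few-shot prompt: the model has to infer the rule from the labels."""
    shots = "\n".join(f"Argument: {text}\nLabel: {label}" for text, label in labeled)
    return f"{shots}\nArgument: {target}\nLabel:"


def criteria_prompt(criteria: list[Criterion], target: str) -> str:
    """Criteria prompt: the rule is stated, one named check per line."""
    checks = "\n".join(f"- {c.name}: {c.question}" for c in criteria)
    return (
        "Rate the argument against each criterion, then give an overall label.\n"
        f"{checks}\nArgument: {target}"
    )
```

The example-only prompt leaves the standard of quality implicit in whatever the examples happen to share, which is where surface patterns creep in. The criteria prompt names each check, so the same standard can be applied to an argument type that never appeared in the examples.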
Sources (6 notes)
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
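The decomposition idea in the notes above (scoring a response against verifiable sub-criteria instead of assigning one holistic score) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the RLCF or RaR implementation: the `Check` class, the `checklist_score` function, the weights, and the three example checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    passes: Callable[[str], bool]  # a verifiable yes/no test on the response
    weight: float = 1.0


def checklist_score(response: str, checks: list[Check]) -> tuple[float, dict[str, bool]]:
    """Return the weighted fraction of checks passed, plus per-check results."""
    if not checks:
        raise ValueError("at least one check is required")
    results = {c.name: c.passes(response) for c in checks}
    total = sum(c.weight for c in checks)
    earned = sum(c.weight for c in checks if results[c.name])
    return earned / total, results


def has_no_bullets(response: str) -> bool:
    """True when no line starts with a '-' or '*' bullet marker."""
    return not any(line.lstrip().startswith(("-", "*")) for line in response.splitlines())


# Hypothetical checks for an instruction such as:
# "Answer in under 50 words, mention dosage, and do not use bullet points."
CHECKS = [
    Check("under_50_words", lambda r: len(r.split()) < 50),
    Check("mentions_dosage", lambda r: "dosage" in r.lower(), weight=2.0),
    Check("no_bullets", has_no_bullets),
]

score, detail = checklist_score("The usual dosage is 200 mg twice daily.", CHECKS)
```

Because each check is reported separately in `detail`, a low score points at the specific criterion that failed, which is the property that makes decomposed rewards harder to satisfy with superficial artifacts than a single opaque score.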