INQUIRING LINE

Can smaller judge models better capture human preferences than larger prompted models?

This explores whether a smaller model trained as a preference judge can outdo a bigger model that's merely prompted to judge — and the corpus answers it sideways, through work on student models beating their teachers and on what 'human preference' even is.


Can a smaller, trained judge capture human preferences better than a larger model you simply prompt to evaluate? The collection has no paper that runs that comparison head-to-head as 'LLM-as-judge,' but it holds two threads that together make a case for yes.

The first thread is direct evidence that small trained models can beat large prompted ones at this kind of discrimination task. Walmart found that BERT cross-encoders, distilled from an LLM teacher, *outperformed the teacher itself* once trained on enough teacher-labeled data: the student saw a broader slice of real queries, smoothed by the teacher's soft labels, and generalized better than the model it learned from (Can smaller models outperform their LLM teachers with enough data?). The function-calling work makes the mechanism sharper: small models tuned with DPO on correct-vs-incorrect pairs from a big teacher reached high accuracy, because explicit negative examples target the rigid output-format failures where supervised fine-tuning alone underperforms (Can small models match large models on function calling?). A judge's whole job is telling good from bad, and 'trained on what bad looks like' beats 'prompted to imagine it.'
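
To make 'trained on what bad looks like' concrete, here is a minimal sketch of the standard DPO loss for a single chosen/rejected pair. It is not taken from the function-calling work: the function name, argument names, and the default `beta` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each response under the model being trained and under a frozen reference model.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are summed log-probabilities of a full response.
    beta=0.1 is an illustrative default, not a value from the source.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy already leans toward the
# correct call relative to the reference, so the loss is below log(2).
print(dpo_pair_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss falls only when the gap between the correct and incorrect response widens, so the incorrect example contributes a training signal directly rather than being left implicit.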

The second thread complicates the term you're really asking about, 'human preferences,' and this is the part worth knowing. Annotation responses don't measure one thing: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by whether they hold up across measurement conditions. Treat them uniformly and you contaminate the very signal a judge is supposed to learn (Do all annotation responses measure the same underlying thing?). A *trained* small judge can be fit to the clean signal; a prompted large model inherits whatever noise lives in its instructions. That's an under-appreciated reason size isn't the deciding variable: fidelity to the right target is.
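
One way to act on the consistency criterion is to keep only comparisons whose label holds up when the same question is asked under different conditions. The sketch below is a hypothetical filter, not a method from the cited note: the function name, the data layout, and the 0.8 agreement threshold are all assumptions, and it does not try to tell non-attitudes apart from constructed preferences.

```python
from collections import Counter

def stable_preferences(responses, min_agreement=0.8):
    """Keep comparisons whose label is consistent across conditions.

    responses maps a comparison id to the list of labels ('A' or 'B')
    one annotator gave for it under different measurement conditions,
    for example swapped order or reworded instructions.
    min_agreement is an illustrative threshold.
    """
    kept = {}
    for item_id, labels in responses.items():
        if len(labels) < 2:
            continue  # one measurement says nothing about stability
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            kept[item_id] = label
    return kept

# Hypothetical annotations: q1 is stable, q2 flips, q3 was asked once.
raw = {"q1": ["A", "A", "A"], "q2": ["A", "B", "A", "B"], "q3": ["B"]}
print(stable_preferences(raw))  # {'q1': 'A'}
```

Only the surviving labels would then be used to train the judge, which is the sense in which a trained model can be fit to the clean signal.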

The collection also hints that you may not need human labels at all. Model confidence over answer spans can stand in as the preference signal, ranking reasoning traces well enough to improve quality *and* restore calibration without any human annotator or external verifier (Can model confidence work as a reward signal for reasoning?). And preference 'capture' itself can be cheap and personal: ten adaptive questions are enough to pin down an individual's reward coefficients (Can user preferences be learned from just ten questions?), while abstract preference summaries beat replaying raw past interactions (Does abstract preference knowledge outperform specific interaction recall?). The pattern across all of these: what makes a judge good is the structure of its training signal, not its parameter count.
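
The confidence-as-preference idea can be sketched as follows: score each reasoning trace by the model's confidence in its final answer tokens, then turn clearly separated scores into chosen/rejected pairs. This is an illustration of the general idea only, not the RLSF implementation. The geometric-mean confidence measure, the `min_gap` threshold, and the dictionary keys are assumptions.

```python
import math
from itertools import combinations

def answer_confidence(answer_token_logprobs):
    """Geometric-mean probability of the answer-span tokens (an assumed measure)."""
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

def confidence_pairs(traces, min_gap=0.1):
    """Build (chosen, rejected) text pairs from traces for one prompt.

    Each trace is a dict with hypothetical keys 'text' and
    'answer_logprobs'. Pairs whose confidence differs by less than
    min_gap are skipped as too close to call.
    """
    scored = [(answer_confidence(t["answer_logprobs"]), t["text"]) for t in traces]
    pairs = []
    for (conf_1, text_1), (conf_2, text_2) in combinations(scored, 2):
        if abs(conf_1 - conf_2) >= min_gap:
            pairs.append((text_1, text_2) if conf_1 > conf_2 else (text_2, text_1))
    return pairs

# Hypothetical traces: 'a' and 'c' are both confident, 'b' is not.
traces = [
    {"text": "a", "answer_logprobs": [-0.1, -0.1]},
    {"text": "b", "answer_logprobs": [-2.0, -2.0]},
    {"text": "c", "answer_logprobs": [-0.12, -0.12]},
]
print(confidence_pairs(traces))  # [('a', 'b'), ('c', 'b')]
```

The resulting pairs have the same shape as human preference pairs, so they can feed whatever preference-training step the judge or policy already uses.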

One caution the corpus surfaces: a judge's effects aren't uniform across domains. Preference tuning reduces diversity in code but *increases* it in creative writing, because each domain rewards something different (Does preference tuning always reduce diversity the same way?). So 'better captures human preference' is domain-relative: a small judge fit to code-review preferences won't transfer cleanly to creative judgment. The reader's real takeaway is that the size question is a proxy for the question that matters, namely whether your judge is *trained on the right, cleanly-separated signal*, and on that axis, small and trained tends to beat large and prompted.


Sources (7 notes)

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Responses can then be personalized to each user via inference-time reward alignment, without weight modification.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
