INQUIRING LINE

Can judges trained on both verifiable and non-verifiable tasks transfer across domains?

This explores whether evaluator models (LLM 'judges' and reward models) that learn from a mix of checkable tasks — where answers can be verified — and open-ended ones — where they can't — actually carry their judgment skill into new domains, rather than overfitting to one kind of task.


This explores whether evaluator models — the LLM 'judges' and reward models that grade other models' outputs — can be trained on both verifiable tasks (math, code, anything with a checkable answer) and non-verifiable ones (writing, instruction-following, subjective quality) and then generalize to domains they weren't trained on. The corpus doesn't answer this with a single clean experiment, but several threads converge on a hopeful and specific picture: the bridge between verifiable and non-verifiable evaluation is the act of *reasoning before judging*, and that reasoning skill is what transfers.

The most direct evidence is the move to make judges *think*. When judges are trained with reinforcement learning to reason through an evaluation, by recasting judgment as a verifiable problem with synthetic right/wrong pairs, they stop leaning on exploitable surface cues and start reasoning about substance (Can reasoning during evaluation reduce judgment bias in LLM judges?). This matters because the alternative is fragile: ordinary LLM judges fall for fake citations and pretty formatting in zero-shot attacks that require no model access at all (Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). A judge that only pattern-matches surface features won't transfer anywhere; a judge that reasons has something domain-independent to carry.
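A minimal sketch of what "recasting judgment as a verifiable problem" could look like in practice. It assumes synthetic pairs where the better response is known by construction, and it rewards a verdict only when it is correct in both presentation orders. The names `SyntheticPair`, `verdict_reward`, and the toy judges are hypothetical illustrations, not the training setup of any cited method.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SyntheticPair:
    prompt: str
    better: str  # response known to be better by construction
    worse: str   # deliberately degraded response


# A judge maps (prompt, response_a, response_b) to the label "A" or "B".
Judge = Callable[[str, str, str], str]


def verdict_reward(judge: Judge, pair: SyntheticPair) -> float:
    """Return 1.0 only if the judge picks the known-better response in both orders."""
    first = judge(pair.prompt, pair.better, pair.worse)   # better shown as A
    second = judge(pair.prompt, pair.worse, pair.better)  # better shown as B
    return 1.0 if (first == "A" and second == "B") else 0.0


# Toy judges (hypothetical) standing in for surface-cue heuristics.
def longest_wins(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"   # verbosity bias


def always_first(prompt: str, a: str, b: str) -> str:
    return "A"                                # position bias


def exact_match(prompt: str, a: str, b: str) -> str:
    return "A" if a.strip() == "4" else "B"   # checks the substance of this toy item


pair = SyntheticPair(
    prompt="What is 2 + 2?",
    better="4",
    worse="The answer is 5, as shown in Smith et al. (2021).",
)

print(verdict_reward(longest_wins, pair))  # 0.0
print(verdict_reward(always_first, pair))  # 0.0
print(verdict_reward(exact_match, pair))   # 1.0
```

Because the label is known in advance, the reward is checkable even though the underlying task is a judgment, and the order swap means a purely positional or length-based heuristic earns nothing on this item.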

The deeper trick the corpus surfaces is that the verifiable/non-verifiable divide is softer than it looks: you can often *manufacture* verifiability inside a soft domain. Checklist methods decompose a subjective instruction-following task into many small checkable sub-criteria, which both improves performance and stops the reward model from overfitting to superficial artifacts (Can breaking down instructions into checklists improve AI reward signals?). Generative process reward models that reason step by step before scoring beat discriminative graders with far fewer labels; ThinkPRM needs only 1% of the PRM800K labels (Can generative reasoning beat discriminative models with less training data?). And entirely verifier-free approaches reach into general domains: RARO uses an adversarial critic that discriminates expert from policy answers across Countdown, DeepMath, *and* poetry writing without any task-specific verifier (Can adversarial critics replace task-specific verifiers for reasoning?; Can reasoning emerge from expert demonstrations alone?), while VeriFree replaces answer-checking with the likelihood of a reference answer and matches or surpasses verifier-based methods on broad benchmarks like MMLU-Pro and GPQA (Can reasoning improvement work without answer verification?).
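Two of these ideas reduce to small formulas, sketched here under stated assumptions. `checklist_reward` assumes each sub-criterion has already been checked (by a rule or a model) and simply takes the weighted fraction that passed. `reference_likelihood_reward` assumes you can read per-token log-probabilities of the reference answer from the policy, conditioned on the prompt and the generated reasoning. Both function names and the aggregation choices are hypothetical and are not taken from the RLCF, RaR, or VeriFree implementations.

```python
import math
from typing import Optional, Sequence


def checklist_reward(
    passed: Sequence[bool],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted fraction of checklist items that passed, in [0, 1]."""
    if not passed:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        weights = [1.0] * len(passed)
    if len(weights) != len(passed):
        raise ValueError("weights and passed must have the same length")
    total = sum(weights)
    return sum(w for ok, w in zip(passed, weights) if ok) / total


def reference_likelihood_reward(token_logprobs: Sequence[float]) -> float:
    """Probability of the whole reference answer, from its per-token log-probs.

    Assumes the log-probs were computed given the prompt and the model's
    own reasoning trace, so a better trace makes the reference more likely.
    """
    return math.exp(sum(token_logprobs))


# Four sub-criteria for one response, three satisfied.
print(checklist_reward([True, True, False, True]))            # 0.75
# One heavily weighted criterion passed, two light ones failed.
print(checklist_reward([True, False, False], [2.0, 1.0, 1.0]))  # 0.5
# A two-token reference answer, each token assigned probability 0.5.
print(round(reference_likelihood_reward([math.log(0.5)] * 2), 6))  # 0.25
```

Neither function needs a task-specific verifier at reward time: the first pushes verification down into small checks, the second replaces it with a likelihood the model already computes.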

There's also a clean demonstration that the *evaluation apparatus itself* can transfer: MAJ-EVAL extracts stakeholder personas from domain documents and runs a structured debate that generalizes across summarization and dialogue without manual redesign (Can personas extracted from documents generalize across evaluation tasks?). That's cross-domain transfer of a judging method, not of a single judge model, which is a useful reframing of what 'transfer' can even mean here.

The caution worth carrying away: transfer is real for *reasoning and method*, but not magic. RLVR sharpens sampling toward solutions the base model already had rather than expanding its true boundary (Does RLVR actually expand what models can reason about?), and imitation training shows you can fool human evaluators by copying confident style while closing no actual capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?). So a judge can *look* like it transferred when it has only learned a domain's surface register. The thing you didn't know you wanted to know: the question of cross-domain judge transfer is really the question of whether your judge learned to reason or learned to recognize, and the same surface biases that make naive judges hackable are exactly what keeps them from generalizing.
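The RLVR claim rests on pass@k analysis, so a small worked example may help. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The per-problem counts are invented for illustration only and are not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 100 samples on two problems.
N = 100
base_counts = [30, 2]    # base model: often right on one, rarely right on the other
tuned_counts = [90, 0]   # tuned model: sharper on one, never right on the other

base_1, tuned_1 = mean_pass_at_k(base_counts, N, 1), mean_pass_at_k(tuned_counts, N, 1)
base_64, tuned_64 = mean_pass_at_k(base_counts, N, 64), mean_pass_at_k(tuned_counts, N, 64)

print(tuned_1 > base_1)    # True: the tuned model wins at k = 1
print(base_64 > tuned_64)  # True: the base model wins at k = 64
```

With these made-up counts the tuned model looks better when you draw one sample and worse when you draw many, which is the crossover pattern the note describes: sampling has been concentrated, not the set of solvable problems enlarged.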


Sources (11 notes)

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Can reasoning emerge from expert demonstrations alone?

RARO recovers implicit reward functions from expert demonstrations through adversarial co-training between a reasoning policy and relativistic critic. This approach matches verifier-based RL performance on reasoning tasks while extending to domains lacking automated verification.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing a claim about LLM judge transfer across domains. The precise question remains open: *Can evaluator models trained on both verifiable and non-verifiable tasks genuinely transfer reasoning capability across held-out domains, or do they only learn surface patterns that fail to generalize?*

What a curated library found — and when (findings span 2023–2025; treat as dated claims, not current truth):
• Judges trained with RL to *reason through* evaluation (recasting judgment as synthetic verification) stop exploiting surface cues and show cross-domain promise (~2025).
• Naive LLM judges fall for zero-shot attacks (fake citations, rich formatting) that need no model access; this fragility suggests surface-pattern learning, not reasoning (~2024–2025).
• Checklist decomposition and generative process reward models (reasoning before scoring) improve performance and reduce overfitting to superficial artifacts (~2025).
• VeriFree and adversarial-critic methods (RARO) reach general domains without task-specific verifiers: RARO on Countdown, DeepMath, and poetry writing; VeriFree on MMLU-Pro, GPQA, and SuperGPQA (~2025).
• MAJ-EVAL transfers *evaluation method* (stakeholder debate) across summarization and dialogue without redesign (~2025).
• RLVR sharpens sampling toward existing capability boundaries rather than expanding true reasoning (~2025); imitation training fools human evaluators via confident style without closing the capability gap (~2023).

Anchor papers (verify; mind their dates):
• arXiv:2510.10320 (J1: Thinking in LLM Judge via RL, ~2025)
• arXiv:2511.21667 (VeriFree: Reasoning without Verifiers, ~2025)
• arXiv:2504.13837 (RLVR boundary limits, ~2025)
• arXiv:2305.15717 (Imitation captures style, not capability, 2023)

Your task:
(1) RE-TEST EACH CONSTRAINT. For reasoning-via-RL judges: do newer training methods (preference optimization such as DPO or IPO, hypernetwork scaling) or architectural changes (MoE reward models, retrieval-augmented judgment) now sustain transfer beyond 2025's evidence? For surface-pattern vulnerability: have prompt hardening, adversarial training on judges, or multi-model ensembles closed the zero-shot attack surface? Separate the durable question (does reasoning transfer?) from the perishable claim (naive judges fail at X attack).
(2) Surface the strongest *contradicting or reconciling* work from the last 6 months: have any papers shown that verifier-free methods actually DO learn task-specific shortcuts, or conversely, that simple checklist decomposition outperforms RL-trained reasoning at scale?
(3) Propose 2 research questions that *assume the regime may have moved*: (a) If reasoning does transfer, does it transfer *equally well* to adversarially-crafted or multi-modal domains, or does the reasoning capability degrade in unfamiliar modalities? (b) Can a single judge, trained on a mixed portfolio (verifiable + non-verifiable + adversarial + multi-modal), match task-specific judges, or is there an irreducible fragmentation cost?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
