How can judges evaluate thinking without seeing the actual thoughts?

This explores the gap between what a judge can observe—final outputs, surface features—and the hidden computation that produced them, and what tools the corpus offers for grading reasoning you can't directly read.

This explores how anyone—human or AI—can score the quality of *thinking* when the actual thought process is hidden, either because it happened in latent space or because the judge only ever sees the polished output. The corpus turns out to have a surprisingly rich answer, and it starts with why the naive approach fails.

The core problem is that judges grade what they can see, and what they can see is exploitable. LLM evaluators systematically reward fake citations, confident tone, and rich formatting independent of whether the content is any good, and these biases can be triggered without any access to the model's internals (see: Can LLM judges be tricked without accessing their internals?). The same trap catches humans: imitation models that merely mimic ChatGPT's fluent, confident style fool human evaluators into thinking capability improved when factuality didn't budge at all (see: Can imitating ChatGPT fool evaluators into thinking models improved?). So 'evaluate the visible answer' isn't a neutral fallback; it actively rewards the appearance of thought over the thing itself.
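One way to make this failure concrete is a perturbation probe: score an answer, append decoration that adds no content, and score it again. The sketch below is illustrative only. `toy_judge` is a hypothetical stand-in for a real LLM judge, and its scoring rules are assumptions chosen to imitate the biases described above, not measurements from any study.

```python
from typing import Callable


def surface_bias_gap(judge: Callable[[str], float], answer: str, decoration: str) -> float:
    """Score change caused by appending content-free decoration to an answer."""
    return judge(answer + decoration) - judge(answer)


def toy_judge(text: str) -> float:
    """Hypothetical judge that rewards surface cues (assumed behavior, for illustration)."""
    score = 5.0
    if "[1]" in text:  # anything that looks like a citation marker
        score += 1.5
    if "**" in text:  # anything that looks like bold formatting
        score += 1.0
    return score


answer = "Water boils at 100 degrees Celsius at sea level."
fake_citation = " [1] (placeholder reference, adds no information)"

# A nonzero gap means the judge moved its score without any change in content.
gap = surface_bias_gap(toy_judge, answer, fake_citation)
print(f"score gap from fake citation: {gap}")
```

With a real judge, the same probe would be averaged over many answers and several decoration types. A judge that tracks content should show a gap near zero.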

This matters more once you realize the thoughts may genuinely be invisible. Depth-recurrent and compressed-token architectures solve hard reasoning tasks entirely in hidden computation: a 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly with no verbalized chain-of-thought at all, where chain-of-thought methods scored zero (see: Can models reason without generating visible thinking steps?). And even when a model *does* write out its reasoning, the visible trace isn't trustworthy. Chain-of-thought accuracy is driven partly by raw output probability and memorization rather than genuine inference (see: What three separate factors drive chain-of-thought performance?), and more reasoning tokens can actively hurt: in one benchmark, raising the thinking budget from roughly 1,100 to roughly 16K tokens cut accuracy from 87.3% to 70.3% (see: Does more thinking time always improve reasoning accuracy?). The words on the page are not the thinking.

The corpus's interesting move is to evaluate reasoning by its *structure and traces* rather than its content. One line of work proposes measurable properties (traceability, counterfactual adaptability, and motif compositionality) that test whether an agent reasons causally or just produces coherent-sounding speech (see: Can we measure reasoning quality beyond output plausibility?). Another reads the model's own layers: a 'deep-thinking ratio' measures the proportion of tokens whose predictions get significantly revised as they pass through the network, which correlates with accuracy and lets you measure reasoning effort without ever reading a thought (see: Can we measure how deeply a model actually reasons?). Structural scrutiny also exposes fake reasoning from the other direction: theory-of-mind benchmarks turn out to be solvable by pure pattern-matching, so a judge looking only at correct answers would be fooled into crediting reasoning that never occurred (see: Can language models solve ToM benchmarks without real reasoning?).
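To show the flavor of a layer-wise signal, here is a minimal sketch of a deep-thinking-style ratio. The source does not spell out how 'significant revision' is defined, so the rule here is an assumption made for illustration: a token counts as deep-thinking when its top prediction only settles at or after a chosen fraction of the network's depth. The names `settle_layer`, `deep_thinking_ratio`, and `depth_fraction` are hypothetical, and the per-layer predictions are toy data rather than real model outputs.

```python
from typing import Sequence


def settle_layer(layer_preds: Sequence[str]) -> int:
    """Index of the earliest layer from which the top prediction never changes again."""
    final = layer_preds[-1]
    idx = len(layer_preds) - 1
    while idx > 0 and layer_preds[idx - 1] == final:
        idx -= 1
    return idx


def deep_thinking_ratio(tokens: Sequence[Sequence[str]], depth_fraction: float = 0.5) -> float:
    """Fraction of tokens whose prediction settles late (assumed definition, see lead-in)."""
    if not tokens:
        return 0.0
    deep = sum(
        1 for preds in tokens
        if settle_layer(preds) >= depth_fraction * len(preds)
    )
    return deep / len(tokens)


# Toy data: each inner list is one token's top prediction at layers 0..3.
example = [
    ["the", "the", "the", "the"],  # settled immediately
    ["cat", "dog", "dog", "dog"],  # settled early, at layer 1
    ["7", "9", "4", "12"],         # revised until the last layer
    ["a", "a", "b", "b"],          # settled at layer 2
]

ratio = deep_thinking_ratio(example)
print(f"deep-thinking ratio: {ratio}")
```

The point of the design is that nothing here reads the content of the reasoning. The score depends only on how late each prediction stabilizes.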

A third strategy is to make the judge itself think. Training evaluators with reinforcement learning to reason through their verdicts, rather than snap to surface cues, directly suppresses the authority, verbosity, position, and beauty biases that plague shallow judges (see: Can reasoning during evaluation reduce judgment bias in LLM judges?). There's a subtlety worth knowing: thinking doesn't automatically help. Untrained models use extended deliberation counterproductively, spiraling into self-doubt that degrades their judgment, and only RL training flips that same mechanism into productive analysis (see: Does extended thinking help or hurt model reasoning?). So the honest answer to the question is layered: you can't grade hidden thoughts by reading them, but you can grade them by their structural fingerprints, by internal layer-wise signals, and by handing the judge a reasoning process of its own. Each of these sidesteps the trap of mistaking confident style for genuine thought.
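The cited note describes converting judgment tasks into verifiable problems using synthetic data pairs. A minimal sketch of what such a reward signal could look like follows, assuming each pair has a known better response. Requiring the verdict to hold in both presentation orders is an added assumption here, included to show how position bias could be penalized. It is not a description of the actual training recipe, and all judges below are hypothetical toys.

```python
from typing import Callable

# A pairwise judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def verifiable_reward(judge: Judge, prompt: str, better: str, worse: str) -> float:
    """Reward 1.0 only if the judge picks the known-better response in both orderings."""
    picked_when_first = judge(prompt, better, worse) == "A"
    picked_when_second = judge(prompt, worse, better) == "B"
    return 1.0 if picked_when_first and picked_when_second else 0.0


def position_biased_judge(prompt: str, a: str, b: str) -> str:
    """Toy judge that always prefers whichever response is shown first."""
    return "A"


def verbosity_biased_judge(prompt: str, a: str, b: str) -> str:
    """Toy judge that always prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


def content_judge(prompt: str, a: str, b: str) -> str:
    """Toy judge that checks the answer itself (hard-coded for this one prompt)."""
    return "A" if "4" in a else "B"


prompt = "What is 2 + 2?"
better = "4"
worse = "The answer, after careful consideration of many factors, is 5."

rewards = {
    "position_biased": verifiable_reward(position_biased_judge, prompt, better, worse),
    "verbosity_biased": verifiable_reward(verbosity_biased_judge, prompt, better, worse),
    "content": verifiable_reward(content_judge, prompt, better, worse),
}
print(rewards)
```

Because the correct verdict is known in advance, the reward can be checked mechanically, which is what makes reinforcement learning on the judge's reasoning feasible in this kind of setup.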

Sources (10 notes)

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we measure reasoning quality beyond output plausibility?

Research identifies traceability, counterfactual adaptability, and motif compositionality as testable measures of human-like reasoning. These structural properties reveal whether an agent genuinely reasons causally or merely mimics coherent speech.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
