How do local soundness signals work across different problem domains?
This reads 'local soundness signals' as step-level checks on reasoning quality — confidence, validity, calibration measured at each move rather than averaged over a whole answer — and asks whether those local checks hold up across different kinds of problems.
The question is whether measuring soundness move by move, rather than judging only the final answer, travels across problem domains. The corpus's strongest claim is that it does, and that the local view catches what aggregate scoring hides. Step-level confidence filtering beats global confidence averaging because averaging across a whole trace washes out a single broken step; tracking confidence locally catches the breakdown and even allows stopping a trace early, before it finishes (see: Does step-level confidence outperform global averaging for trace filtering?). The same logic appears as a temporal signal: models that lock onto an answer early and then rationalize show measurably worse reasoning, and rewarding *gradual* confidence growth instead of premature commitment improved Countdown accuracy by 42 percentage points, without any process labels or external reward model (see: Can confidence trajectories reveal when reasoning goes wrong?).
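The contrast between global averaging and step-level tracking can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the per-step confidence values are invented, and the function names, window size, and thresholds are hypothetical rather than taken from the cited note.

```python
from statistics import mean


def global_confidence(step_conf):
    """Average confidence over the whole trace (the aggregate view)."""
    return mean(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Lowest sliding-window mean: the weakest local stretch of the trace."""
    if len(step_conf) < window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def early_stop_index(step_conf, window=3, threshold=0.5):
    """Index of the first step whose trailing window mean falls below
    the threshold, or None if the trace never dips that low."""
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1
    return None


# Hypothetical per-step confidences (invented for illustration).
steady = [0.80, 0.78, 0.82, 0.79, 0.81, 0.80, 0.80, 0.80]
broken = [0.95, 0.94, 0.96, 0.20, 0.25, 0.30, 0.95, 0.95, 0.96, 0.94]

for name, trace in [("steady", steady), ("broken", broken)]:
    print(
        name,
        round(global_confidence(trace), 3),
        round(lowest_window_confidence(trace), 3),
        early_stop_index(trace),
    )
```

Under these made-up numbers, both traces clear a global cutoff of 0.7, but only the broken trace has a local window far below it, and the trailing-window check flags it partway through instead of after the final step.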
Why local rather than global? Because aggregate accuracy misleads, and it misleads the same way in every domain. In medical triage, legal interpretation, and financial planning, fluent, confident wrong answers concentrate in exactly the rare, high-harm cases, and overall accuracy looks strong the whole time because those cases are rare (see: Why do confident wrong answers hide in standard accuracy metrics?). The same pattern shows up at the representation level: a model can carry every linearly decodable feature a task needs while its internal organization is fundamentally fractured, leaving it brittle to perturbation in ways no top-line metric reveals (see: Can models be smart without organized internal structure?). And in long delegated workflows, frontier models silently corrupt ~25% of document content in tests spanning 52 domains, with errors compounding through 50 round-trips without plateauing (see: Do frontier LLMs silently corrupt documents in long workflows?). The throughline: harm hides at the local level, so you have to look there.
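The arithmetic behind the masking effect is simple enough to sketch. The case counts below are hypothetical, chosen only to show how a small high-harm stratum can fail badly while the headline number stays high; they are not figures from the cited notes, and the stratum labels are illustrative.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of all cases answered correctly."""
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_stratum(records):
    """Accuracy computed separately for each stratum label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for stratum, ok in records:
        totals[stratum] += 1
        correct[stratum] += int(ok)
    return {s: correct[s] / totals[s] for s in totals}


# Hypothetical evaluation set: 980 routine cases, 20 rare high-harm cases.
records = (
    [("routine", True)] * 950
    + [("routine", False)] * 30
    + [("rare_high_harm", True)] * 6
    + [("rare_high_harm", False)] * 14
)

print("overall:", overall_accuracy(records))
for stratum, acc in accuracy_by_stratum(records).items():
    print(stratum, round(acc, 3))
```

With these invented counts the overall figure is above 95% while the rare stratum is right less than a third of the time, which is the shape of failure the paragraph describes: the top-line metric barely moves because the harmful cases are too few to weigh on it.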
The catch, and the most interesting thing here, is that the most obvious local signal, the reasoning trace itself, is largely unreliable as evidence of soundness. Logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones, meaning the model learns the *form* of reasoning, not genuine inference; the trace's apparent soundness is not the source of the gains (see: Does logical validity actually drive chain-of-thought gains?). Worse, reflection is mostly confirmatory theater: across eight models, reflections rarely change the initial answer, traces don't faithfully represent the actual reasoning, and calibration *degrades* under binary reward training (see: Can we actually trust reasoning model outputs?). So a soundness signal read off the surface of the reasoning text is unreliable; the signals that work (step confidence, confidence trajectory) are statistical properties of the generation, not the model's self-narration.
This points to a real design split. Cheap intrinsic signals like confidence trajectory generalize across domains and need no labels, but they measure *the model's own certainty*, which can be confidently wrong. The alternative is to externalize the check: agent-based evaluation with active evidence collection cut judge shift on complex tasks from 31% for a plain LLM-as-a-Judge to 0.27%, roughly a hundredfold reduction. Its memory module nevertheless cascaded errors, showing that even external checkers need error isolation to stay sound (see: Can agents evaluate AI outputs more reliably than language models?). And underneath all of it sits a sampling fact worth keeping: even a deterministic, zero-temperature output is still one draw from a distribution, so a 'consistent' answer is not a 'reliable' one (see: Does setting temperature to zero actually make LLM outputs reliable?). Across domains, then, local soundness signals work best when they're statistical and step-resolved rather than narrative, and when you treat the model's confidence as evidence to be checked, not a verdict to be trusted.
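The gap between consistency and reliability can be shown with a toy sketch. It assumes a hypothetical answer distribution that a real model would not expose this directly, and it is not the McDonald's omega procedure mentioned in the source notes; the function names and probabilities are invented for illustration.

```python
import random
from collections import Counter


def greedy_answer(dist):
    """Zero-temperature decoding: always return the most probable answer."""
    return max(dist, key=dist.get)


def sampled_agreement(dist, n, seed=0):
    """Draw n answers from the distribution and report the most common
    answer together with the fraction of draws that agree with it."""
    rng = random.Random(seed)
    answers = rng.choices(list(dist), weights=list(dist.values()), k=n)
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n


# Hypothetical answer distribution for one prompt: nearly a coin flip.
dist = {"A": 0.52, "B": 0.48}

greedy_runs = [greedy_answer(dist) for _ in range(100)]
print("distinct greedy answers:", set(greedy_runs))
print("sampled agreement:", sampled_agreement(dist, n=1000))
```

Greedy decoding returns the same answer on every repetition, so it looks perfectly consistent, yet sampling from the same distribution shows the model is close to evenly split. Repeating a deterministic run measures the decoder, not the strength of the model's preference.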
Sources (9 notes)
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Models that commit to answers early and then rationalize show measurably flawed reasoning. Rewarding gradual confidence growth via RL improves accuracy significantly (by 42 percentage points on Countdown) without needing process labels or external reward models.
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.