What makes thought identifiability provable without auxiliary training data?

This reads the question as: can we locate and trigger a model's reasoning *inside what it already learned*, without bolting on extra training data, and why is that demonstrable rather than just plausible?

The corpus makes a surprisingly strong case that "thought" can be identified and switched on inside a model using only what's already in its weights, with no auxiliary training data. The unifying claim is that post-training doesn't *create* reasoning; it *selects* reasoning that the base model already contains. The anchor note is "Do base models already contain hidden reasoning ability?": five independent interventions (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all unlock the same latent capability. That is convergent empirical evidence rather than formal proof, but it is the kind that turns a hunch into something close to proof. If five unrelated keys open the same door, the room was already there.

What makes it *identifiable*, not just present, is that the reasoning turns out to live at a specific, manipulable address. "Can we trigger reasoning without explicit chain-of-thought prompts?" shows that reasoning features identified by a sparse autoencoder (SAE) can be steered to match or beat chain-of-thought across six model families, activating early in generation and overriding surface instructions. Reasoning isn't smeared diffusely across the network; it's a feature you can point at. "Can we steer reasoning toward brevity without retraining?" sharpens this: even the *style* of reasoning (verbose vs. terse) is captured by a single steering vector, extractable from 50 paired examples and fully training-free. When a behavior collapses to a direction in activation space, you've effectively shown that you've identified it: you can add it, subtract it, and watch the output move.
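The "add it, subtract it" idea can be illustrated with a toy sketch. This is not the method from the cited notes or any library's API: it assumes synthetic vectors in place of real model activations, plants the "behavior" as a known shift along one dimension, and recovers it as a difference of means over 50 pairs. The function names (`steering_direction`, `steer`, `projection`) are hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(with_behavior, without_behavior):
    """Difference of means between the two activation sets."""
    mu_pos = mean_vector(with_behavior)
    mu_neg = mean_vector(without_behavior)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(activation, direction, alpha):
    """Add (alpha > 0) or subtract (alpha < 0) the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Unnormalized dot product of an activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Assumption: synthetic stand-ins for hidden activations (50 pairs, 8 dims).
# The "behavior" is planted as a +2.0 shift along dimension 3.
rng = random.Random(0)
DIM, PAIRS = 8, 50
without_behavior = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(PAIRS)]
with_behavior = [
    [x + (2.0 if i == 3 else 0.0) for i, x in enumerate(row)]
    for row in without_behavior
]

direction = steering_direction(with_behavior, without_behavior)

base = [0.0] * DIM
pushed = steer(base, direction, alpha=1.0)   # move toward the behavior
pulled = steer(base, direction, alpha=-1.0)  # move away from it
```

In this toy setup the recovered direction is concentrated on the planted dimension, and the projection of `pushed` and `pulled` onto it has opposite signs. With a real model, the vectors would come from hidden states at a chosen layer, and whether a clean direction exists is an empirical question.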

The "without auxiliary training data" part is where energy-based and latent approaches come in. "Can energy minimization unlock reasoning without domain-specific training?" reaches deliberative, System-2-style thinking purely by minimizing an energy score at inference, with no domain-specific scaffolding and no labeled reasoning traces. "Can models reason without generating visible thinking tokens?" and "Can models reason without generating visible thinking steps?" go further: a 27M-parameter recurrent model solves Sudoku-Extreme and 30×30 mazes by iterating hidden states, while chain-of-thought methods score zero. The reasoning happens in continuous latent space and never has to be spelled out in tokens. Verbalization, these notes argue, is a training artifact, not a requirement of thinking.
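Inference by energy minimization can be sketched in a few lines. This is a toy illustration, not the Energy-Based Transformer architecture: the energy function is a hand-written quadratic chosen as an assumption (low when the prediction `y` is close to `x**2`), whereas a real model would learn it, and the names `energy`, `energy_grad_y`, and `infer` are hypothetical.

```python
def energy(x, y):
    """Hypothetical energy: low when y is compatible with x (y close to x**2)."""
    return (y - x * x) ** 2

def energy_grad_y(x, y):
    """Analytic gradient of the energy with respect to the prediction y."""
    return 2.0 * (y - x * x)

def infer(x, y0=0.0, step=0.1, n_steps=50):
    """Refine a candidate prediction by gradient descent on the energy."""
    y = y0
    trace = [energy(x, y)]  # energy before any refinement
    for _ in range(n_steps):
        y -= step * energy_grad_y(x, y)
        trace.append(energy(x, y))
    return y, trace

# More refinement steps spend more inference compute on the same input.
y_fast, _ = infer(x=3.0, n_steps=5)
y_slow, trace = infer(x=3.0, n_steps=50)
```

The design point is that the prediction is the result of an optimization run at inference time, so the number of steps acts as a "thinking budget": here `y_slow` ends up closer to the low-energy answer than `y_fast`, and the energy in `trace` falls at every step.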

Here's the part you might not have expected to care about: the same corpus that makes the case for latent thought being *real* also argues that the visible thought is partly *fake*. "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?" show that chain-of-thought reproduces familiar reasoning *forms* from training and degrades predictably off-distribution, producing output that is fluent but logically inconsistent. And "Can minimal reasoning chains match full explanations?" finds that 92.4% of CoT tokens do style and documentation work, not computation. So the written-out reasoning is the unreliable, data-hungry surface; the steerable latent feature is the robust, data-free signal. That inversion (the visible explanation is the imitation, the hidden direction is the real thing) is what makes identifiability provable without auxiliary data. You're not training a model to reason; you're locating a capability that was already there and showing you can flip it like a switch.

If you want the deeper philosophical edge of this, "Do large language models genuinely simulate mental states?" and "Can we defend modest mental attributions to large language models?" ask the harder question lurking underneath: even if you can *identify* and steer an internal "thought" direction, does locating it mean the model is genuinely thinking, or just that you've found the lever for a very good imitation?

Sources 11 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can we defend modest mental attributions to large language models?

Both robustness and etiological deflationist arguments beg the question against inflationism. A graded approach ascribing metaphysically undemanding states like beliefs and desires—while withholding consciousness claims—mirrors how we treat non-human animals.
