Can you steer reasoning by directly manipulating SAE features?

This explores whether you can turn reasoning on (or shape it) by reaching inside a model and directly nudging the specific internal features that sparse autoencoders (SAEs) have isolated — rather than coaxing reasoning out through prompting.

The short answer from the corpus is yes, and the result is more surprising than it sounds. Steering a *single* SAE-identified reasoning feature can match or even beat chain-of-thought prompting across six different model families (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). The steered reasoning mode kicks in early in generation and even overrides surface-level instructions, meaning the model 'decides' to reason from an internal switch, not from the words you fed it.
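To make "steering a feature" concrete, here is a minimal sketch of the usual additive form: take the direction an SAE associates with a feature and add a scaled copy of it to a hidden state. Everything in it is an assumption for illustration: the 4-dimensional vectors, the name `reasoning_direction`, and the strength `alpha` are hypothetical, and the corpus notes do not specify how the steering was implemented. A real setup would do this on tensors inside a model's forward pass.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised feature direction to one hidden state."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

# Hypothetical residual-stream activation and a stand-in for an SAE decoder direction.
hidden = [0.5, -1.0, 0.25, 2.0]
reasoning_direction = [1.0, 0.0, 1.0, 0.0]

steered = steer(hidden, reasoning_direction, alpha=4.0)

# Projection onto the feature direction before and after steering.
before = dot(hidden, unit(reasoning_direction))
after = dot(steered, unit(reasoning_direction))
```

The projection onto the feature direction rises by `alpha`, while coordinates orthogonal to that direction are left untouched. That is the sense in which steering is a targeted nudge, not a rewrite of the whole activation.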

The deeper payoff is what this implies about where reasoning lives. If flipping one latent feature unlocks reasoning, the capability was already sitting in the weights, waiting. That's exactly the convergent story the corpus tells: five independent methods (RL steering, critique fine-tuning, decoding tricks, SAE feature steering, and RLVR) all elicit reasoning that's *already present* in base-model activations (see "Do base models already contain hidden reasoning ability?"). SAE steering is one doorway into a room the model already built. The bottleneck is elicitation, not teaching. This reframes post-training too: RL appears to teach a model *when* to reason rather than *how*, since reasoning vectors exist before any RL and hybrid models recover 91% of the gains just by routing tokens (see "Does RL post-training create reasoning or just deploy it?").

SAE steering is the sharpest version of a broader truth: reasoning behaviors often correspond to *linear directions* you can extract and push on. You can steer reasoning toward brevity by pulling a single vector from 50 paired examples, cutting chain-of-thought length by 67% with no retraining and no loss of accuracy (see "Can we steer reasoning toward brevity without retraining?"). So 'whether to reason' and 'how verbosely to reason' both turn out to be manipulable directions in activation space, a strong hint that these are organized, accessible features rather than emergent fog.
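One common way to pull a single direction out of paired examples is a difference of means: average the activations for each side of the pair and subtract. The sketch below assumes that recipe; the notes do not say that Activation-Steered Compression computes its vector exactly this way, and the tiny 3-dimensional activations, the function names, and the `strength` parameter are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from the average verbose activation to the average concise one."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]

def apply_steering(hidden, vector, strength=1.0):
    """Shift one hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Two hypothetical pairs; the source describes using 50.
verbose_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
concise_acts = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]

vec = steering_vector(verbose_acts, concise_acts)
shifted = apply_steering([2.0, 2.0, 0.0], vec)
```

In this toy data the first coordinate varies between examples but not between styles, so it cancels out of `vec`. That is the appeal of pairing: whatever the two versions share drops out, and only the verbose-to-concise difference remains.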

But here's the twist worth sitting with: directly steerable features don't guarantee a clean internal organization underneath. A model can hold all the linearly decodable features a task needs while its actual internal structure is fractured, with perfect accuracy masking representations that shatter under perturbation (see "Can models be smart without organized internal structure?"). So steering a feature and getting good output doesn't prove the model reasons coherently; it may just prove that feature is decodable. That caution pairs with evidence that chain-of-thought itself is often imitation of reasoning *form*: invalid reasoning chains score nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and CoT degrades predictably under distribution shift, the signature of pattern-matching rather than genuine inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?").
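A toy illustration of why "linearly decodable" is a weaker property than "robustly organized": in the sketch below a fixed linear probe reads the label perfectly, yet the label lives in a coordinate with a tiny margin, so a small shift breaks it. The representations, labels, probe weights, and perturbation size are invented for illustration and are not drawn from the cited notes.

```python
def probe(rep, weights):
    """Linear probe: predict 1 if the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, rep))
    return 1 if score > 0 else 0

def accuracy(reps, labels, weights):
    hits = sum(probe(r, weights) == y for r, y in zip(reps, labels))
    return hits / len(labels)

# Hypothetical representations: the label is encoded in coordinate 0,
# but only with a margin of 0.01; coordinate 1 is large and label-irrelevant.
reps = [[0.01, 5.0], [-0.01, 4.0], [0.01, -3.0], [-0.01, -6.0]]
labels = [1, 0, 1, 0]
weights = [1.0, 0.0]

clean_acc = accuracy(reps, labels, weights)

# A small shift along the label coordinate, tiny compared with the scale of coordinate 1.
perturbed = [[x0 - 0.02, x1] for x0, x1 in reps]
perturbed_acc = accuracy(perturbed, labels, weights)
```

On clean inputs the probe is perfect; after the shift it falls to chance on this balanced set. A standard accuracy metric computed on the clean data would never reveal the difference.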

The thing you didn't know you wanted to know: if a single internal feature can outperform an elaborate prompting strategy, then much of what we call 'prompt engineering for reasoning' may be an indirect, lossy way of toggling switches we could flip directly — and the models we use today are quietly carrying reasoning capacity we mostly don't activate.

Sources 7 notes

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
