Can structured workflows unlock latent reasoning abilities that raw models don't show?

This explores whether structured scaffolds (tools, prompts, graphs, decoding tricks) can draw out reasoning that a model possesses but does not display on its own, and what that implies about where reasoning actually lives.

The corpus comes down strongly on the side of *unlocking*: structured workflows surface reasoning that is already latent in a model rather than teaching it something new. The starting claim is striking: base models already contain the reasoning machinery, and post-training mostly *selects* it rather than creating it. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning already present in base-model activations (see "Do base models already contain hidden reasoning ability?"). A complementary line argues that RL post-training teaches a model *when* to reason, not *how*: hybrid models recover 91% of the performance gains by routing tokens alone, and the activation vectors for reasoning strategies exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?"). If that is true, the bottleneck is elicitation, not capability. And elicitation is exactly what a workflow is for.
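One way to picture what eliciting reasoning from existing activations could mean is activation steering: adding a scaled direction vector to a hidden state. The sketch below is a toy illustration under that assumption only. The vectors, the `alpha` strength, and the function names `steer` and `projection` are hypothetical, and nothing here claims to reproduce how any of the cited methods is implemented.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction (element-wise), leaving inputs untouched."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: List[float], direction: List[float]) -> float:
    """Scalar projection of hidden onto direction (how much of it is 'present')."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm


if __name__ == "__main__":
    # Toy 3-dimensional hidden state and a made-up "reasoning" direction.
    hidden = [1.0, 0.0, 2.0]
    direction = [0.0, 1.0, 0.0]
    steered = steer(hidden, direction, alpha=2.0)
    print("before:", projection(hidden, direction))
    print("after: ", projection(steered, direction))
```

The point of the toy is that no weights change: the direction is fixed in advance, and the intervention only decides how strongly it is expressed.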

The most direct evidence is that structure alone, with zero training, moves the needle substantially. Four 'cognitive tools' implemented as isolated, sandboxed LLM calls lifted GPT-4.1 on AIME 2024 from 26.7% to 43.3%, with no RL involved (see "Can modular cognitive tools unlock reasoning without training?"). The mechanism is worth pausing on: modularity *enforces* operation isolation that plain prompting cannot guarantee. The structure is not adding intelligence; it is preventing the model from running its steps together and tripping over itself. The same pattern shows up in two other places. Forcing argument structure, by making a model name its warrants and backing (Toulmin-style), catches reasoning failures that ordinary chain-of-thought lets pass (see "Can structured argument prompts make LLM reasoning more rigorous?"). And externalizing reasoning into knowledge-graph triples lets a *small* model (GPT-4o mini) improve by 29% on GAIA Level 3 tasks, because each step becomes inspectable and correctable (see "Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?").
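The operation isolation described above can be sketched as a small orchestrator in which each tool runs in a fresh context and sees only its own instruction plus an explicit payload. This is a minimal sketch under stated assumptions: the `Model` callable interface, the tool names, and the instruction strings in `TOOL_PROMPTS` are all hypothetical stand-ins, not the prompts or tools used in the cited work.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: prompt string in, completion string out.
Model = Callable[[str], str]

# Illustrative tool instructions (assumptions, not the cited work's prompts).
TOOL_PROMPTS: Dict[str, str] = {
    "understand": "Restate the problem and list the known quantities.",
    "recall": "Name a similar solved problem and the method it used.",
    "examine": "Check the draft answer step by step and list any errors.",
    "backtrack": "Identify the first wrong step and propose an alternative.",
}


def call_tool(model: Model, tool: str, payload: str) -> str:
    """Run one tool in isolation: the prompt holds only its instruction and payload."""
    prompt = f"{TOOL_PROMPTS[tool]}\n\nINPUT:\n{payload}"
    return model(prompt)


def solve(model: Model, problem: str, plan: List[str]) -> Dict[str, str]:
    """Run tools in order; each call sees the problem and the previous output only."""
    outputs: Dict[str, str] = {}
    previous = ""
    for tool in plan:
        payload = problem if not previous else f"{problem}\n\nPREVIOUS:\n{previous}"
        previous = call_tool(model, tool, payload)
        outputs[tool] = previous
    return outputs


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return f"(stub reply to a {len(prompt)}-character prompt)"


if __name__ == "__main__":
    result = solve(stub_model, "Find x if 2x + 3 = 11.", ["understand", "examine"])
    for name, text in result.items():
        print(name, "->", text)
```

The design choice that matters is that no tool's instruction leaks into another tool's prompt: the only state carried forward is the explicit `previous` output, which is what a single long prompt cannot guarantee.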

Why does scaffolding help so much? Because much reasoning failure turns out to be organizational, not computational. Reasoning models 'wander' and 'underthink': they explore invalid paths and abandon promising ones too early. A simple decoding-level penalty against premature thought-switching improves accuracy with no fine-tuning at all (see "Why do reasoning models abandon promising solution paths?"). The viable solution was already reachable; the model just walked away from it. Structure keeps it on the path. There is also a deeper reason the capability is broadly latent: reasoning generalizes because it draws on *procedural* knowledge spread across many pretraining documents, unlike factual recall, which depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Procedural skill is diffuse and transferable, exactly the kind of thing a workflow can summon and channel.
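A decoding-level penalty against premature thought-switching can be sketched as a logit adjustment applied before each token is chosen. This is a toy version under stated assumptions: the switch-token set, the `penalty` size, the `min_thought_len` window, and the function names are hypothetical, and the logits are a plain dictionary rather than a real model's output.

```python
from typing import Dict, Set


def penalize_switches(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_in_thought: int,
    penalty: float = 3.0,
    min_thought_len: int = 20,
) -> Dict[str, float]:
    """Lower the logits of switch tokens while the current thought is still short."""
    if tokens_in_thought >= min_thought_len:
        return dict(logits)  # thought is long enough; switching is allowed
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token."""
    return max(logits, key=lambda tok: logits[tok])


if __name__ == "__main__":
    logits = {"Alternatively": 2.0, "so": 1.5, "x": 0.5}
    switches = {"Alternatively"}
    early = penalize_switches(logits, switches, tokens_in_thought=5)
    late = penalize_switches(logits, switches, tokens_in_thought=40)
    print("early in thought:", greedy_pick(early))
    print("late in thought: ", greedy_pick(late))
```

With these made-up numbers the model would have switched immediately without the penalty; with it, the switch is deferred until the current line of thought has had some room to develop.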

But there is a limit: workflows unlock what is *there*, and the corpus is blunt about what is not. Chain-of-thought may be constrained imitation of reasoning's *form* rather than genuine abstract inference: performance degrades predictably under distribution shift, which is the signature of pattern-matching, not capability (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). And when familiar semantics are stripped out of a task, performance collapses even when the model is handed the correct rules; models reason through semantic association, not symbolic logic (see "Do large language models reason symbolically or semantically?"). So structure can route, isolate, and organize latent ability, but it cannot conjure a symbolic faculty the base model never had. The honest synthesis: workflows are a lever on a real but bounded reservoir. They reliably surface dormant competence, and newer work even pushes this *into* the model: stochastic latent transitions let recursive reasoners hold uncertainty and sample parallel solution paths instead of committing too early (see "Can stochastic latent reasoning help models explore multiple solutions?" and "Can reasoning systems scale wider instead of only deeper?"). What they cannot do is manufacture reasoning the model fundamentally lacks.
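The idea of stripping familiar semantics from a task while keeping its rules intact can be illustrated with a small rewriting helper. This is a sketch under stated assumptions: the `decouple` function, the `P1`, `P2`, ... placeholder scheme, and the example sentence are hypothetical and are not the procedure used in the cited study.

```python
import re
from typing import Dict, List, Tuple


def decouple(text: str, vocabulary: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each content word with an opaque symbol, keeping the logical form."""
    mapping: Dict[str, str] = {}
    for index, word in enumerate(vocabulary, start=1):
        symbol = f"P{index}"
        mapping[word] = symbol
        # Whole-word, case-sensitive replacement so substrings are left alone.
        text = re.sub(rf"\b{re.escape(word)}\b", symbol, text)
    return text, mapping


if __name__ == "__main__":
    task = "If something is a bird then it can fly. Tweety is a bird."
    abstract, mapping = decouple(task, ["bird", "fly", "Tweety"])
    print(abstract)
    print(mapping)
```

Both versions of the task license the same inference from the same rule; only the familiar vocabulary is gone. Comparing a model's accuracy on the two versions is one way to probe the gap the paragraph above describes.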

Sources 11 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
