INQUIRING LINE

What are the six types of reasoning steps that appear in chain-of-thought?

This explores a specific claim in the corpus — that reasoning inside chain-of-thought can be sorted into six distinct kinds of steps — and what that taxonomy is for.


This question points at one paper in particular: a framework called PI (prompt intervention) that breaks the reasoning a model does mid-chain into six categories of step, then watches which ones actually matter [Can reasoning steps be dynamically pruned without losing accuracy?]. The headline result is more interesting than the list itself: using the model's own attention maps, the researchers found that some step types (notably verification and backtracking, the 'let me double-check' and 'wait, that's wrong' moves) barely get looked at by anything downstream. Keeping only the high-attention steps let them strip out roughly 75% of the reasoning while holding accuracy steady. So the six-way split isn't just bookkeeping; it's a way of asking which reasoning moves are load-bearing and which are theater.
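The selection idea can be sketched in a few lines of Python. This is a minimal illustration, not the PI implementation: the `Step` class, the `downstream_attention` field (assumed here to be one precomputed score per step), the `keep_fraction` parameter, and the 'setup' and 'derivation' labels are all hypothetical. Only 'verification' and 'backtracking' are step types named above.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    kind: str                    # step category; most labels here are placeholders
    downstream_attention: float  # assumed: precomputed attention from later tokens


def prune_steps(steps, keep_fraction=0.25):
    """Keep the highest-attention steps, preserving their original order."""
    if not steps:
        return []
    n_keep = max(1, math.ceil(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first.
    ranked = sorted(
        range(len(steps)),
        key=lambda i: steps[i].downstream_attention,
        reverse=True,
    )
    keep = set(ranked[:n_keep])
    return [s for i, s in enumerate(steps) if i in keep]


# Made-up scores for illustration only.
chain = [
    Step("Let x be the number of apples.", "setup", 0.30),
    Step("Then 3x + 2 = 14, so x = 4.", "derivation", 0.42),
    Step("Let me double-check the setup.", "verification", 0.03),
    Step("Wait, that's wrong... no, it holds.", "backtracking", 0.02),
]
kept = prune_steps(chain, keep_fraction=0.5)
print([s.kind for s in kept])
```

Ranking by score and then restoring the original order keeps the pruned chain readable as a sequence. A real system would also have to decide how attention is aggregated across heads and layers, which this sketch leaves out.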

That 'theater' worry is exactly where the rest of the corpus points. Several notes argue that chain-of-thought steps often don't cause the answer they appear to justify. Chains can fail both causal sufficiency (the steps don't always matter) and causal necessity (spurious steps creep in) [Do language models actually use their reasoning steps?], and in multi-step agent pipelines the apparent quality of a reasoning chain is only weakly correlated with whether the output is right [Does chain-of-thought reasoning actually explain AI decisions?]. A categorization that flags low-attention step types is, in effect, a tool for finding the parts of the chain that are decorative.

The more surprising thing is that PI's six categories are only one of several competing 'periodic tables' of reasoning the corpus holds. One note classifies whole reasoning topologies (chain, tree, graph) as formal graph types, where the structure isn't a metaphor but determines what the computation can express [Can reasoning topologies be formally classified as graph types?]. Another models long chains as having a 'molecular bond' structure with three interaction types (deep reasoning, self-reflection, self-exploration) and finds that mixing these from different teacher models destabilizes training [Does long chain of thought reasoning follow molecular bond patterns?]. A third decomposes CoT performance not by step type at all but by three hidden factors: raw output probability, memorization, and genuine reasoning that accumulates error with each step [What three separate factors drive chain-of-thought performance?].
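The topology distinction in the first of those notes reduces to a degree check on a directed graph of thoughts. The sketch below is a hedged illustration rather than anything taken from the note: the function name `classify_topology` and the `(parent, child)` edge format are assumptions, and the input is assumed to be connected and acyclic, which the code does not verify.

```python
from collections import Counter


def classify_topology(edges):
    """Classify a reasoning structure from directed (parent, child) edges.

    Assumes the edges describe a connected, acyclic structure (not checked).
    """
    in_deg = Counter(child for _, child in edges)
    out_deg = Counter(parent for parent, _ in edges)
    max_in = max(in_deg.values(), default=0)
    max_out = max(out_deg.values(), default=0)
    if max_in > 1:
        return "graph"  # a thought merges several predecessors (GoT-like)
    if max_out > 1:
        return "tree"   # branching without merging (ToT-like)
    return "path"       # one linear chain (CoT-like)


print(classify_topology([("a", "b"), ("b", "c")]))
print(classify_topology([("a", "b"), ("a", "c")]))
print(classify_topology([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```

The third call is the case a tree cannot express: node `d` has two parents, so partial results from `b` and `c` can be combined in a single step.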

What you didn't ask but might want: the number of categories is a research choice, not a fact about the model. Six step types, three topologies, three bond types, three performance factors: each taxonomy is built to answer a different question (what to prune, what the structure can express, what to train on, what drives the score). And a quieter line in the corpus suggests the whole exercise of categorizing visible steps may be optional: latent-reasoning models solve hard puzzles entirely in hidden computation, with no verbalized steps to categorize at all [Can models reason without generating visible thinking steps?]. The six types are best read as a map of which spoken reasoning moves earn their keep, not as the anatomy of how the model actually thinks.


Sources (7 notes)

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Does chain-of-thought reasoning actually explain AI decisions?

Research shows that CoT reasoning quality is weakly correlated with output correctness in agentic pipelines. Chains generate analyzable material that appears coherent but doesn't causally produce outputs, creating false confidence in explainability.

Can reasoning topologies be formally classified as graph types?

CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.

Does long chain of thought reasoning follow molecular bond patterns?

Deep-Reasoning (covalent), Self-Reflection (hydrogen bonds), and Self-Exploration (van der Waals forces) form stable distributions in effective Long CoT. Mixing these stable structures from different teachers destabilizes learning despite matched performance metrics.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

Next inquiring lines