INQUIRING LINE

Why do chain-of-thought outputs look logical but perform rhetorically?

This explores why chain-of-thought traces read like step-by-step logic but actually work by persuasion and pattern-matching rather than genuine inference.


Chain-of-thought (CoT) traces read like airtight logic but behave like rhetoric: convincing form over genuine inference. The corpus points to a single root cause: CoT learns the *shape* of reasoning, not reasoning itself. The most direct evidence is that logically invalid CoT prompts perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"). If broken logic still delivers the gains, then the logic was never doing the work; the format was. Broader analyses point the same way: training format shapes reasoning strategy 7.5× more than the actual domain, and demo position alone swings accuracy by 20% (see "What makes chain-of-thought reasoning actually work?"). CoT is pattern-guided generation dressed in the costume of deduction (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "What makes chain-of-thought reasoning actually work?").

That costume is why outputs *look* logical. But why do they perform *rhetorically*, persuading rather than proving? Because the visible chain and the actual computation are decoupled. Faithfulness studies show reasoning chains routinely fail both causal sufficiency (the steps often don't matter to the answer) and causal necessity (spurious steps creep in), so most evaluations measure how good the output *looks*, not whether the reasoning caused it (see "Do language models actually use their reasoning steps?"). In multi-LLM pipelines this becomes explicit: plausible-looking chains often precede wrong answers, reviewer scores correlate only weakly with response quality, and the chain 'explains' the failure only in hindsight. That is explanation without explainability (see "Does chain of thought reasoning actually explain model decisions?"). The text reads as justification but functions as performance.
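The necessity-style check described above can be sketched as a step-ablation loop: drop one step at a time and see whether the answer changes. This is a minimal illustration, not the protocol of the cited faithfulness studies. The names `necessary_steps` and `toy_answer_fn` are hypothetical, and the toy function merely stands in for a real model call.

```python
from typing import Callable, List


def necessary_steps(
    steps: List[str], answer_fn: Callable[[List[str]], str]
) -> List[int]:
    """Return the indices of steps whose removal changes the final answer."""
    baseline = answer_fn(steps)
    changed = []
    for i in range(len(steps)):
        # Remove exactly one step and re-run the (hypothetical) model.
        ablated = steps[:i] + steps[i + 1:]
        if answer_fn(ablated) != baseline:
            changed.append(i)
    return changed


def toy_answer_fn(steps: List[str]) -> str:
    """Hypothetical stand-in for a model: it only reads 'add N' steps
    and ignores every other sentence in the chain."""
    total = 0
    for step in steps:
        if step.startswith("add "):
            total += int(step.split()[1])
    return str(total)


steps = [
    "Let us think carefully.",
    "add 2",
    "Clearly this is important.",
    "add 3",
]
necessary = necessary_steps(steps, toy_answer_fn)
# Share of steps that can be deleted without changing the answer.
decorative_share = 1 - len(necessary) / len(steps)
print(necessary, decorative_share)
```

In this toy chain only the two `add` steps affect the answer; the other two sentences are decoration in exactly the sense the paragraph describes. With a real model, `answer_fn` would need repeated sampling or fixed decoding settings for the comparison to be meaningful.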

There's a sharper twist: much of a chain is literally decorative. Concise reasoning (Chain of Draft) matches verbose CoT accuracy at 7.6% of the token cost, meaning roughly 92% of the tokens served style and documentation, not computation (see "Can minimal reasoning chains match full explanations?"). Probing work makes the timing visible: on easy tasks models commit to an answer internally *before* finishing the reasoning, then generate the chain as after-the-fact narration; only on genuinely hard tasks does the reasoning track real belief updates (see "Does chain-of-thought reasoning reflect genuine thinking or performance?"). So 'performative' isn't a metaphor: it's measurable, and it's difficulty-dependent.
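The token arithmetic behind that figure is simple enough to check directly. The sketch below takes the 7.6% kept share from the text; the 1,000-token verbose budget is a hypothetical number chosen only to make the ratio concrete.

```python
def removed_share(kept_share: float) -> float:
    """Fraction of tokens dropped when only `kept_share` of them is kept."""
    return 1.0 - kept_share


kept = 0.076                 # 7.6% of tokens, as reported in the text
removed = removed_share(kept)

verbose_tokens = 1000        # hypothetical verbose CoT budget (assumption)
concise_tokens = round(verbose_tokens * kept)

print(f"removed share: {removed:.1%}")
print(f"{verbose_tokens} verbose tokens -> about {concise_tokens} concise tokens")
```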

The lateral payoff is a vocabulary you may not have known you wanted: rhetoric. One note maps Aristotle's logos/ethos/pathos onto AI explanation design, showing every explanation loads all three persuasive channels at once whether designers intend it or not (see "How do logos, ethos, and pathos shape AI explanations?"). A chain of thought presents as pure *logos* (appeal to logic) while quietly running on *ethos*, the credibility of looking like a careful thinker. That's exactly the gap between looking logical and being logical. And a decomposition study explains why the appearance isn't *pure* theater: CoT performance splits into output probability, memorization, and genuine but noisy reasoning that accumulates error step by step. The model really does reason a little; it just also leans heavily on what's probable and what it has seen before (see "What three separate factors drive chain-of-thought performance?").
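To see what "accumulates error step by step" can mean quantitatively, here is a toy model, not the decomposition study's own analysis. It assumes every step is correct independently with the same fixed probability, so a chain of n steps is fully correct with probability p^n. Both the independence assumption and the 0.95 per-step accuracy are illustrative choices, not figures from the source.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct, assuming independent steps
    with identical accuracy (a simplifying assumption, not a measured fact)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Illustrative per-step accuracy; longer chains compound the same small error.
for n in (1, 5, 10, 20):
    print(n, round(chain_success_probability(0.95, n), 3))
```

Under these assumptions, even a small per-step error rate erodes long chains, which is one way noisy reasoning can coexist with real but limited inference.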

If you want to pull one more thread: the form-over-content story also predicts *when* CoT helps. Optimal chain length follows an inverted U and shrinks as models get more capable (see "Why does chain of thought accuracy eventually decline with length?"), and for simple questions the step-by-step scaffold can actively hurt unless the question's information is aggregated into the prompt before reasoning begins (see "Why do some questions perform better without step-by-step reasoning?"). More performed reasoning isn't more thinking; past a point, it's just more rhetoric.


Sources (12 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Does chain of thought reasoning actually explain model decisions?

Reviewer scores for reasoning chains are weakly correlated with response quality in multi-LLM pipelines. Plausible-looking reasoning often precedes incorrect outputs, and chains reflect failures only in retrospect, making them poor explanations despite appearing coherent.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.

How do logos, ethos, and pathos shape AI explanations?

Aristotle's three appeals map onto explanation design across two goals (how AI works, why AI merits use), creating a 3×2 space where every explanation loads all three channels simultaneously. Naming these rhetorical channels lets designers account for unintended persuasive effects.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
