What makes chain-of-thought reasoning actually work?

Maps the structural properties of chain-of-thought reasoning, including topology, format effects, thought anchors, and token-level mechanisms.

Topic Hub · 36 linked notes · 3 sections

Reasoning Methods and CoT Structure

20 notes

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
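
As a rough gloss on what "distinct formal graph structures" could mean here (an illustration, not taken from the linked note), the sketch below encodes the three topologies as directed graphs and checks the properties that separate them: a chain is a single path, Tree of Thought relaxes that to a tree, and Graph of Thought relaxes it further to a DAG whose branches can merge.

```python
# Illustrative encoding of the three reasoning topologies as graph types.
# Requires networkx; the specific edge sets are made up for the example.
import networkx as nx

cot = nx.DiGraph([(0, 1), (1, 2), (2, 3)])                  # Chain of Thought: one path
tot = nx.DiGraph([(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)])  # Tree of Thought: branching
got = nx.DiGraph([(0, 1), (0, 2), (1, 3), (2, 3)])          # Graph of Thought: branches merge

for name, g in [("CoT", cot), ("ToT", tot), ("GoT", got)]:
    is_chain = all(d <= 1 for _, d in g.in_degree()) and all(d <= 1 for _, d in g.out_degree())
    print(name,
          "chain:", is_chain,
          "tree:", nx.is_arborescence(g),
          "DAG:", nx.is_directed_acyclic_graph(g))
```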

When does sequential reasoning beat parallel voting?

Explores whether sequential chain-of-thought reasoning or parallel voting is more effective for different problem types. Understanding this trade-off helps predict which test-time compute strategy will work best.
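
To make the trade-off concrete, here is a minimal sketch of the two strategies being compared; `generate` is a hypothetical stand-in for any sampling-based LLM call, not an API from the note.

```python
# Sequential CoT: one careful chain. Parallel voting: many sampled chains,
# majority vote over their final answers (self-consistency style).
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: return the model's final answer string for `prompt`."""
    raise NotImplementedError

def sequential_cot(question: str) -> str:
    # A single low-temperature pass that reasons step by step.
    return generate(f"{question}\nLet's think step by step.", temperature=0.0)

def parallel_voting(question: str, n: int = 16) -> str:
    # Sample n independent chains and return the most common answer.
    answers = [generate(f"{question}\nLet's think step by step.") for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```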

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can we steer reasoning toward brevity without retraining?

This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
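
A minimal sketch of what such steering could look like, assuming a difference-of-means direction and a LLaMA-style transformers model; the layer index and scale are illustrative guesses, not values from the note.

```python
# Compute a "concise minus verbose" direction from paired reasoning traces,
# then add it to one layer's residual stream during generation via a hook.
import torch

def mean_hidden_state(model, tokenizer, texts, layer):
    """Average residual-stream activations at `layer` over a set of traces."""
    states = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer].mean(dim=1))  # mean over token positions
    return torch.cat(states).mean(dim=0)

def brevity_direction(model, tokenizer, concise_traces, verbose_traces, layer=20):
    # Direction pointing from verbose reasoning toward concise reasoning.
    return (mean_hidden_state(model, tokenizer, concise_traces, layer)
            - mean_hidden_state(model, tokenizer, verbose_traces, layer))

def add_steering_hook(model, direction, layer=20, scale=4.0):
    # Assumes a LLaMA-style module layout (model.model.layers); adjust as needed.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)
```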

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

Does training data format shape reasoning strategy more than domain?

What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.

How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
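
One way to operationalize "outsized influence" is counterfactual resampling: keep or drop a sentence, resample the rest of the trace many times, and see how much the answer distribution moves. The sketch below uses a hypothetical `continue_trace` helper and illustrates the idea rather than the specific methods discussed in the note.

```python
def continue_trace(question: str, partial_trace: str, n: int) -> list[str]:
    """Placeholder: sample n completions of the trace and return their final answers."""
    raise NotImplementedError

def sentence_importance(question, sentences, final_answer, n=20):
    """Score each sentence by how much dropping it shifts the answer distribution."""
    scores = []
    for i in range(len(sentences)):
        kept = continue_trace(question, " ".join(sentences[:i + 1]), n)  # sentence i included
        dropped = continue_trace(question, " ".join(sentences[:i]), n)   # sentence i resampled away
        p_kept = sum(a == final_answer for a in kept) / n
        p_dropped = sum(a == final_answer for a in dropped) / n
        scores.append(p_kept - p_dropped)  # large value = the sentence anchors the trace
    return scores
```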

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. This matters because false confidence in invalid traces could mask errors.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B-parameter model during reasoning training.

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Can cognitive scaffolding improve how models reason about social scenes?

This explores whether structuring visual reasoning through perception, situation, and norm stages—grounded in how humans actually think—helps language models tackle socially complex tasks better than standard reasoning approaches.

Where does LLM reasoning actually happen during generation?

Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.

Can we trigger reasoning without explicit chain-of-thought prompts?

This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.

Can continuous reasoning avoid forgetting in instruction-tuned models?

Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?

Can we measure how deeply a model actually reasons?

What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
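
One concrete way to operationalize "prediction shift" (an illustration under logit-lens assumptions, not the note's own metric) is to decode each hidden layer through the output head and sum the KL divergence between consecutive layers' next-token distributions.

```python
# Logit-lens style prediction shift across layers for a transformers causal LM.
# Note: this skips the model's final layer norm, which a careful version would apply.
import torch
import torch.nn.functional as F

def prediction_shift(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    head = model.get_output_embeddings()  # unembedding / lm_head
    shifts, prev = [], None
    for hidden in out.hidden_states:      # one tensor per layer
        logp = F.log_softmax(head(hidden[:, -1]), dim=-1)  # next-token distribution
        if prev is not None:
            shifts.append(F.kl_div(logp, prev, log_target=True, reduction="sum").item())
        prev = logp
    return sum(shifts)  # larger = predictions keep moving as depth increases
```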

Why do some questions perform better without step-by-step reasoning?

Explores whether chain-of-thought prompting universally improves reasoning or if simpler prompts work better for certain questions. Understanding this matters because it challenges assumptions about how LLMs should be prompted to solve problems.

Reasoning Logic and Internal Rules

10 notes

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? This tests whether logical correctness or structural format is the real driver of CoT's effectiveness.

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
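
The manipulation is simple to state in code: hold the premises and the question fixed and vary only the order in which the premises appear. The helper below is illustrative; the prompt template is an assumption, not one taken from the note.

```python
# Build prompts that differ only in premise order (fine for a handful of premises).
from itertools import permutations
import random

def order_variants(premises: list[str], question: str, k: int = 5) -> list[str]:
    orders = random.sample(list(permutations(premises)), k)
    return ["\n".join(order) + f"\n\nQuestion: {question}" for order in orders]
```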

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Does partial formalism work better than full symbolic translation?

Explores whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
