INQUIRING LINE

What makes a causal abstraction more transferable than a generic heuristic?

This explores what separates a reusable, structure-bearing abstraction (one that holds when conditions change) from a surface pattern that only works because it was seen before — and the corpus has a sharp, slightly uncomfortable answer.


The shortest version the corpus offers: a causal abstraction encodes *invariant structure*, a mechanism that stays true when the surface details shift, while a heuristic is recall of a training schema that quietly decouples from the problem's actual structure the moment you step outside the distribution it was learned in.

The clearest evidence is what happens at the distribution boundary. Chain-of-thought reasoning degrades *predictably* as tasks, lengths, and formats drift from the training data: models keep producing fluent-looking steps while the underlying logic falls apart (Does chain-of-thought reasoning actually generalize beyond training data?). A telling sign: reasoning-trace length tracks *how close a problem is to training examples*, not how hard it actually is. In-distribution, length and difficulty correlate; out-of-distribution, they fully decouple (Does longer reasoning actually mean harder problems?). That is the signature of a heuristic: it measures familiarity, not structure. The broader critique frames CoT as constrained imitation, reproducing the *form* of reasoning rather than performing inference, which is exactly why format effects dominate content and structurally invalid prompts can still succeed (Why does chain-of-thought reasoning fail in predictable ways?; What makes chain-of-thought reasoning actually work?).

Against that backdrop, what makes an abstraction transferable is that it organizes the search rather than reciting an answer. RLAD shows abstractions enforcing structured breadth-first exploration, and at large compute budgets, spending on *diverse abstractions* beats simply sampling more solutions, precisely because the abstraction is a reusable scaffold rather than a one-shot guess (Can abstractions guide exploration better than depth alone?). LLM Programs make the same point from the engineering side: wrapping a model in explicit algorithmic control flow, and handing each step only its relevant context, turns brittle monolithic reasoning into modular, debuggable structure that carries across problems (Can algorithms control LLM reasoning better than LLMs alone?). The transferable thing is the *organization of the work*, not the memorized trajectory.
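The LLM Programs idea can be sketched in a few lines. This is only an illustration under stated assumptions: `stub_model`, `run_program`, and the prompt layout are hypothetical names invented here, and the stub stands in for a real model call so the example runs on its own. What it shows is the division of labor: the surrounding algorithm owns the loop and the state, and each call sees only its step-specific context.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: any function from prompt text to output text.
ModelFn = Callable[[str], str]


def stub_model(prompt: str) -> str:
    """Toy 'model' (an assumption for illustration): upper-cases the last prompt line."""
    return prompt.strip().splitlines()[-1].upper()


def run_program(question: str, passages: List[str], model: ModelFn) -> Dict[str, object]:
    """Explicit control flow around the model; the algorithm holds the state."""
    notes: List[str] = []
    for passage in passages:
        # Information hiding: this call sees the question and ONE passage,
        # never the other passages or the earlier notes.
        notes.append(model(f"Question: {question}\n{passage}"))
    # The final call sees only the accumulated notes, not the raw passages.
    answer = model(f"Question: {question}\n" + " | ".join(notes))
    return {"notes": notes, "answer": answer}


result = run_program("q", ["first passage", "second passage"], stub_model)
```

Because every intermediate note is an ordinary value in `result["notes"]`, a wrong step can be inspected and retried in isolation, which is the sense in which the structure is modular and debuggable.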

Here's the part you might not expect: the corpus warns that "causal" performance in LLMs can itself be a heuristic wearing better clothes. Models handle causal relations better than temporal ones largely because causal connectives are explicit and frequent in training text, while temporal order is implicit (Why do LLMs handle causal reasoning better than temporal reasoning?), so even the apparent causal competence rides on surface statistics. And when you probe the actual reasoning, LLMs reproduce *human* causal biases, such as weak explaining-away and Markov violations, which points to shared roots in training-data statistics rather than a grasp of mechanism (Do large language models make the same causal reasoning mistakes as humans?). So calling something a "causal abstraction" doesn't automatically make it transferable; it has to encode the mechanism, not the co-occurrence.
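For readers unfamiliar with the term, a small worked example shows what normative explaining-away looks like in a collider network A -> C <- B. The numbers are illustrative assumptions chosen here (independent causes with prior 1/2 each, and C defined as "A or B"); they are not taken from the cited study. "Weak explaining-away" means that a reasoner lowers its belief in A too little once B is known to be present.

```python
from fractions import Fraction
from itertools import product

# Illustrative assumptions: independent causes, each with prior 1/2.
P_A = Fraction(1, 2)
P_B = Fraction(1, 2)


def joint():
    """Yield (a, b, c, probability) for the collider A -> C <- B with C = A or B."""
    for a, b in product([0, 1], repeat=2):
        p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
        yield a, b, int(a or b), p


def prob_a(**evidence):
    """Return P(A=1 | evidence); evidence may fix b and/or c, e.g. prob_a(c=1)."""
    rows = [
        (a, b, c, p)
        for a, b, c, p in joint()
        if all({"b": b, "c": c}[name] == value for name, value in evidence.items())
    ]
    total = sum(p for *_, p in rows)
    return sum(p for a, *_, p in rows if a) / total


prior = prob_a()                     # belief in A before any evidence
given_effect = prob_a(c=1)           # observing the effect raises belief in A
given_both = prob_a(c=1, b=1)        # learning B is present explains C away
given_other_cause = prob_a(b=1)      # without C observed, B says nothing about A
```

Under these assumptions, belief in A rises from 1/2 to 2/3 when the effect is observed and falls back to 1/2 once the alternative cause is known. The last line is the independence the Markov condition requires: with C unobserved, learning B should leave belief in A unchanged.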

Two final cautions are worth carrying away. First, causal structure is necessary but not sufficient: even a clean causal model leaves out associative, analogical, and emotion-driven reasoning, so abstraction-as-causal-graph is a tractable starting point, not the whole of thought (Can causal models alone capture how humans actually reason?). Second, transferability and faithfulness can come apart: fine-tuning can make reasoning steps *less* causally connected to the answer, so that the chain becomes performative rather than load-bearing. That is the abstraction quietly degrading into a heuristic without the accuracy ever flinching (Does fine-tuning disconnect reasoning steps from final answers?). The transferable abstraction is the one whose steps actually drive the outcome when the surface changes; the heuristic is the one that only looked like it did.
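The faithfulness idea can be made concrete with a rough sketch: perturb the reasoning chain and count how often the final answer stays the same. Everything here is an assumption for illustration, not the cited study's protocol. The `AnswerFn` interface, the two toy answerers, and the example steps are hypothetical, and only two of the three perturbations named in the sources (early termination and filler substitution) are shown, since paraphrasing would need a separate paraphraser.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def early_termination(steps: List[str]) -> List[str]:
    """Keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def filler_substitution(steps: List[str]) -> List[str]:
    """Replace every step with content-free filler, keeping the step count."""
    return ["..." for _ in steps]


def invariance_rate(question: str, steps: List[str], answer_fn: AnswerFn,
                    perturbations: List[Perturbation]) -> float:
    """Fraction of perturbations that leave the answer unchanged.
    High invariance suggests the chain is not driving the answer."""
    baseline = answer_fn(question, steps)
    unchanged = sum(
        answer_fn(question, perturb(steps)) == baseline for perturb in perturbations
    )
    return unchanged / len(perturbations)


def load_bearing(question: str, steps: List[str]) -> str:
    """Toy answerer that reads its answer off the last reasoning step."""
    return steps[-1].split("=")[-1].strip() if steps else ""


def performative(question: str, steps: List[str]) -> str:
    """Toy answerer that ignores the chain entirely."""
    return "20"


STEPS = ["2 + 3 = 5", "5 * 4 = 20"]
TESTS = [early_termination, filler_substitution]
rate_load_bearing = invariance_rate("(2 + 3) * 4?", STEPS, load_bearing, TESTS)
rate_performative = invariance_rate("(2 + 3) * 4?", STEPS, performative, TESTS)
```

Both toy answerers give the same answer on the unperturbed chain, so accuracy alone cannot tell them apart; only the invariance rate separates the chain that drives the answer from the one that merely accompanies it.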


Sources (10 notes)

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
