INQUIRING LINE

Does formal reasoning training actively degrade social reasoning ability?

This explores whether training models to be better at step-by-step formal reasoning (math, logic, chains-of-thought) actually makes them worse at reading minds and social situations — and whether the corpus sees this as an active trade-off or just a coincidence.


The corpus leans toward yes: optimizing a model for formal reasoning does appear to erode its social reasoning, with an important caveat about *why*. The most direct evidence is striking. Advanced reasoning models like Claude 3.7 Sonnet and o1 score *worse* than their older, less reasoning-tuned predecessors on theory-of-mind benchmarks such as Decrypto, falling below not just humans but even simple word-embedding baselines on tasks involving representational change, false belief, and counterfactuals (see "Why do reasoning models fail at theory of mind tasks?" and "Why do advanced reasoning models fail at understanding minds?"). In short: more reasoning effort doesn't help social cognition, and may actively interfere with it.

The deeper claim isn't just "they're bad at it". It is that social reasoning runs on a *different cognitive architecture* than formal reasoning, so optimizing for one can crowd out the other (see "Why do reasoning models struggle with theory of mind tasks?"). Formal reasoning is sequential derivation: one step justifying the next. Tracking what other people believe seems to require holding *several models of minds in parallel*, which would explain why a method like ThoughtTracing, built on short Bayesian hypothesis tracking, beats long deliberate reasoning traces. When you train a model to derive answers in a chain, you may be reinforcing the wrong shape of computation for the social task.
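To make "holding several models of minds in parallel" concrete, here is a toy sketch of Bayesian hypothesis tracking on a Sally-Anne style false-belief scenario. It is not the ThoughtTracing algorithm; the scenario, the function names `normalize` and `update`, and all likelihood values are assumptions chosen for illustration.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def update(beliefs: Dict[str, float], likelihoods: Dict[str, float]) -> Dict[str, float]:
    """One Bayes step: posterior is proportional to prior times likelihood."""
    return normalize({h: beliefs[h] * likelihoods[h] for h in beliefs})


# Competing hypotheses about where Sally BELIEVES the marble is
# (not where it actually is). Both are kept alive at every step.
beliefs = {"basket": 0.5, "box": 0.5}

# Observation 1: Sally watches the marble go into the basket.
# (Illustrative likelihoods, not measured values.)
beliefs = update(beliefs, {"basket": 0.9, "box": 0.1})

# Observation 2: Sally is out of the room while the marble moves to the box.
# She cannot see this, so it is equally likely under both hypotheses
# and leaves the distribution over her belief unchanged.
beliefs = update(beliefs, {"basket": 0.5, "box": 0.5})

# Observation 3: Sally returns and walks toward the basket.
beliefs = update(beliefs, {"basket": 0.8, "box": 0.2})

# Predict where Sally will look from the most probable belief hypothesis.
prediction = max(beliefs, key=beliefs.get)
print(prediction, beliefs)
```

The point of the sketch is the shape of the computation: every hypothesis about Sally's mind is re-weighted at each observation, rather than one conclusion being derived step by step from the true location of the marble.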

That framing connects to a broader pattern the corpus keeps surfacing: reasoning training is *selective*, not free. Knowledge and reasoning live in different layers of the network, with factual knowledge retrieved in the lower layers and reasoning adjustment happening in the higher ones, which is why the same training that boosts math can degrade knowledge-heavy domains like medicine (see "Why does reasoning training help math but hurt medical tasks?"). Social reasoning looks like another casualty of this specialization. And the degradation can be invisible: supervised fine-tuning raises final-answer accuracy on benchmarks while cutting Information Gain by 38.9 percent, so models produce right answers through post-hoc rationalization rather than real inferential steps (see "Does supervised fine-tuning improve reasoning or just answers?"). You only catch it if you inspect the steps, not the score.
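A rough sketch of why final-answer accuracy can hide this kind of degradation. The per-step measure below (change in log-probability assigned to the correct answer after each reasoning step) is an assumed stand-in, not the Information Gain metric used in the cited note, and the two traces are made-up numbers for illustration only.

```python
import math
from typing import List


def step_gains(p_correct_by_step: List[float]) -> List[float]:
    """Bits gained toward the correct answer at each reasoning step.

    p_correct_by_step[i] is the (hypothetical) probability assigned to the
    correct answer after step i, with index 0 being before any reasoning.
    """
    return [
        math.log2(after) - math.log2(before)
        for before, after in zip(p_correct_by_step, p_correct_by_step[1:])
    ]


# Trace A: each step moves the model closer to the answer.
trace_a = [0.25, 0.40, 0.60, 0.90]
# Trace B: the answer is effectively fixed before the "reasoning" starts,
# so the steps read as rationalization after the fact.
trace_b = [0.88, 0.89, 0.90, 0.90]

total_a = sum(step_gains(trace_a))
total_b = sum(step_gains(trace_b))

# Both traces end at the same confidence in the correct answer, so an
# accuracy-only benchmark scores them identically.
same_final_score = trace_a[-1] == trace_b[-1]

print(f"trace A: {total_a:.3f} bits, trace B: {total_b:.3f} bits")
```

Under these assumptions the two traces are indistinguishable by final score, yet almost all of the step-level gain sits in trace A.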

But here's the twist that complicates a simple "training is the villain" story: training *can* go the other way when it's pointed at the social task directly. Reinforcement learning applied specifically to theory of mind produces explicit, transferable belief-tracking in 7B models, while smaller models reach comparable accuracy through shortcuts that lack interpretable reasoning traces (see "Does reinforcement learning on theory of mind collapse with model scale?"). This suggests the problem isn't reasoning training as such but training on the *wrong objective*. Generic reasoning optimization (math, code, verifiable answers) sculpts a sequential machine that doesn't transfer to minds; targeted social RL can build the parallel machinery, if the model is big enough.

So the honest synthesis: formal reasoning training does appear to actively degrade social reasoning, but as a side effect of specialization rather than a law of nature. The same mechanism cuts both ways. RL can transform a model's internal thinking from counterproductive self-doubt into productive gap analysis (see "Does extended thinking help or hurt model reasoning?"), and the corpus also suggests that base models already carry latent capabilities that post-training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). The unsettling takeaway: a model can get smarter and worse at understanding you at the same time, and standard benchmarks won't tell you it happened.


Sources (8 notes)

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do advanced reasoning models fail at understanding minds?

Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
