Does reinforcement learning teach social reasoning or just shortcuts?
When RL optimizes for accuracy on theory of mind tasks, do models actually learn to track mental states, or do they find faster paths to correct answers? The distinction matters for genuine reasoning capability.
Rule-based RL has proven effective for enhancing structured reasoning in math and coding. The question is whether it generalizes to social reasoning — "interpreting mental states and hidden commonsense" — where rules and ground truths are less well-defined.
The answer is scale-dependent.
7B models: RL induces high-quality, interpretable, and transferable belief-tracking behaviors. The reasoning traces show explicit step-by-step mental state tracking: identifying what each agent knows, what each agent believes about what others know, and how beliefs update as the story progresses. This transfers across benchmarks.
≤3B models: RL leads to reasoning collapse. Despite achieving "substantial accuracy gains comparable to the larger models," these models "failed to generate interpretable, structured reasoning traces." Instead, they produce "drastically shortened, less meaningful responses" — suggesting reliance on "implicit rather than explicit structured reasoning." They appear to have internalized "alternative rules or patterns that are effective for the specific structures found in benchmark datasets."
The mechanism: simple rule-based rewards optimize for correctness, but in models with limited capacity relative to task complexity, this "may inadvertently encourage shortcut learning." The model finds a faster path to the right answer that doesn't involve actually tracking mental states. It works on benchmarks but wouldn't generalize to genuine social interaction.
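The shortcut incentive follows directly from the shape of the reward. A minimal sketch of a rule-based accuracy reward of this kind (the `<answer>` tag convention is an illustrative assumption, not the format used in the source work):

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary accuracy reward: 1.0 if the final answer matches the gold
    answer, else 0.0. Note the reward inspects only the answer span,
    never the reasoning trace that precedes it."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns no reward
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == gold_answer.strip().lower() else 0.0

# The reward is identical for a long belief-tracking trace and a
# one-line shortcut, so nothing in the signal favors explicit reasoning.
long_trace = ("Sally saw the ball in the basket. Anne moved it while Sally "
              "was away, so Sally still believes it is there. "
              "<answer>basket</answer>")
shortcut = "<answer>basket</answer>"
```

Because both responses above earn the same reward, a capacity-limited model that finds the shortcut has no gradient pressure to keep the trace.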
This creates a "crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities." The mismatch is invisible if you only look at accuracy scores — the 3B model looks comparable to the 7B model. It becomes visible only when you inspect the reasoning traces.
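Trace inspection need not be purely manual; a crude automated flag can surface candidates for review. The keyword list and word-count threshold below are illustrative assumptions, not criteria from the source work:

```python
def looks_like_belief_tracking(trace: str, min_words: int = 40) -> bool:
    """Heuristic flag for collapsed reasoning traces: a trace that is
    very short or never uses mental-state language is unlikely to
    contain explicit step-by-step belief tracking."""
    mental_state_markers = ("knows", "believes", "thinks", "saw", "unaware")
    long_enough = len(trace.split()) >= min_words
    mentions_mental_states = any(m in trace.lower() for m in mental_state_markers)
    return long_enough and mentions_mental_states
```

A filter like this would pass the explicit 7B-style traces and flag the "drastically shortened, less meaningful responses" of the smaller models, even when both score the same on accuracy.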
The finding extends the entropy collapse dynamic from formal reasoning to social reasoning, but with an important twist: in formal domains, shortcut learning tends to reduce diversity while maintaining some reasoning structure. In social reasoning, it eliminates reasoning structure entirely while preserving accuracy — a more severe form of collapse.
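The diversity-reduction side of that comparison is what policy entropy measures. A minimal sketch, with illustrative distributions rather than measurements from the source work:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy (in nats) of a discrete policy distribution:
    H(p) = -sum_i p_i * log(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A diverse policy spreads probability over many continuations (high
# entropy); a collapsed policy concentrates on one (entropy near zero).
diverse = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]
```

In formal domains, collapse shows up as `entropy(collapsed)` falling toward zero while some reasoning structure survives; the ToM finding is that in social reasoning the structure itself disappears as well.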
Source: Theory of Mind
Related concepts in this collection
- Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling? The ToM reasoning collapse is an extreme form of entropy collapse: not just reduced diversity but the elimination of interpretable reasoning.
- Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges. The scale-dependent finding adds a caveat: RL teaches when to activate only if the model has sufficient capacity; below threshold, RL teaches shortcuts instead.
- Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning. The 7B success suggests latent ToM capability exists at scale; the 3B failure suggests it does not exist below a capacity threshold for social reasoning.
Original note title: RL on ToM produces scale-dependent reasoning collapse — large models develop belief-tracking while small models achieve accuracy through shortcuts