LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Does reinforcement learning teach social reasoning or just shortcuts?

When RL optimizes for accuracy on theory of mind tasks, do models actually learn to track mental states, or do they find faster paths to correct answers? The distinction matters for genuine reasoning capability.

Note · 2026-02-22 · sourced from Theory of Mind
How should researchers navigate LLM reasoning research? Where exactly do reasoning models break down? Why do LLMs excel at social norms yet fail at theory of mind?

Rule-based RL has proven effective for enhancing structured reasoning in math and coding. The question is whether it generalizes to social reasoning — "interpreting mental states and hidden commonsense" — where rules and ground truths are less well-defined.
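To make "rule-based" concrete, here is a minimal sketch of the kind of outcome-only reward such setups typically use: a format check on a think/answer template plus an exact-match check against the gold label. The template, weights, and function name are assumptions for illustration; the source only says the reward is a simple correctness rule.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Outcome-only reward: small format bonus plus exact-match answer check.

    Illustrative sketch; the source does not specify the exact reward shape,
    only that it is a simple rule-based correctness signal.
    """
    # Require a <think>...</think><answer>...</answer> template, a common
    # convention in rule-based RL setups (an assumption, not from the source).
    match = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                      completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    answer = match.group(2).strip().lower()
    format_bonus = 0.1
    correct = 1.0 if answer == gold_answer.strip().lower() else 0.0
    return format_bonus + correct
```

Note that nothing in this reward looks at the content of the reasoning trace itself, which is exactly the property the rest of the note turns on.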

The answer is scale-dependent.

7B models: RL induces high-quality, interpretable, and transferable belief-tracking behaviors. The reasoning traces show explicit step-by-step mental state tracking: identifying what each agent knows, what each agent believes about what others know, and how beliefs update as the story progresses. This transfers across benchmarks.
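To spell out what "tracking what each agent knows and believes" amounts to, here is a small, hypothetical belief-state tracker for a Sally-Anne-style false-belief story. It is not the models' internal mechanism, just an explicit version of the bookkeeping the 7B traces verbalize in natural language.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    present: bool = True                          # can the agent observe events?
    beliefs: dict = field(default_factory=dict)   # e.g. {"ball": "basket"}

def observe(world: dict, agents: list[Agent], obj: str, location: str) -> None:
    """Update reality, and the beliefs of every agent who can see the event."""
    world[obj] = location
    for agent in agents:
        if agent.present:
            agent.beliefs[obj] = location

# Classic false-belief setup: Sally leaves before the ball is moved.
world = {}
sally, anne = Agent("Sally"), Agent("Anne")
agents = [sally, anne]

observe(world, agents, "ball", "basket")   # both agents see the ball placed
sally.present = False                      # Sally leaves the room
observe(world, agents, "ball", "box")      # only Anne sees the move

assert world["ball"] == "box"              # reality
assert sally.beliefs["ball"] == "basket"   # Sally's (false) first-order belief
assert anne.beliefs["ball"] == "box"       # Anne's belief tracks reality
```

The traces described in the source essentially narrate this update loop step by step, and extend it to second-order beliefs (what Anne thinks Sally believes); the sketch stops at first-order beliefs for brevity.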

≤3B models: RL leads to reasoning collapse. Despite achieving "substantial accuracy gains comparable to the larger models," these models "failed to generate interpretable, structured reasoning traces." Instead, they produce "drastically shortened, less meaningful responses" — suggesting reliance on "implicit rather than explicit structured reasoning." They appear to have internalized "alternative rules or patterns that are effective for the specific structures found in benchmark datasets."

The mechanism: simple rule-based rewards optimize for correctness, but in models with limited capacity relative to task complexity, this "may inadvertently encourage shortcut learning." The model finds a faster path to the right answer that doesn't involve actually tracking mental states. It works on benchmarks but wouldn't generalize to genuine social interaction.
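As a hypothetical illustration of such a shortcut (the source does not identify which pattern the smaller models latch onto), Sally-Anne-style datasets have a well-known surface regularity: "where will X look for it?" is almost always answered by the object's first-mentioned location. A heuristic like the toy one below scores well on that benchmark structure without representing anyone's beliefs.

```python
import re

def extract_locations(story: str) -> list[str]:
    """Toy extractor: a location is whatever follows 'in the' or 'into the'."""
    return re.findall(r"(?:in|into) the (\w+)", story)

def shortcut_answer(story: str, question: str) -> str:
    """Benchmark-structure heuristic, not mental-state tracking (illustrative)."""
    locations = extract_locations(story)
    if "look for" in question or "search" in question:
        return locations[0]      # first location = the departed agent's belief
    return locations[-1]         # otherwise answer with the current location

story = ("Sally puts the ball in the basket. Sally leaves. "
         "Anne moves the ball into the box.")
print(shortcut_answer(story, "Where will Sally look for the ball?"))  # -> basket
```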

This creates a "crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities." The mismatch is invisible if you only look at accuracy scores — the 3B model looks comparable to the 7B model. It becomes visible only when you inspect the reasoning traces.
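One way to make the mismatch measurable, sketched below under the assumption that generated traces are available as text: report accuracy alongside a crude proxy for explicit belief tracking (mental-state language and trace length). The keyword heuristic and thresholds are purely illustrative, not the source's evaluation protocol.

```python
MENTAL_STATE_MARKERS = ("knows", "believes", "thinks", "saw", "is unaware", "assumes")

def trace_quality(trace: str, min_tokens: int = 50) -> float:
    """Crude proxy for explicit mental-state tracking in a reasoning trace."""
    tokens = trace.split()
    if len(tokens) < min_tokens:          # drastically shortened, collapsed traces
        return 0.0
    marker_hits = sum(trace.lower().count(m) for m in MENTAL_STATE_MARKERS)
    return min(1.0, marker_hits / 5.0)

def evaluate(examples: list[dict]) -> tuple[float, float]:
    """Return (accuracy, mean trace quality) over {'trace','answer','gold'} records."""
    acc = sum(e["answer"] == e["gold"] for e in examples) / len(examples)
    quality = sum(trace_quality(e["trace"]) for e in examples) / len(examples)
    return acc, quality
```

On the first number alone, the 3B and 7B runs would look interchangeable; the second number is where the collapse shows up.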

The finding extends the entropy collapse dynamic from formal reasoning to social reasoning, but with an important twist: in formal domains, shortcut learning tends to reduce diversity while maintaining some reasoning structure. In social reasoning, it eliminates reasoning structure entirely while preserving accuracy — a more severe form of collapse.
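For the entropy-collapse framing, the usual diagnostic is mean per-token entropy of the policy over generated traces, which drops sharply as the model converges on short, templated outputs. A hedged sketch follows; the tokenization and averaging details are assumptions, not the source's measurement.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean per-token entropy (in nats) of the next-token distribution.

    logits: [batch, seq_len, vocab] over generated positions only.
    A steep drop in this value across training steps is the standard
    signature of entropy collapse.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)   # [batch, seq_len]
    return entropy.mean().item()
```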


Source: Theory of Mind

Original note title: rl on ToM produces scale-dependent reasoning collapse — large models develop belief-tracking while small models achieve accuracy through shortcuts