Reinforcement Learning for LLMs · Agentic and Multi-Agent Systems · LLM Reasoning and Architecture

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Note · 2026-02-22 · sourced from Training Fine Tuning
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

RLAD addresses a structural problem with current reasoning training: RL incentivizes depth (longer chains attempting to verify one strategy) but not breadth (exploring diverse strategies). Long chains degenerate into frequent logic switches and unfocused exploration — the "underthinking" failure mode. As Why do reasoning LLMs fail at deeper problem solving? argues, merely extending chains doesn't help.

The solution: reasoning abstractions — concise natural language descriptions of procedural and factual knowledge that function as high-level subgoals. Two models are jointly trained:

  1. Abstraction generator: given a problem, propose multiple reasoning abstractions (strategies, intermediate lemmas, relevant principles)
  2. Solution generator: conditioned on an abstraction, generate a solution that utilizes its information

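The two-stage pipeline can be sketched as follows. This is a minimal illustration, not RLAD's implementation: `sample` is a placeholder for any LLM sampling call, and the prompt strings are invented for the example.

```python
def sample(prompt: str, n: int = 1) -> list[str]:
    # Placeholder for an LLM sampling call; returns n completions.
    # A real system would call a trained model here.
    return [f"completion {i} for: {prompt[:40]}" for i in range(n)]

def solve_with_abstractions(problem: str, n_abstractions: int = 4) -> list[str]:
    # 1) Abstraction generator: propose several high-level strategies,
    #    lemmas, or principles for the problem.
    abstractions = sample(
        f"Propose a reasoning abstraction for: {problem}", n=n_abstractions
    )
    # 2) Solution generator: produce one solution conditioned on each
    #    abstraction, so breadth comes from the abstractions themselves.
    return [
        sample(f"Using the strategy '{a}', solve: {problem}")[0]
        for a in abstractions
    ]
```

Each abstraction seeds exactly one solution chain, which is what separates this layout from sampling N independent chains of the same base prompt.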
The abstraction generator is rewarded for the improvement in solution accuracy that conditioning on its abstractions produces. The solution generator is rewarded for accuracy when using the abstraction. This cooperative two-player RL setup decouples learning signals: abstraction proposal and solution execution develop separately.
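The decoupled reward structure described above can be written down directly. A hedged sketch, assuming accuracy is estimated from Boolean correctness over sampled solutions; the function names are illustrative, not from the paper:

```python
def accuracy(correct: list[bool]) -> float:
    # Empirical accuracy over a batch of sampled solutions.
    return sum(correct) / len(correct)

def abstraction_reward(correct_with: list[bool],
                       correct_without: list[bool]) -> float:
    # Abstraction generator's reward: the *improvement* in solution
    # accuracy that conditioning on the abstraction produces, relative
    # to solving the same problem unconditioned.
    return accuracy(correct_with) - accuracy(correct_without)

def solution_reward(is_correct: bool) -> float:
    # Solution generator's reward: plain accuracy when conditioned
    # on the abstraction it was given.
    return 1.0 if is_correct else 0.0
```

Because the abstraction generator is scored on accuracy lift rather than raw accuracy, an abstraction that merely restates an easy problem earns near-zero reward, which pushes it toward genuinely informative subgoals.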

The key scaling result: at large test-time budgets, allocating more compute to generating abstractions yields larger gains than generating more solutions. This challenges the standard parallel sampling approach (generate N solutions, pick the best). Instead: generate diverse abstractions, then one good solution per abstraction. The abstractions enforce breadth where depth-only chains fail.
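The two allocation strategies can be contrasted under a fixed call budget. A toy sketch with an assumed 50/50 split for the abstraction-first plan; the actual optimal fraction is an empirical question the paper's scaling curves address:

```python
def plan_inference(budget: int, use_abstractions: bool) -> dict:
    # Two ways to spend a fixed budget of `budget` model calls.
    if use_abstractions:
        # Abstraction-first: half the calls propose diverse abstractions,
        # the other half generate one solution per abstraction.
        # (The 50/50 split is an illustrative assumption.)
        n_abs = budget // 2
        return {"abstractions": n_abs, "solutions_per_abstraction": 1}
    # Standard parallel sampling: every call is an independent solution
    # attempt on the bare problem; pick the best afterwards.
    return {"abstractions": 0, "solutions_per_abstraction": budget}
```

Both plans cost the same number of calls; they differ in where diversity comes from: explicit abstraction nodes versus sampling temperature alone.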

This connects to Why does parallel reasoning outperform single chain thinking? — abstractions are a mechanism for structured parallel exploration. And to Does separating planning from execution improve reasoning accuracy? — abstractions are a learned, RL-trained form of decomposition rather than a fixed prompt scaffold. In terms of Can reasoning topologies be formally classified as graph types?, RLAD creates a two-level structure: parallel abstraction nodes (breadth-first, like CoT-SC) each conditioning a single depth-first solution chain (like CoT), producing a learned GoT-like topology where aggregation happens at the abstraction level.

The warmstart from SFT (summarize multiple candidate solutions → generate diverse abstractions) followed by RL refinement mirrors the Why does SFT-then-RL training follow a predictable three-phase pattern? dynamic, but in a cooperative multi-agent setting.




reasoning abstractions decompose exploration into breadth-first strategy discovery and depth-first solution generation via two-player rl