Reinforcement Learning for LLMs · Agentic and Multi-Agent Systems · LLM Reasoning and Architecture

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Note · 2026-02-22 · sourced from Training Fine Tuning
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

RLAD addresses a structural problem with current reasoning training: RL incentivizes depth (longer chains attempting to verify one strategy) but not breadth (exploring diverse strategies). Long chains degenerate into frequent logic switches and unfocused exploration — the "underthinking" failure mode. As Why do reasoning LLMs fail at deeper problem solving? argues, merely extending chains doesn't help.

The solution: reasoning abstractions — concise natural language descriptions of procedural and factual knowledge that function as high-level subgoals. Two models are jointly trained:

  1. Abstraction generator: given a problem, propose multiple reasoning abstractions (strategies, intermediate lemmas, relevant principles)
  2. Solution generator: conditioned on an abstraction, generate a solution that utilizes its information

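The two-stage pipeline can be sketched as follows. This is a minimal illustration, not RLAD's implementation: `sample` is a placeholder for any LLM sampling call, and the prompt strings are invented for the example.

```python
def sample(prompt: str, n: int = 1) -> list[str]:
    # Placeholder for an LLM sampling call; returns n completions.
    # A real system would call a trained model here.
    return [f"completion {i} for: {prompt[:40]}" for i in range(n)]

def solve_with_abstractions(problem: str, n_abstractions: int = 4) -> list[str]:
    # 1) Abstraction generator: propose several high-level strategies,
    #    lemmas, or principles for the problem.
    abstractions = sample(
        f"Propose a reasoning abstraction for: {problem}", n=n_abstractions
    )
    # 2) Solution generator: produce one solution conditioned on each
    #    abstraction, so breadth comes from the abstractions themselves.
    return [
        sample(f"Using the strategy '{a}', solve: {problem}")[0]
        for a in abstractions
    ]
```

Each abstraction seeds exactly one solution chain, which is what separates this layout from sampling N independent chains of the same base prompt.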
The abstraction generator is rewarded for the improvement in solution accuracy that conditioning on its abstractions produces. The solution generator is rewarded for accuracy when using the abstraction. This cooperative two-player RL setup decouples learning signals: abstraction proposal and solution execution develop separately.
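The decoupled reward structure described above can be written down directly. A hedged sketch, assuming accuracy is estimated from Boolean correctness over sampled solutions; the function names are illustrative, not from the paper:

```python
def accuracy(correct: list[bool]) -> float:
    # Empirical accuracy over a batch of sampled solutions.
    return sum(correct) / len(correct)

def abstraction_reward(correct_with: list[bool],
                       correct_without: list[bool]) -> float:
    # Abstraction generator's reward: the *improvement* in solution
    # accuracy that conditioning on the abstraction produces, relative
    # to solving the same problem unconditioned.
    return accuracy(correct_with) - accuracy(correct_without)

def solution_reward(is_correct: bool) -> float:
    # Solution generator's reward: plain accuracy when conditioned
    # on the abstraction it was given.
    return 1.0 if is_correct else 0.0
```

Because the abstraction generator is scored on accuracy lift rather than raw accuracy, an abstraction that merely restates an easy problem earns near-zero reward, which pushes it toward genuinely informative subgoals.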

The key scaling result: at large test-time budgets, allocating more compute to generating abstractions yields larger gains than generating more solutions. This challenges the standard parallel sampling approach (generate N solutions, pick the best). Instead: generate diverse abstractions, then one good solution per abstraction. The abstractions enforce breadth where depth-only chains fail.
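The two allocation strategies can be contrasted under a fixed call budget. A toy sketch with an assumed 50/50 split for the abstraction-first plan; the actual optimal fraction is an empirical question the paper's scaling curves address:

```python
def plan_inference(budget: int, use_abstractions: bool) -> dict:
    # Two ways to spend a fixed budget of `budget` model calls.
    if use_abstractions:
        # Abstraction-first: half the calls propose diverse abstractions,
        # the other half generate one solution per abstraction.
        # (The 50/50 split is an illustrative assumption.)
        n_abs = budget // 2
        return {"abstractions": n_abs, "solutions_per_abstraction": 1}
    # Standard parallel sampling: every call is an independent solution
    # attempt on the bare problem; pick the best afterwards.
    return {"abstractions": 0, "solutions_per_abstraction": budget}
```

Both plans cost the same number of calls; they differ in where diversity comes from: explicit abstraction nodes versus sampling temperature alone.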

This connects to Why does parallel reasoning outperform single chain thinking? — abstractions are a mechanism for structured parallel exploration. And to Does separating planning from execution improve reasoning accuracy? — abstractions are a learned, RL-trained form of decomposition rather than a fixed prompt scaffold. In terms of Can reasoning topologies be formally classified as graph types?, RLAD creates a two-level structure: parallel abstraction nodes (breadth-first, like CoT-SC) each conditioning a single depth-first solution chain (like CoT), producing a learned GoT-like topology where aggregation happens at the abstraction level.

The warmstart from SFT (summarize multiple candidate solutions → generate diverse abstractions) followed by RL refinement mirrors the Why does SFT-then-RL training follow a predictable three-phase pattern? dynamic, but in a cooperative multi-agent setting.




reasoning abstractions decompose exploration into breadth-first strategy discovery and depth-first solution generation via two-player rl