Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does planning backward help when goals have bottlenecks?

Can language models exploit structural asymmetries in planning problems by reversing the search direction? This matters because most planning research assumes forward-only generation, potentially missing efficiency gains when a bottleneck near the goal constrains the search early in the backward direction.

Note · 2026-02-22 · sourced from Reasoning Architectures

Most LLM planning research studies the forward direction only: generating steps from the initial state toward the goal. But many planning problems exhibit an inherent directional asymmetry: generating the correct final steps leading to the goal can be much easier than generating the correct steps from the beginning. This asymmetry is driven by bottlenecks near the goal.

The canonical example: a robot navigating to a bedroom at the end of a narrow hallway. Planning backward from the bedroom, the first step is constrained by the hallway (one possible path). Planning forward from the start, possibilities fan out quickly before the hallway constraint appears. The backward direction is easier because the bottleneck constrains the search space earlier in the backward chain.
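To make the asymmetry concrete, here is a minimal sketch on a toy map of my own construction (a 5×5 open room feeding a one-cell-wide hallway that ends at the goal; the layout and coordinates are illustrative, not from the paper). It counts how many distinct new states each search direction must consider at each depth:

```python
# Toy map: a 5x5 open room plus a one-cell-wide hallway to the goal.
room = {(x, y) for x in range(5) for y in range(5)}
hall = {(5, 2), (6, 2), (7, 2)}          # the narrow hallway
cells = room | hall
start, goal = (0, 0), (7, 2)

def neighbors(c):
    x, y = c
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [n for n in steps if n in cells]

def frontier_sizes(origin, depth):
    """Count distinct new states reachable at each search depth."""
    seen, frontier, sizes = {origin}, [origin], []
    for _ in range(depth):
        nxt = []
        for c in frontier:
            for n in neighbors(c):
                if n not in seen:
                    seen.add(n)
                    nxt.append(n)
        sizes.append(len(nxt))
        frontier = nxt
    return sizes

# Forward search fans out into the room immediately; backward search
# is pinned to the hallway for its first steps (frontier stays at 1).
print("forward :", frontier_sizes(start, 5))  # [2, 3, 4, 5, 4]
print("backward:", frontier_sizes(goal, 5))   # [1, 1, 1, 3, 5]
```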

The LLM finding: a model's planning performance in a given direction tracks the planning complexity of the problem in that direction. Which direction is easier is therefore problem-specific, not universal. The paper demonstrates this holds for LLM planning, not just in analytical planning theory.

However, backward planning in LLMs is systematically biased: models perform worse when asked to plan in the backward direction directly (mirroring the intuitive difficulty humans have with backward reasoning). The solution is to flip the problem: swap the start and goal states, then plan forward in the flipped problem. This avoids the backward bias while exploiting the backward direction's structural advantage, as the sketch below illustrates.
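A minimal sketch of the flip trick, under assumptions not in the note: `Problem`, `flip`, `plan_via_flip`, and `toy_propose_plan` are hypothetical names, and `toy_propose_plan` is a hard-coded stand-in for a forward-planning LLM call.

```python
from dataclasses import dataclass

# Inverting an action undoes it; needed to map the flipped plan back.
INVERSE = {"north": "south", "south": "north", "east": "west", "west": "east"}

@dataclass
class Problem:
    start: str
    goal: str

def flip(problem: Problem) -> Problem:
    # Swap start and goal: planning backward on the original becomes
    # planning forward on the flipped problem.
    return Problem(start=problem.goal, goal=problem.start)

def plan_via_flip(problem: Problem, propose_plan) -> list[str]:
    forward_plan = propose_plan(flip(problem))  # an ordinary forward call
    # Undo the flip: reverse the step order and invert each action.
    return [INVERSE[a] for a in reversed(forward_plan)]

# Hypothetical stand-in for the LLM: the flipped problem starts at the
# hallway bottleneck, so its first steps are forced.
def toy_propose_plan(p: Problem) -> list[str]:
    return ["west", "west", "south"]  # bedroom -> entrance

print(plan_via_flip(Problem("entrance", "bedroom"), toy_propose_plan))
# -> ['north', 'east', 'east']  (a valid plan for the original problem)
```

Because the model only ever sees a forward-planning prompt, the backward bias never enters, while the structural advantage of starting at the bottleneck is preserved.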

Results: combining planning in both directions with self-verification improves overall planning success by 4–24% across three planning domains. The pooled candidate plans (forward plus backward) are more diverse than those from either direction alone.
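The overall pipeline could be wired up roughly as follows; this is a sketch under assumptions, with `generate_forward`, `plan_via_flip`, and `verify` as hypothetical stand-ins for the LLM sampler and the plan checker (e.g., simulating each candidate against the environment), not the paper's actual interfaces.

```python
def best_plan(problem, generate_forward, plan_via_flip, verify, n=4):
    """Pool candidate plans from both directions, then self-verify."""
    candidates = []
    for _ in range(n):
        candidates.append(generate_forward(problem))  # forward direction
        candidates.append(plan_via_flip(problem))     # backward via flip
    # Self-verification: keep only plans that check out (e.g. when
    # simulated step by step), then return any surviving candidate.
    verified = [p for p in candidates if verify(problem, p)]
    return verified[0] if verified else None
```

Pooling both directions is what buys the candidate diversity; verification is what converts that diversity into higher success rates.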

This connects to the question of how to balance parallel versus sequential compute at test time, but here the dimension is directional rather than just parallel/sequential. Generating diverse candidates by exploring different directions is a form of parallel planning.


Source: Reasoning Architectures
