
Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.

Note · 2026-02-22 · sourced from LLM Architecture
Related questions: How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

TRELAWNEY (2504.11336) identifies a structural mismatch in causal language model training: each token is predicted from previous context, but in human writing and reasoning, goals are typically known before exact arguments or phrasings. Teacher forcing compounds this — it accelerates training by providing correct previous output, but models trained this way latch onto local patterns and surface-level correlations rather than learning long-range dependencies.
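A toy sketch of teacher forcing makes the mismatch concrete (the names and stub model here are illustrative, not from the paper): every position is scored against the ground-truth prefix, so training signal flows strictly left to right and the eventual goal never conditions earlier predictions.

```python
import math

def teacher_forced_nll(logprob, tokens):
    """Sum of negative log-probabilities of each token given the
    ground-truth tokens before it: the standard causal LM training loss."""
    return -sum(logprob(tokens[:i], tokens[i]) for i in range(1, len(tokens)))

# Stub "model": a uniform distribution over a 4-token vocabulary.
VOCAB = 4
uniform_logprob = lambda prefix, token: math.log(1 / VOCAB)

loss = teacher_forced_nll(uniform_logprob, ["the", "knight", "set", "out"])
# 3 scored positions, each contributing log(4)
```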

The fix is data-centric rather than architectural. TRELAWNEY augments training data by splicing in lookahead spans, delimited by special tokens <T> and </T>, that carry future information from later in the sequence. The placement and content of these spans can be random or task-specific. The model then learns from the modified data with the standard training infrastructure: no architecture changes, no additional training tricks.
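A minimal sketch of the augmentation step, assuming a tokenized sequence represented as a list of strings (the function name and signature are illustrative; the paper's exact pipeline may differ):

```python
import random

def augment_with_lookahead(tokens, future_tokens, position=None, rng=None):
    """Splice a lookahead span <T> ... </T> into a training sequence.
    `future_tokens` is material drawn from later in the sequence (the goal);
    placement can be random or task-specific, per the data-centric recipe."""
    rng = rng or random.Random(0)
    if position is None:
        position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + ["<T>", *future_tokens, "</T>"] + tokens[position:]

story = ["the", "knight", "set", "out", "and", "finally",
         "reached", "the", "castle"]
goal = story[-3:]  # future information: the ending
augmented = augment_with_lookahead(story, goal, position=2)
# ['the', 'knight', '<T>', 'reached', 'the', 'castle', '</T>', 'set', 'out', ...]
```

Because the output is an ordinary token sequence, it drops straight into an unmodified next-token-prediction pipeline.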

The results span planning, algorithmic reasoning, and story generation. The model's goal generation capability — a natural byproduct of the training augmentation — can further improve planning and reasoning when used at inference time. This training-time goal conditioning is the complement of Does planning backward help when goals have bottlenecks?, which provides goal information at inference time by reversing search direction — TRELAWNEY internalizes backward planning's benefits during training.
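The inference-time use of goal generation can be sketched as two-stage decoding (an illustrative API, not the paper's code): force-open a lookahead span, let the model emit its own goal until it closes the span, then decode the main text conditioned on that self-generated goal.

```python
def generate_with_goal(generate, prompt, max_goal_len=16):
    """Force-open a <T> span, collect the model's self-generated goal until
    it emits </T>, and return the goal-conditioned context from which
    ordinary autoregressive decoding would continue."""
    goal, ctx = [], prompt + ["<T>"]
    for _ in range(max_goal_len):
        tok = generate(ctx + goal)  # `generate` returns one next token
        if tok == "</T>":
            break
        goal.append(tok)
    return prompt + ["<T>", *goal, "</T>"]

# Stub generator that plays back a scripted goal, for demonstration only.
script = iter(["reach", "the", "castle", "</T>"])
ctx = generate_with_goal(lambda c: next(script), ["the", "knight"])
# ctx == ['the', 'knight', '<T>', 'reach', 'the', 'castle', '</T>']
```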

This is a different intervention than multi-token prediction (Bachmann & Nagarajan, 2024; Gloeckle et al., 2024), which forces simultaneous prediction of multiple future tokens. Multi-token prediction modifies the training objective and often the architecture. TRELAWNEY modifies only the training data, making it compatible with existing infrastructure and scalable to any model size.
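The contrast can be made concrete by looking at what a single training position must predict under each scheme (a simplified sketch; real multi-token prediction uses extra output heads rather than tuple targets):

```python
def multi_token_targets(tokens, k):
    """Multi-token prediction, simplified: position i is trained to predict
    the next k tokens jointly, which requires k heads or a modified
    objective: an architecture/objective change."""
    return [tuple(tokens[i + 1 : i + 1 + k]) for i in range(len(tokens) - k)]

def trelawney_targets(tokens):
    """TRELAWNEY: the objective stays plain next-token prediction; only the
    input data changed (it may already contain <T>...</T> spans)."""
    return [(tokens[i + 1],) for i in range(len(tokens) - 1)]

seq = ["a", "b", "c", "d"]
multi = multi_token_targets(seq, k=2)  # [('b', 'c'), ('c', 'd')]
single = trelawney_targets(seq)        # [('b',), ('c',), ('d',)]
```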

Read alongside Does training data format shape reasoning strategy more than domain?, TRELAWNEY is evidence that a format intervention at the training-data level can have architecture-level effects. The lookahead tokens create a new "format" that teaches the model to condition generation on future goals, changing its reasoning strategy from purely autoregressive to goal-conditioned.

The connection to Can backward reasoning during training improve forward reasoning? is complementary: backward reasoning provides consistency checking from the end state, while lookahead tokens provide goal information from the future. Both address the forward-only limitation of standard next-token prediction from different angles.


