
Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.

Note · 2026-02-22 · sourced from LLM Architecture
Related questions: How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

TRELAWNEY (2504.11336) identifies a structural mismatch in causal language model training: each token is predicted from previous context, but in human writing and reasoning, goals are typically known before exact arguments or phrasings. Teacher forcing compounds this — it accelerates training by providing correct previous output, but models trained this way latch onto local patterns and surface-level correlations rather than learning long-range dependencies.
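A toy sketch of teacher forcing makes the mismatch concrete (the names and stub model here are illustrative, not from the paper): every position is scored against the ground-truth prefix, so training signal flows strictly left to right and the eventual goal never conditions earlier predictions.

```python
import math

def teacher_forced_nll(logprob, tokens):
    """Sum of negative log-probabilities of each token given the
    ground-truth tokens before it: the standard causal LM training loss."""
    return -sum(logprob(tokens[:i], tokens[i]) for i in range(1, len(tokens)))

# Stub "model": a uniform distribution over a 4-token vocabulary.
VOCAB = 4
uniform_logprob = lambda prefix, token: math.log(1 / VOCAB)

loss = teacher_forced_nll(uniform_logprob, ["the", "knight", "set", "out"])
# 3 scored positions, each contributing log(4)
```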

The fix is data-centric rather than architectural. TRELAWNEY augments training data by splicing in lookahead spans, delimited by special tokens <T> and </T>, that carry future information from later in the sequence. The placement and content of these spans can be random or task-specific. The model then learns from the modified data with the standard training infrastructure: no architecture changes, no additional training tricks.
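A minimal sketch of the augmentation step, assuming a tokenized sequence represented as a list of strings (the function name and signature are illustrative; the paper's exact pipeline may differ):

```python
import random

def augment_with_lookahead(tokens, future_tokens, position=None, rng=None):
    """Splice a lookahead span <T> ... </T> into a training sequence.
    `future_tokens` is material drawn from later in the sequence (the goal);
    placement can be random or task-specific, per the data-centric recipe."""
    rng = rng or random.Random(0)
    if position is None:
        position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + ["<T>", *future_tokens, "</T>"] + tokens[position:]

story = ["the", "knight", "set", "out", "and", "finally",
         "reached", "the", "castle"]
goal = story[-3:]  # future information: the ending
augmented = augment_with_lookahead(story, goal, position=2)
# ['the', 'knight', '<T>', 'reached', 'the', 'castle', '</T>', 'set', 'out', ...]
```

Because the output is an ordinary token sequence, it drops straight into an unmodified next-token-prediction pipeline.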

The results span planning, algorithmic reasoning, and story generation. The model's goal generation capability — a natural byproduct of the training augmentation — can further improve planning and reasoning when used at inference time. This training-time goal conditioning is the complement of Does planning backward help when goals have bottlenecks?, which provides goal information at inference time by reversing search direction — TRELAWNEY internalizes backward planning's benefits during training.
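The inference-time use of goal generation can be sketched as two-stage decoding (an illustrative API, not the paper's code): force-open a lookahead span, let the model emit its own goal until it closes the span, then decode the main text conditioned on that self-generated goal.

```python
def generate_with_goal(generate, prompt, max_goal_len=16):
    """Force-open a <T> span, collect the model's self-generated goal until
    it emits </T>, and return the goal-conditioned context from which
    ordinary autoregressive decoding would continue."""
    goal, ctx = [], prompt + ["<T>"]
    for _ in range(max_goal_len):
        tok = generate(ctx + goal)  # `generate` returns one next token
        if tok == "</T>":
            break
        goal.append(tok)
    return prompt + ["<T>", *goal, "</T>"]

# Stub generator that plays back a scripted goal, for demonstration only.
script = iter(["reach", "the", "castle", "</T>"])
ctx = generate_with_goal(lambda c: next(script), ["the", "knight"])
# ctx == ['the', 'knight', '<T>', 'reach', 'the', 'castle', '</T>']
```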

This is a different intervention than multi-token prediction (Bachmann & Nagarajan, 2024; Gloeckle et al., 2024), which forces simultaneous prediction of multiple future tokens. Multi-token prediction modifies the training objective and often the architecture. TRELAWNEY modifies only the training data, making it compatible with existing infrastructure and scalable to any model size.
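The contrast can be made concrete by looking at what a single training position must predict under each scheme (a simplified sketch; real multi-token prediction uses extra output heads rather than tuple targets):

```python
def multi_token_targets(tokens, k):
    """Multi-token prediction, simplified: position i is trained to predict
    the next k tokens jointly, which requires k heads or a modified
    objective: an architecture/objective change."""
    return [tuple(tokens[i + 1 : i + 1 + k]) for i in range(len(tokens) - k)]

def trelawney_targets(tokens):
    """TRELAWNEY: the objective stays plain next-token prediction; only the
    input data changed (it may already contain <T>...</T> spans)."""
    return [(tokens[i + 1],) for i in range(len(tokens) - 1)]

seq = ["a", "b", "c", "d"]
multi = multi_token_targets(seq, k=2)  # [('b', 'c'), ('c', 'd')]
single = trelawney_targets(seq)        # [('b',), ('c',), ('d',)]
```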

Read alongside Does training data format shape reasoning strategy more than domain?, TRELAWNEY is evidence that a format intervention at the training-data level can have architecture-level effects. The lookahead tokens create a new "format" that teaches the model to condition generation on future goals, changing its reasoning strategy from purely autoregressive to goal-conditioned.

The connection to Can backward reasoning during training improve forward reasoning? is complementary: backward reasoning provides consistency checking from the end state, while lookahead tokens provide goal information from the future. Both address the forward-only limitation of standard next-token prediction from different angles.


