LLM Reasoning and Architecture

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?

Note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Context rot — quality degradation as context lengthens — affects even frontier models like GPT-5. Extending context windows is an arms race: each increase buys more capacity but doesn't solve the fundamental problem that attention-based processing degrades with length. Recursive Language Models sidestep this entirely by changing where the context lives.

The key insight: long prompts should not be fed into the transformer directly. Instead, they should be treated as part of an external environment that the model can symbolically interact with. In the RLM implementation, the prompt is stored as a variable in a Python REPL. The model reads, filters, chunks, and queries its context through code execution rather than token-space attention.
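A minimal sketch of the pattern, not the actual RLM implementation: the long prompt is bound to a `context` variable in a REPL namespace, the model emits code against it, and only the (truncated) output of that code returns to the model's context. The `rlm_answer` and `llm_call` names, the `FINAL:` convention, and the loop cap are all assumptions for illustration.

```python
import io
import contextlib

def rlm_answer(long_prompt: str, question: str, llm_call) -> str:
    """Answer `question` about `long_prompt` without ever putting the
    full prompt into the model's context window."""
    namespace = {"context": long_prompt}      # prompt lives in the REPL, not in tokens
    transcript = (
        f"Question: {question}\n"
        "The full input is stored in the variable `context` in a Python REPL.\n"
        "Reply with code to run, or with FINAL: <answer>."
    )
    for _ in range(10):                       # cap the interaction loop
        action = llm_call(transcript)         # model sees only the transcript so far
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(action, namespace)           # run model-written code on the context
        transcript += f"\n>>> {action}\n{buf.getvalue()[:2000]}"  # truncated output
    return "budget exhausted"
```

The point of the design is that `transcript` stays small: the 100K-token input never enters the model; only short code outputs do.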

Two mechanisms make this work:

Model priors enable context filtering without seeing it. The model uses its existing knowledge to construct targeted queries — regex searches for keywords, printing specific line ranges to inspect, narrowing the search space based on task understanding. It doesn't need to attend to 100K tokens to find the relevant 500. This is analogous to how humans skim a long document: prior knowledge guides where to look.
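For instance, the filtering step might consist of commands like the following, written by the model inside the REPL. The file path and search string are illustrative stand-ins; in an RLM the `context` variable is already bound, as in the sketch above.

```python
import re

context = open("long_prompt.txt").read()     # stand-in; in an RLM this is pre-bound

# Regex search driven by priors: guess the keywords before seeing the text.
hits = [m.start() for m in re.finditer(r"quarterly revenue", context)]
print(hits[:5])

# Inspect a specific line range instead of reading everything.
lines = context.splitlines()
print("\n".join(lines[1200:1260]))

# Cheap global stats guide the next, narrower query.
print(len(context), len(lines))
```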

Recursive sub-calls defer unbounded reasoning chains. When the context requires reasoning over multiple chunks, the model spawns sub-RLM calls, each operating on a manageable portion. The decomposition is dynamic — the model decides how to partition based on what it observes, not a predefined chunking strategy.
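A hedged sketch of that decomposition, reusing the hypothetical `rlm_answer`, `llm_call`, and `context` from the sketches above; the partition rule and sub-question are invented for illustration. What matters is that the split is chosen at runtime from what the model observed, not fixed in advance.

```python
# The model inspects the document's structure first, then chooses a partition;
# here, splitting on top-level headings is the choice it arrived at.
sections = context.split("\n# ")

# Priors prune before recursing; each surviving chunk gets its own sub-call
# that fits comfortably within a normal context window.
answers = [
    rlm_answer(sec, "Does this section discuss the merger? Quote the line.", llm_call)
    for sec in sections
    if "merger" in sec.lower()
]
print(answers)
```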

Results: RLMs handle inputs up to two orders of magnitude beyond model context windows. On shorter prompts that fit within context limits, RLMs still dramatically outperform base models and common long-context scaffolds, including context compaction. Per-query cost is comparable or lower, because the model processes only the relevant portions of the context rather than attending to everything.

This connects to Can models precompute answers before users ask questions? as a second reframing of compute allocation: sleep-time compute asks WHEN to compute (before vs. during the query); RLMs ask WHERE the data lives (the model's context vs. an external environment). Both reject the default of stuffing everything into the context window at query time.


Source: Inference time scaling
