Can models treat long prompts as external code environments?
Can language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than by fitting everything into the transformer's attention window?
Context rot — quality degradation as context lengthens — affects even frontier models like GPT-5. Extending context windows is an arms race: each increase buys more capacity but doesn't solve the fundamental problem that attention-based processing degrades with length. Recursive Language Models sidestep this entirely by changing where the context lives.
The key insight: long prompts should not be fed into the transformer directly. Instead, they should be treated as part of an external environment that the model can symbolically interact with. In the RLM implementation, the prompt is stored as a variable in a Python REPL. The model reads, filters, chunks, and queries its context through code execution rather than token-space attention.
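A minimal sketch of that loop, assuming a hypothetical `run_repl_loop` driver, an `llm_generate` helper, and a `FINAL:` answer convention (none of these are the RLM paper's actual interface). The point it illustrates: the full prompt lives only in the REPL's namespace, and only short printed outputs ever re-enter the model's window.

```python
import contextlib
import io

def run_repl_loop(long_prompt: str, task: str, llm_generate) -> str:
    """Drive a model that probes `long_prompt` through exec'd Python code."""
    # The context is a variable in the environment, never in the model's window.
    env = {"context": long_prompt}
    transcript = f"Task: {task}\nA variable `context` holds the full input."
    for _ in range(10):  # cap the number of REPL turns
        code = llm_generate(transcript)  # model writes Python that probes `context`
        if code.strip().startswith("FINAL:"):  # assumed answer convention
            return code.strip()[len("FINAL:"):].strip()
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # run the model's probe against the stored context
        # Only the (short) printed output is appended to what the model sees.
        transcript += f"\n>>> {code}\n{buf.getvalue()}"
    return "no answer within the turn budget"
```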
Two mechanisms make this work:
Model priors enable context filtering without seeing it. The model uses its existing knowledge to construct targeted queries — regex searches for keywords, printing specific line ranges to inspect, narrowing the search space based on task understanding. It doesn't need to attend to 100K tokens to find the relevant 500. This is analogous to how humans skim a long document: prior knowledge guides where to look.
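To make this concrete, here are the kinds of probes the model might emit inside the REPL. The query strings ("quarterly revenue", "Acme Corp") and the stub `context` are invented for the example; inside the REPL, `context` is already bound to the full prompt.

```python
import re

# Stub so the snippet runs standalone; in the REPL, `context` holds the long prompt.
context = "filler line\n" * 2000

# Prior-guided keyword search: locate candidate regions without reading everything.
hits = [m.start() for m in re.finditer(r"quarterly revenue", context)]
print(hits[:10])  # offsets of the first few matches

# Inspect a specific line range around a promising region.
lines = context.splitlines()
print("\n".join(lines[1200:1240]))

# Narrow further: keep only lines mentioning a specific entity.
print([line for line in lines if "Acme Corp" in line][:20])
```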
Recursive sub-calls defer unbounded reasoning chains. When the context requires reasoning over multiple chunks, the model spawns sub-RLM calls, each operating on a manageable portion. The decomposition is dynamic — the model decides how to partition based on what it observes, not a predefined chunking strategy.
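A sketch of that decomposition, assuming a hypothetical `rlm_call(prompt, task)` entry point that launches a fresh sub-RLM. The split points are supplied by the model at runtime based on what its probes revealed, not by a fixed chunker.

```python
from typing import Callable

def answer_over_chunks(
    context: str,
    task: str,
    split_points: list[int],  # chosen by the model at runtime, not predefined
    rlm_call: Callable[[str, str], str],  # hypothetical sub-RLM entry point
) -> str:
    """Spawn one sub-RLM per model-chosen chunk, then synthesize the partials."""
    bounds = [0, *split_points, len(context)]
    partials = []
    for start, end in zip(bounds, bounds[1:]):
        chunk = context[start:end]  # a manageable portion of the context
        partials.append(rlm_call(chunk, task))
    # The final call reasons over short partial answers, never the raw context.
    return rlm_call("\n\n".join(partials), f"Combine these partial answers for: {task}")
```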
Results: RLMs handle inputs up to two orders of magnitude beyond model context windows. On shorter prompts (within context limits), RLMs still dramatically outperform base models and common long-context scaffolds, including context compaction. Cost per query is comparable or lower, because the model processes only the relevant portions of the context rather than attending to everything.
This connects to "Can models precompute answers before users ask questions?" as a second reframing of compute allocation: sleep-time asks WHEN to compute (before vs. during the query); RLMs ask WHERE to keep the data (the model's context vs. an external environment). Both reject the default of "stuff everything into the context window at query time."
Source: Inference time scaling
Related concepts in this collection
- Can models precompute answers before users ask questions?
Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
parallel reframing: sleep-time is temporal (when to compute), RLMs are spatial (where to keep data); both reject the default context-stuffing approach
- How should we categorize different test-time scaling approaches?
Test-time scaling research spans multiple strategies for improving model performance at inference. Understanding how these approaches differ—and how they relate—helps researchers and practitioners choose the right method for their constraints.
RLMs are a novel form of external test-time scaling: compute is spent on environmental interaction rather than model-internal reasoning
- Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
RLMs address this directly by offloading context to the environment; the model attends only to relevant fragments
- Can long-context models resolve retriever-reader imbalance?
Traditional RAG systems forced retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
RLMs take the opposite approach: they shift the burden to retrieval (code-based context probing) rather than reading (attention over everything)
- When should AI systems do their thinking?
Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
parallel temporal reframing: sleep-time asks WHEN to compute, RLMs ask WHERE to keep data; both reject the assumption that all processing must happen inside the context window at query time
Original note title: recursive language models treat long prompts as external environment enabling programmatic interaction 100x beyond context windows