LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Note · 2026-02-23 · sourced from Memory
How should we allocate compute budget at inference time? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The Thread Inference Model (TIM) starts from the observation that reasoning is not linear — it is recursively structured with inner dependencies, like language itself. Programming provides the intuition: you focus on lines around the cursor, recall inputs/outputs of completed functions, keep TODOs in mind, but don't memorize all details of a completed function. Your brain flushes resolved subproblems to focus on the current task.

TIM models reasoning trajectories as recursive trees of subtasks. Higher-level nodes receive complex instructions requiring multi-hop reasoning and tool use. The tree decomposes until reaching leaf nodes — straightforward tasks completable in one step. The key hypothesis: processing an intermediate task does not need to attend to the completed subtasks of previous steps.
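A minimal sketch of this structure, assuming a hypothetical `Subtask` node type and an `answer_leaf` callable standing in for single-step model execution; none of these names come from the TIM paper:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Subtask:
    """One node in a recursive reasoning tree (illustrative structure)."""
    instruction: str                      # what this node must accomplish
    children: list["Subtask"] = field(default_factory=list)
    conclusion: Optional[str] = None      # filled in when the node completes

    def is_leaf(self) -> bool:
        # A leaf is a task simple enough to finish in a single step.
        return not self.children

def solve(task: Subtask, answer_leaf) -> str:
    """Depth-first traversal: solve children, then summarize into a conclusion."""
    if task.is_leaf():
        task.conclusion = answer_leaf(task.instruction)
    else:
        # The parent only sees each child's conclusion, never its full trace.
        child_results = [solve(child, answer_leaf) for child in task.children]
        task.conclusion = f"{task.instruction} -> " + "; ".join(child_results)
    return task.conclusion
```

Calling `solve` on the root returns its conclusion; completed children survive only as their short conclusions, which is exactly the information the pruning step described next retains.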

The working memory mechanism is a KV cache management system that retains only the key/value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism. When a subtask completes, its detailed KV states are pruned from working memory; only its conclusion is retained for the parent task. The effect is that working memory is bounded by the active path through the tree rather than by the full reasoning trajectory.
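A toy illustration of that pruning rule, assuming a flat mapping from token positions to cached key/value pairs plus a per-subtask record of which positions it produced; the names `WorkingMemory`, `register`, and `complete` are illustrative, not the paper's API:

```python
class WorkingMemory:
    """Toy KV-cache manager: keeps only tokens still relevant to open subtasks."""

    def __init__(self):
        self.kv = {}     # position -> (key, value) tensors (stubbed here)
        self.owner = {}  # position -> id of the subtask that produced the token

    def register(self, subtask_id: str, position: int, key, value):
        # Every generated token is cached and attributed to its subtask.
        self.kv[position] = (key, value)
        self.owner[position] = subtask_id

    def complete(self, subtask_id: str, conclusion_positions: set[int]):
        # Rule-based pruning: when a subtask finishes, drop its detailed tokens
        # and retain only the positions that carry its conclusion.
        for pos in [p for p, sid in self.owner.items() if sid == subtask_id]:
            if pos not in conclusion_positions:
                del self.kv[pos]
                del self.owner[pos]
```

Attention for the current subtask would then only ever see the positions still present in `self.kv`, so the active working set stays small no matter how long the overall trajectory grows.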

The system sustains high inference throughput even when manipulating up to 90% of the KV cache. This is not merely a theoretical projection: the reported experiments demonstrate accurate reasoning on mathematical tasks and on information-retrieval tasks requiring long-horizon, multi-hop tool use.
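To put the 90% figure in perspective, a back-of-the-envelope estimate of KV-cache memory, with all model dimensions chosen purely for illustration rather than taken from TIM's configuration:

```python
# Illustrative KV-cache sizing; the shape below is an assumption, not TIM's setup.
layers, kv_heads, head_dim = 32, 8, 128       # hypothetical transformer shape
bytes_per_value = 2                            # fp16
tokens = 100_000                               # long reasoning trajectory

bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_per_value  # keys + values
full_cache_gb = tokens * bytes_per_token / 1e9
pruned_cache_gb = full_cache_gb * 0.10         # retaining ~10% after pruning

print(f"full: {full_cache_gb:.1f} GB, pruned: {pruned_cache_gb:.1f} GB")
# -> full: 13.1 GB, pruned: 1.3 GB
```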

This addresses the multi-agent overhead problem directly. Since current LLM context limits force developers to partition complex workflows into multi-agent architectures (each backed by a separate model instance), TIM enables a single model to handle the full recursive reasoning internally. The coordination cost, exception handling, and inter-agent communication overhead of multi-agent designs are eliminated.

Since Can parallel architectures solve fundamentally sequential problems? argues that some problems fundamentally require sequential depth, TIM provides a mechanism for achieving that depth without context window constraints. And in terms of Can reasoning topologies be formally classified as graph types?, TIM's recursive trees are a concrete implementation of tree-of-thought reasoning in which branching is driven by task decomposition and pruning is driven by completion.


Source: Memory

Original note title: reasoning modeled as recursive subtask trees with KV cache pruning enables unlimited working memory beyond context limits