Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. This is achieved by modeling natural language not as a linear sequence but as a reasoning tree, measured by both length and depth. Reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions, building on the concept we proposed in Schroeder et al. (2025). During generation, we maintain a working memory that retains only the key/value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling positional embeddings and GPU memory pages to be reused throughout reasoning. Experimental results show that our system sustains high inference throughput even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information-retrieval challenges that require long-horizon reasoning and multi-hop tool use. More details can be found at https://github.com/subconscious-systems/TIMRUN.
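To make the working-memory mechanism concrete, the following Python sketch shows one way such a rule-based subtask-pruning rule could be expressed. The names `TaskNode`, `WorkingMemory`, and `kv_span` are illustrative, not the TIMRUN API; this is a minimal sketch under the assumption that each task owns the KV-cache span of its own tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskNode:
    """One node of the reasoning tree: a thought, optional subtasks, a conclusion."""
    thought: str
    conclusion: Optional[str] = None                  # set once the task completes
    subtasks: List["TaskNode"] = field(default_factory=list)
    kv_span: Tuple[int, int] = (0, 0)                 # [start, end) token positions in the KV cache

    @property
    def done(self) -> bool:
        return self.conclusion is not None


class WorkingMemory:
    """Tracks which KV-cache spans the model still attends to."""

    def __init__(self) -> None:
        self.live_spans: List[Tuple[int, int]] = []

    def prune(self, task: TaskNode) -> None:
        # Rule: once every subtask of `task` has concluded, evict the
        # subtasks' token spans; the parent keeps only its own span,
        # which carries the thought and the subtasks' conclusions.
        if task.subtasks and all(sub.done for sub in task.subtasks):
            for sub in task.subtasks:
                if sub.kv_span in self.live_spans:
                    self.live_spans.remove(sub.kv_span)   # positions/pages become reusable
        if task.kv_span not in self.live_spans:
            self.live_spans.append(task.kv_span)
```

Because evicted spans free contiguous positions, later tokens can reuse those positional embeddings and GPU memory pages, which is what keeps the cache bounded over long trajectories.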
Large language models (LLMs) have emerged as versatile foundations for a wide range of AI applications, especially agents that handle complicated tasks involving multi-hop reasoning and tool use. Their ability to generalize across various tasks with minimal fine-tuning has driven rapid innovation and broad adoption (Brown et al., 2020). However, the fundamental objective of language modeling, generating unstructured token sequences (Bengio et al., 2003), imposes strict context window limits and makes fine-grained control over internal state difficult. These inherent constraints challenge all state-of-the-art LLMs: they struggle to maintain long-horizon reasoning trajectories and to coordinate complex workflows, which hinders the development of robust, memory-intensive applications.
To work around the working-memory bottleneck, developers frequently partition complex workflows into multiple modules (namely, multi-agent architectures), each backed by a separate model instance responsible for a distinct subtask. Multi-agent frameworks (Li et al., 2023; Hong et al., 2024; Wu et al., 2024) facilitate such workflows by dividing problems into tractable units. Domain-focused workflows demonstrate the power of agent societies in highly specialized settings with strong prior knowledge and well-defined scope. However, these multi-agent designs introduce significant overhead when handling more arbitrary tasks, since agents do not inherently manage control flow or coordination, leaving developers to hand-craft context management, exception handling, and inter-agent communication. Integrating external tools further compounds this complexity: parameter generation, tool calling, and tool-response processing are usually handled by different modules, inflating both development effort and runtime latency.
We believe that reasoning is not a linear process; it is recursively structured with inner dependencies, just like language (Aho & Ullman, 1972), as everyday experience suggests. For example, in programming tasks, we often focus on the lines around the cursor, recall the inputs and outputs of the functions we have completed, and keep TODOs in mind. We no longer memorize all the details of a completed function, since our subconscious brain has flushed that information out of working memory to help us focus on the current task. Inspired by this observation, we propose a new perspective to avoid the context and representation bottlenecks faced by traditional neural language models. We model a reasoning trajectory as a recursive tree of subtasks. While higher-level nodes in the tree receive tasks that require extensive multi-hop reasoning and tool use, the tree keeps decomposing complex instructions into simpler subtasks until reaching a leaf node, which represents a straightforward task that can be completed within one step. Our hypothesis is that processing an intermediate task does not require attending to the subtasks of previous steps.
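As a sketch of this tree-of-subtasks view, the short Python program below walks a toy decomposition. The helpers `is_atomic`, `decompose`, and `execute` are hypothetical stand-ins for behavior that the model itself would produce; they are not TIM components.

```python
from typing import List


def is_atomic(task: str) -> bool:
    # Stand-in for the model's judgment that a task is one-step.
    return " and " not in task


def decompose(task: str) -> List[str]:
    # Stand-in decomposition: split a conjunction into subtasks.
    return [part.strip() for part in task.split(" and ")]


def execute(task: str) -> str:
    # Stand-in for completing a leaf task in a single step.
    return f"done({task})"


def solve(task: str) -> str:
    """Recursively decompose `task`; return only its conclusion."""
    if is_atomic(task):
        return execute(task)                  # leaf node: one-step task
    # Each subtask hands back only its conclusion; its inner reasoning
    # and its own subtasks are never revisited, mirroring the hypothesis
    # that an intermediate task need not attend to earlier subtasks.
    conclusions = [solve(sub) for sub in decompose(task)]
    return f"conclusion of {task!r}: {conclusions}"


print(solve("find the population of France and report its largest prime factor"))
```

Only the conclusions propagate upward in this sketch, which is exactly what lets the working memory discard the pruned subtasks' tokens.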