Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. In the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency; the resulting saturation of the context window leaves less capacity for exploration. We study a simple mechanism that converts recurring reasoning fragments into concise, reusable “behaviors” (name + instruction) via the model’s own metacognitive analysis of prior traces. These behaviors are stored in a “behavior handbook,” which supplies them to the model in-context at inference time or distills them into parameters via supervised fine-tuning (SFT). This approach improves test-time reasoning in three settings: (1) Behavior-conditioned inference: providing the LLM relevant behaviors in-context during reasoning reduces the number of reasoning tokens by up to 46% while matching or improving baseline accuracy. (2) Behavior-guided self-improvement: without any parameter updates, the model improves its own future reasoning by leveraging behaviors extracted from its own past problem-solving attempts, yielding up to 10% higher accuracy than a naive critique-and-revise baseline. (3) Behavior-conditioned SFT: SFT on behavior-conditioned reasoning traces is more effective at converting non-reasoning models into reasoning models than vanilla SFT. Together, these results indicate that turning slow derivations into fast procedural hints lets LLMs remember how to reason, not just what to conclude.
LLMs have made rapid progress on mathematics, coding, and other multi-step tasks by generating long, deliberative chains of thought (Wei et al., 2022; Guo et al., 2025; Shao et al., 2024; OpenAI, 2024; Muennighoff et al., 2025; Ye et al., 2025; Gao et al., 2024; Lambert et al., 2024; Team et al., 2025). Yet this capability exposes a structural inefficiency: each new problem triggers reconstruction of ubiquitous sub-procedures (e.g., finite-series sums, case splits, unit conversions), inflating token usage and latency. For instance, suppose the LLM derives the finite geometric series formula while solving one problem. Can it avoid re-deriving the formula from scratch when similar reasoning is needed for another problem? Current inference loops lack a mechanism to promote frequently rediscovered patterns into a compact, retrievable form.
We introduce a metacognitive pathway that extracts and reuses such patterns. Given a problem, the model first solves it, then reflects on its trace to identify generalizable steps, and finally emits a set of behaviors: short, actionable instructions with canonical names. These behaviors populate a searchable handbook (a procedural memory) that can be provided in-context at test time or internalized through supervised fine-tuning, yielding a framework for turning verbose derivations into quick reflexes.
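To make the pipeline concrete, the following is a minimal Python sketch of the solve-reflect-extract loop; the `llm` stub, the prompt wording, and the `Behavior` schema are illustrative assumptions rather than the paper's exact implementation.

```python
import json
from dataclasses import dataclass

def llm(prompt: str) -> str:
    """Stand-in for any LLM completion API (an assumption, not the paper's)."""
    raise NotImplementedError("plug in an LLM API here")

@dataclass
class Behavior:
    name: str         # canonical name, e.g. "finite_geometric_series"
    instruction: str  # short, actionable how-to hint

def extract_behaviors(question: str, trace: str) -> list[Behavior]:
    """Metacognitive step: the model mines its own trace for
    generalizable, reusable reasoning steps."""
    prompt = (
        "You solved the problem below. List reasoning steps that would "
        "generalize to other problems as a JSON array of objects with "
        "fields 'name' and 'instruction'.\n\n"
        f"Problem: {question}\n\nSolution trace:\n{trace}"
    )
    return [Behavior(**b) for b in json.loads(llm(prompt))]

# Behavior handbook: a procedural memory keyed by canonical behavior name.
handbook: dict[str, Behavior] = {}

def solve_and_learn(question: str) -> str:
    trace = llm(f"Solve step by step:\n{question}")   # 1) solve
    for b in extract_behaviors(question, trace):      # 2) reflect and extract
        handbook.setdefault(b.name, b)                # 3) store for later reuse
    return trace
```

Keying entries by canonical name means repeated discoveries of the same sub-procedure collapse into a single handbook entry rather than accumulating duplicates.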
Unlike typical memory or Retrieval-Augmented Generation (RAG) systems, which store declarative facts for tasks such as factual question answering (Borgeaud et al., 2022; Lewis et al., 2020), the handbook targets procedural knowledge (Willingham et al., 1989) about how to think. Rather than being assembled from curated documents or knowledge graphs, it is generated by the model itself: it emerges from the model's own metacognitive cycle of critiquing its chain-of-thought and abstracting repeated reasoning patterns into behaviors.
We evaluate three instantiations of the proposed framework. (i) Behavior-conditioned inference: providing behaviors obtained from previously solved questions in-context yields reasoning chains that use up to 46% fewer tokens while matching or improving strong performance across the MATH and AIME benchmarks. (ii) Behavior-guided self-improvement: while solving a problem, giving the model access to behaviors it extracted from its own past attempts at that question improves accuracy by up to 10% compared to a naive self-improvement baseline. (iii) Behavior-conditioned SFT: training on reasoning traces generated via behavior-conditioned inference yields models that are both more accurate and more concise than models trained on ordinary traces, especially when turning non-reasoning models into reasoning models.
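As an illustration of setting (i), the retrieval step for behavior-conditioned inference might look like the sketch below; the embedding stub and the top-k cosine-similarity ranking are our assumptions, since any retriever over the handbook would serve.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for any sentence-embedding model (an assumption)."""
    raise NotImplementedError("plug in an embedding model here")

def retrieve_behaviors(question: str, handbook: dict, k: int = 5) -> list:
    """Rank handbook behaviors by cosine similarity to the question
    and return the top-k to be supplied in-context."""
    q = embed(question)
    scored = []
    for b in handbook.values():
        v = embed(f"{b.name}: {b.instruction}")
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((sim, b))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [b for _, b in scored[:k]]

def behavior_conditioned_prompt(question: str, handbook: dict) -> str:
    """Prepend the retrieved behaviors as concise procedural hints."""
    hints = "\n".join(
        f"- {b.name}: {b.instruction}"
        for b in retrieve_behaviors(question, handbook)
    )
    return f"Useful behaviors:\n{hints}\n\nSolve step by step:\n{question}"
```

The same prompt-construction step supports setting (iii): traces sampled from behavior-conditioned prompts become the supervised fine-tuning data.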