INQUIRING LINE

Can episodic and semantic memory improve long-horizon task reasoning?

This explores whether giving models two kinds of memory — episodic (records of specific past attempts) and semantic (distilled general knowledge) — actually helps them reason over long, multi-step tasks, rather than just stuffing more into the context window.


This reads the question as asking whether two distinct kinds of memory — episodic (the model's own concrete past attempts) and semantic (distilled, transferable knowledge) — can help a model hold a long, multi-step task together. The corpus doesn't have a paper that uses those exact labels, but it has something better: work that quietly draws the same line and shows the distinction matters.

The sharpest example is SkillRL ("Should successful and failed episodes be processed differently?"), which handles successful and failed episodes differently: successes are kept as concrete demonstrations (episodic), while failures are abstracted into lessons (semantic). That asymmetry isn't cosmetic: SkillRL reaches state-of-the-art performance on complex tasks while using substantially less context than uniform approaches that consolidate every episode the same way. The takeaway worth carrying: *how* you store a memory matters more than *how much* you store. A pile of raw trajectories degrades; differentiated memory compounds.
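The asymmetric routing described above can be sketched in a few lines. This is a minimal illustration, not SkillRL's implementation: the names `Episode`, `DifferentiatedMemory`, and `last_step_lesson` are hypothetical, and the failure summarizer is a trivial stand-in for whatever abstraction step a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Episode:
    task: str
    steps: List[str]
    succeeded: bool


@dataclass
class DifferentiatedMemory:
    # Hypothetical store; structure and names are illustrative only.
    summarize_failure: Callable[[Episode], str]
    demonstrations: List[Episode] = field(default_factory=list)  # episodic side
    lessons: List[str] = field(default_factory=list)             # semantic side

    def add(self, episode: Episode) -> None:
        if episode.succeeded:
            # Successes are kept whole, as concrete demonstrations.
            self.demonstrations.append(episode)
        else:
            # Failures are reduced to a short lesson; duplicates are dropped.
            lesson = self.summarize_failure(episode)
            if lesson not in self.lessons:
                self.lessons.append(lesson)


def last_step_lesson(episode: Episode) -> str:
    # Stand-in abstraction (an assumption): blame the final step of a failed run.
    last = episode.steps[-1] if episode.steps else "no action"
    return f"On '{episode.task}', avoid ending with: {last}"
```

Two identical failures collapse into one lesson here, while each success stays available as a full trajectory. That is the sense in which the failure side of the store grows more slowly than a uniform log would.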

Why storing less can help is explained by a problem you might not expect: long inputs hurt reasoning well before the context window fills. In FLenQA, accuracy drops from 92% to 68% with just 3,000 tokens of padding, and the degradation is task-agnostic and persists even with chain-of-thought prompting ("Does reasoning ability actually degrade with longer inputs?"). So naive episodic memory (dumping every past step back into the prompt) can actively backfire. This reframes memory design as a compression problem: part of the value of semantic abstraction is that it keeps the working context short enough to reason in at all.
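One way to picture memory design as a compression problem is a context builder that works under an explicit budget. The sketch below is an assumption-laden illustration: `count_tokens` uses whitespace-separated words as a crude proxy for tokens, and the ordering policy (short lessons first, raw trajectories only if room remains) is one plausible choice, not a method taken from the cited work.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Crude assumption: whitespace-separated words stand in for tokens.
    return len(text.split())


def build_context(lessons: List[str], trajectories: List[str], budget: int) -> List[str]:
    """Fill the budget with distilled lessons first, then raw trajectories if room remains."""
    selected: List[str] = []
    used = 0
    for item in lessons + trajectories:
        cost = count_tokens(item)
        if used + cost > budget:
            continue  # skip any item that would overflow the budget
        selected.append(item)
        used += cost
    return selected
```

With a tight budget, a long raw trajectory is simply left out while the short lessons still fit, so the working context stays small regardless of how much history has accumulated.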

The episodic/semantic split also mirrors a deeper finding about where reasoning ability comes from. Analysis of 5 million pretraining documents shows that reasoning generalizes from broad *procedural* knowledge (transferable how-to patterns), while factual recall depends on narrow, document-specific memorization ("Does procedural knowledge drive reasoning more than factual retrieval?"). That's the same episodic-vs-semantic axis one level down: concrete instances support recall, abstracted procedure supports reasoning. It suggests that for long-horizon tasks the semantic side of memory does the heavy lifting, with episodic instances as grounding.

Two adjacent doorways round this out. Lookahead tokens ("Can embedding future information in training data improve planning?") show that planning can be improved by embedding future-goal information into training data, a kind of memory-of-the-destination that helps a model stay on course over many steps without architecture changes. And cognitive tools ("Can modular cognitive tools unlock reasoning without training?") show that isolating reasoning operations into modular, sandboxed calls unlocks latent ability, acting as an externalized, structured form of working memory. The thread across all of these: long-horizon reasoning improves less from raw recall and more from memory that has been *shaped* (abstracted, goal-conditioned, or modularized) before it re-enters the model's limited reasoning window.
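The isolation idea behind cognitive tools can be illustrated with a small dispatcher in which each tool call sees only its own prompt and payload, never the accumulated history. Everything here is hypothetical: the tool names `restate` and `check`, their prompt wording, and the `ModelFn` callable are placeholders for illustration, not the tools or API used in the cited work.

```python
from typing import Callable, Dict

# Any function that maps a prompt string to a completion string (an assumption).
ModelFn = Callable[[str], str]

# Hypothetical tool prompts; names and wording are illustrative only.
TOOL_PROMPTS: Dict[str, str] = {
    "restate": "Restate the problem in your own words:\n{payload}",
    "check": "Check this answer for errors:\n{payload}",
}


def run_tool(name: str, payload: str, model: ModelFn) -> str:
    """Run one tool in isolation: the model sees only this tool's prompt and payload."""
    prompt = TOOL_PROMPTS[name].format(payload=payload)
    return model(prompt)
```

Because `run_tool` builds a fresh prompt on every call, nothing from earlier steps leaks in unless the caller passes it explicitly in `payload`. That is the kind of operation isolation the paragraph above contrasts with a single long prompt.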


Sources (5 notes)

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can embedding future information in training data improve planning?

TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
