INQUIRING LINE

What update rules should govern dialogue-scoped versus turn-scoped memory?

This explores how memory that spans a whole conversation should be updated differently from memory that lives inside a single turn — and what the corpus says about matching update rules to each timescale.


The cleanest map in the corpus comes from RAISE, which splits agent working memory into four parts across two timescales: dialogue-level pieces (the running conversation history and a scratchpad) and turn-level pieces (the examples and trajectory for the task at hand). The distinction matters not for tidiness but because each granularity has its own failure mode, so each needs its own update rule (see "How should agent memory split across time scales?").
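The four-part split can be pictured as a small data structure. This is a minimal sketch, not RAISE's actual implementation: the class names, field names, and the `start_turn` method are hypothetical, chosen only to show that the dialogue scope accumulates and stays revisable while the turn scope is rebuilt for each task.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueMemory:
    """Lives for the whole conversation; entries stay revisable."""
    history: list[str] = field(default_factory=list)
    scratchpad: dict[str, str] = field(default_factory=dict)


@dataclass
class TurnMemory:
    """Rebuilt at the start of every turn and discarded afterwards."""
    examples: list[str] = field(default_factory=list)
    trajectory: list[str] = field(default_factory=list)


@dataclass
class WorkingMemory:
    # Hypothetical container pairing the two scopes.
    dialogue: DialogueMemory = field(default_factory=DialogueMemory)
    turn: TurnMemory = field(default_factory=TurnMemory)

    def start_turn(self, user_message: str, examples: list[str]) -> None:
        # Dialogue scope: append to the running history.
        self.dialogue.history.append(user_message)
        # Turn scope: replace wholesale so nothing leaks from the last task.
        self.turn = TurnMemory(examples=list(examples))


memory = WorkingMemory()
memory.start_turn("Book a flight to Oslo", examples=["flight-booking demo"])
memory.turn.trajectory.append("searched flights")
memory.dialogue.scratchpad["destination"] = "Oslo"

memory.start_turn("Actually, make it Bergen", examples=[])
memory.dialogue.scratchpad["destination"] = "Bergen"  # revised, not write-once
```

After the second turn the trajectory is empty again, while the history holds both messages and the scratchpad holds the revised destination.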

The danger at the dialogue scope is premature commitment. When information arrives gradually, models lock onto an early guess and can't course-correct: accuracy falls from ~90% on single-shot instructions to ~65% across a natural multi-turn conversation, and agent mitigations claw back only 15–20% of the loss (see "Why do language models fail in gradually revealed conversations?" and "Why do AI assistants get worse at longer conversations?"). The implication for update rules: dialogue-scoped memory should stay revisable rather than write-once. The 20-questions regeneration test reinforces this from another angle. An LLM holds a superposition of possible characters and only samples one at generation, so treating any single turn's interpretation as a fixed fact about the whole dialogue is a mistake the architecture invites (see "Do large language models actually commit to a single character?").

The danger at the turn scope is the opposite: not over-commitment but over-spending. Unrestricted reasoning inside one turn burns the context budget that later retrieval rounds need, quietly degrading the whole session. The fix is a per-turn cap, not just an overall time limit: a hard update rule that bounds what any single turn is allowed to consume (see "Does limiting reasoning per turn improve multi-turn search quality?"). So the two scopes pull in different directions: dialogue memory wants to resist locking in, turn memory wants to resist sprawling out.
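A per-turn cap can be sketched as a budget object that tracks two counters at once. This is an illustrative sketch under stated assumptions: `ContextBudget`, `TurnBudgetExceeded`, and the token numbers are hypothetical, and token counts are passed in as plain integers rather than measured by a real tokenizer.

```python
class TurnBudgetExceeded(Exception):
    """Raised when a single turn tries to spend past its cap."""


class ContextBudget:
    # Hypothetical helper: one session-wide pool plus a hard per-turn cap.
    def __init__(self, session_tokens: int, per_turn_cap: int) -> None:
        self.session_remaining = session_tokens
        self.per_turn_cap = per_turn_cap
        self.turn_spent = 0

    def start_turn(self) -> None:
        # The per-turn counter resets; the session pool does not.
        self.turn_spent = 0

    def spend(self, tokens: int) -> None:
        if self.turn_spent + tokens > self.per_turn_cap:
            raise TurnBudgetExceeded(
                f"turn would use {self.turn_spent + tokens} tokens, "
                f"cap is {self.per_turn_cap}"
            )
        if tokens > self.session_remaining:
            raise TurnBudgetExceeded("session budget exhausted")
        self.turn_spent += tokens
        self.session_remaining -= tokens


budget = ContextBudget(session_tokens=8000, per_turn_cap=1000)

budget.start_turn()
budget.spend(600)          # first reasoning step fits
capped = False
try:
    budget.spend(600)      # a second step would exceed the 1000-token cap
except TurnBudgetExceeded:
    capped = True          # the turn must stop reasoning and move on

budget.start_turn()
budget.spend(600)          # the next turn starts with a fresh allowance
```

A session-wide limit alone would have allowed the second 600-token step, since 8000 tokens were still available; the per-turn cap is what stops one turn from starving later retrieval rounds.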

For the dialogue scope, the corpus also argues against the obvious update rule of "rewrite and compress." The ACE framework treats long-lived context as an evolving playbook updated through small generation–reflection–curation edits rather than full rewrites, specifically because wholesale compression causes brevity bias and detail erosion (see "Can context playbooks prevent knowledge loss during iteration?"). DeepAgent's memory folding makes the complementary point: consolidation works when it folds history into structured episodic, working, and tool schemas, and the structure is what lets compression happen without degradation (see "Can agents compress their own memory without losing critical details?"). And one provocative finding reframes the whole problem: the real bottleneck in long-context memory may not be capacity but the compute needed to transform evicted context into durable internal state, with quality improving as you spend more consolidation passes (see "Is long-context bottleneck really about memory or compute?").
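The difference between small edits and a full rewrite can be made concrete with a keyed playbook that only changes where a delta touches it. This is a loose sketch of the incremental-edit idea, not ACE's actual mechanism: `Delta`, `apply_deltas`, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    # Hypothetical edit record: op is "add", "revise", or "remove".
    op: str
    key: str
    text: str = ""


def apply_deltas(playbook: dict[str, str], deltas: list[Delta]) -> dict[str, str]:
    """Return a copy of the playbook with only the targeted entries changed."""
    updated = dict(playbook)
    for delta in deltas:
        if delta.op == "add":
            # Never overwrite an existing entry when adding.
            updated.setdefault(delta.key, delta.text)
        elif delta.op == "revise":
            # Revise only entries that already exist.
            if delta.key in updated:
                updated[delta.key] = delta.text
        elif delta.op == "remove":
            updated.pop(delta.key, None)
        else:
            raise ValueError(f"unknown op: {delta.op}")
    return updated


playbook = {
    "auth": "Refresh the token before each API call.",
    "paging": "Request 50 rows per page.",
}
deltas = [
    Delta("revise", "paging", "Request 100 rows per page."),
    Delta("add", "retries", "Back off on HTTP 429."),
]
new_playbook = apply_deltas(playbook, deltas)
```

Entries no delta mentions, such as `auth` here, are carried over word for word, which is the property a summarize-and-rewrite step cannot guarantee.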

The thing you might not have known you wanted to know: a good chunk of "memory" behavior is actually a missing training signal, not a missing update rule. Models learn what-to-do instructions but not what-to-ignore instructions, and fine-tuning on barely a thousand dialogues with distractor turns sharply improves their ability to hold a topic across a conversation (see "Why do language models engage with conversational distractors?"). In other words, dialogue-scoped memory partly governs itself through a learned filter on what deserves to persist: the update rule and the training objective are two views of the same lever.


Sources 9 notes

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Next inquiring lines