How should multimodal agents organize their memory?
Can organizing agent memory around entities and separating episodic events from semantic knowledge enable more natural, preference-aware assistance without constant clarification?
M3-Agent (2508.09736) proposes a multimodal agent framework where long-term memory is organized as an entity-centric graph, with two types of memory generated from continuous video-stream perception:
Episodic memory records concrete events: "Alice takes the coffee and says, 'I can't go without this in the morning.'" Semantic memory derives general knowledge: "Alice prefers to drink coffee in the morning." Information about the same entity — face, voice, textual knowledge — is connected in graph format, incrementally established as the agent extracts and integrates semantic memory.
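The episodic/semantic split around entity nodes can be pictured with a small data structure. This is a minimal sketch, not M3-Agent's implementation: the `Entity` and `EntityMemory` classes, their fields, and identifiers such as `face_0` and `voice_3` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One node per real-world entity (hypothetical structure)."""
    entity_id: str
    face_ids: set = field(default_factory=set)    # face identifiers seen for this entity
    voice_ids: set = field(default_factory=set)   # voice identifiers heard for this entity
    episodic: list = field(default_factory=list)  # concrete observed events
    semantic: list = field(default_factory=list)  # general knowledge derived from events


class EntityMemory:
    """Long-term memory keyed by entity, built up incrementally."""

    def __init__(self):
        self.entities = {}

    def get_or_create(self, entity_id):
        if entity_id not in self.entities:
            self.entities[entity_id] = Entity(entity_id)
        return self.entities[entity_id]

    def add_episode(self, entity_id, event):
        self.get_or_create(entity_id).episodic.append(event)

    def add_semantic(self, entity_id, fact):
        entity = self.get_or_create(entity_id)
        if fact not in entity.semantic:  # avoid storing the same fact twice
            entity.semantic.append(fact)

    def find_by_face(self, face_id):
        """Return the entity that owns this face identifier, or None."""
        for entity in self.entities.values():
            if face_id in entity.face_ids:
                return entity
        return None


memory = EntityMemory()
alice = memory.get_or_create("alice")
alice.face_ids.add("face_0")
alice.voice_ids.add("voice_3")
memory.add_episode(
    "alice",
    "Alice takes the coffee and says, 'I can't go without this in the morning.'",
)
memory.add_semantic("alice", "Alice prefers to drink coffee in the morning.")
```

Keeping both lists on the same node means a face match retrieves the general preference and the specific event that supports it in one lookup.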
The architecture runs two parallel processes: (1) memorization, which continuously perceives real-time multimodal inputs to construct and update long-term memory; and (2) control, which interprets external instructions, reasons over stored memory, and executes tasks. This dual-process design means the agent can hand you coffee without asking "coffee or tea?" — it has already formed a memory of your preferences through observation.
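The two-process split can be sketched with a background thread and a shared store. This is an illustrative toy under stated assumptions, not the paper's system: the `memorize` and `control` functions, the `observations` queue standing in for the perception stream, and the keyword match on "coffee" are all hypothetical simplifications.

```python
import queue
import threading

memory = {}                     # entity name -> list of semantic facts
memory_lock = threading.Lock()  # guards memory, which both processes touch
observations = queue.Queue()    # stand-in for the perceived input stream


def memorize():
    """Memorization process: consume observations and update long-term memory."""
    while True:
        item = observations.get()
        if item is None:        # sentinel value: stop the worker
            observations.task_done()
            break
        entity, fact = item
        with memory_lock:
            memory.setdefault(entity, []).append(fact)
        observations.task_done()


def control(instruction, entity):
    """Control process: carry out an instruction by reasoning over stored memory."""
    with memory_lock:
        facts = list(memory.get(entity, []))
    if instruction == "bring a drink":
        for fact in facts:
            if "coffee" in fact:
                return "coffee"
        return "ask: coffee or tea?"  # no stored preference, so clarification is needed
    return "unsupported instruction"


worker = threading.Thread(target=memorize, daemon=True)
worker.start()
observations.put(("alice", "Alice prefers to drink coffee in the morning."))
observations.join()             # wait until the observation has been memorized

choice_for_alice = control("bring a drink", "alice")
choice_for_bob = control("bring a drink", "bob")

observations.put(None)          # shut the memorization worker down
worker.join()
```

In this sketch the request for Alice resolves without a clarifying question because memorization already ran, while the request for Bob falls back to asking.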
The entity-centric graph structure is the key architectural choice. Unlike flat memory stores or conversation-history retrieval, entity-centric organization enables cross-modal association: a person's face links to their voice, which links to their preferences. This parallels the question raised in Does abstract preference knowledge outperform specific interaction recall?, but M3-Agent captures both episodic and semantic layers and connects them through entity nodes rather than discarding one.
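Cross-modal association amounts to graph reachability: starting from any modality node, a traversal through the entity node reaches the knowledge attached to it. The sketch below assumes a hypothetical node-naming scheme (`face:`, `voice:`, `entity:`, `semantic:` prefixes) and a plain breadth-first search; the actual retrieval mechanism in M3-Agent may differ.

```python
from collections import deque

# Undirected edges between typed nodes; all node names are hypothetical.
edges = [
    ("face:face_0", "entity:alice"),
    ("voice:voice_3", "entity:alice"),
    ("entity:alice", "semantic:prefers coffee in the morning"),
    ("face:face_1", "entity:bob"),
]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def reachable(start, prefix):
    """Breadth-first search: nodes reachable from start whose name begins with prefix."""
    seen = {start}
    frontier = deque([start])
    found = []
    while frontier:
        node = frontier.popleft()
        if node != start and node.startswith(prefix):
            found.append(node)
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return sorted(found)


from_face = reachable("face:face_0", "semantic:")
from_voice = reachable("voice:voice_3", "semantic:")
```

Seeing Alice's face and hearing her voice lead to the same preference node, which a flat store keyed by modality would not provide without an extra join.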
The dual episodic/semantic distinction also echoes the hierarchical knowledge source in Can reasoning systems maintain memory across multiple retrieval cycles?, where ComoRAG builds veridical, semantic, and episodic layers — but M3-Agent applies this to continuous multimodal perception rather than text retrieval.
In light of How should agents decide what memories to keep?, M3-Agent's memorization process operates as continuous implicit memory: it is always running and always extracting, rather than waiting for explicit recognition of importance.
Source: Memory
Original note title
multimodal agents require entity-centric memory graphs that separate episodic events from semantic knowledge — parallel memorization and control processes mirror human cognitive architecture