Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Imagine a future household robot that autonomously carries out household tasks without explicit instructions; to do so, it must have learned the operational rules of your home through daily experience. In the morning, it hands you a cup of coffee without asking "coffee or tea?", because it has gradually formed a memory of you, tracking your preferences and routines through long-term interaction. For a multimodal agent, achieving such a level of intelligence fundamentally relies on three capabilities: (1) continuously perceiving the world through multimodal sensors; (2) storing its experiences in long-term memory and gradually building knowledge about the environment; and (3) reasoning over this accumulated memory to guide its actions.
To achieve these goals, we propose M3-Agent, a novel multimodal agent framework equipped with long-term memory. As shown in Figure 1, it operates through two parallel processes: memorization, which continuously perceives real-time multimodal inputs to construct and update long-term memory; and control, which interprets external instructions, reasons over the stored memory, and executes the corresponding tasks. During memorization, M3-Agent processes the incoming video stream, capturing both fine-grained details and high-level abstractions by generating two types of memory, analogous to human cognitive systems [42, 43]:
• Episodic memory: Records concrete events observed within the video. For example, "Alice takes the coffee and says, ‘I can’t go without this in the morning,’" and "Alice throws an empty bottle into the green garbage bin."
• Semantic memory: Derives general knowledge from the clip. For example, "Alice prefers to drink coffee in the morning" and "The green garbage bin is used for recycling."
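The two memory types above can be illustrated with a minimal sketch. The `MemoryItem` class, its fields, and the clip identifiers are our own illustrative assumptions, not the paper's actual data schema; the example only shows how the same clip can yield both concrete episodic records and abstracted semantic ones.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    content: str   # natural-language description of the memory
    clip_id: str   # identifier of the source video clip
    kind: str      # "episodic" (concrete event) or "semantic" (general knowledge)

# Episodic memory: concrete events observed within the video.
episodic = [
    MemoryItem("Alice takes the coffee and says, 'I can't go without this "
               "in the morning.'", "clip_0012", "episodic"),
    MemoryItem("Alice throws an empty bottle into the green garbage bin.",
               "clip_0015", "episodic"),
]

# Semantic memory: general knowledge derived from the same clips.
semantic = [
    MemoryItem("Alice prefers to drink coffee in the morning.",
               "clip_0012", "semantic"),
    MemoryItem("The green garbage bin is used for recycling.",
               "clip_0015", "semantic"),
]
```

Note how each semantic item abstracts away from a single observation, while episodic items preserve the concrete event as seen.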
The generated memories are then stored in long-term memory, which supports multimodal information such as faces, voices, and textual knowledge. Moreover, the memory is organized in an entity-centric structure: information related to the same person (e.g., their face, voice, and associated knowledge) is connected in a graph, as shown in Figure 1. These connections are established incrementally as the agent extracts and integrates semantic memory.
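The entity-centric organization can be sketched as follows. This is a simplified assumption about the structure, not the paper's implementation: the class name, the use of string IDs for face and voice embeddings, and the `add_observation` interface are all hypothetical, and serve only to show how modalities and knowledge accumulate incrementally around one entity node.

```python
class EntityMemoryGraph:
    """Entity-centric store: each node links face IDs, voice IDs,
    and textual knowledge belonging to the same entity."""

    def __init__(self):
        # entity_id -> {"faces": set, "voices": set, "knowledge": list}
        self.entities = {}

    def add_observation(self, entity_id, face_id=None, voice_id=None, fact=None):
        # Create the entity node on first mention, then attach whatever
        # modality or knowledge this observation provides.
        node = self.entities.setdefault(
            entity_id, {"faces": set(), "voices": set(), "knowledge": []})
        if face_id is not None:
            node["faces"].add(face_id)
        if voice_id is not None:
            node["voices"].add(voice_id)
        if fact is not None:
            node["knowledge"].append(fact)

graph = EntityMemoryGraph()
# Connections are built incrementally as clips are processed:
# a first clip links a face and a voice to the same person...
graph.add_observation("alice", face_id="face_07", voice_id="voice_03")
# ...and a later clip attaches semantic knowledge to that same node.
graph.add_observation("alice", fact="Alice prefers coffee in the morning.")
```

The point of the entity-centric layout is that a later query about "Alice" can retrieve her face, voice, and accumulated knowledge through a single node rather than searching disconnected per-modality stores.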