OMNI-SIMPLEMEM: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

Paper · arXiv 2604.01007 · Published April 1, 2026
Autonomous Agents · Training / Fine-Tuning · Reinforcement Learning · Self-Refinement · Self-Consistency Feedback · Evolution

AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and too interconnected for manual exploration or traditional AutoML to cover effectively. We deploy an autonomous research pipeline to discover OMNI-SIMPLEMEM, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1 = 0.117 on LoCoMo), the pipeline autonomously executes ∼50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art results on both benchmarks, improving F1 by +411% on LoCoMo and +214% on Mem-Gallery relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at https://github.com/aiming-lab/SimpleMem.

Existing approaches to agent memory fall into two broad categories, each with notable limitations. The first stores raw inputs and retrieves them via embedding similarity (Lewis et al., 2020; Borgeaud et al., 2022), and suffers from storage bloat and retrieval noise as the memory grows. The second introduces structured memory management with explicit operations (Packer et al., 2023; Park et al., 2023), but typically operates on text alone, discarding rich visual and auditory signals. Crucially, both categories are products of manual research cycles: a human researcher hypothesizes an improvement, implements it, evaluates it on a benchmark, and iterates. A single researcher may explore only a handful of configurations per day, and important interactions between tightly coupled components are easily missed. Traditional AutoML methods (Hutter et al., 2019a) can search over predefined numerical hyperparameter spaces, but cannot perform the code comprehension, bug diagnosis, architectural redesign, and cross-component reasoning that account for the largest performance gains in complex systems. As a result, existing memory systems inherit the blind spots of their designers, limitations that a more systematic search could avoid.

Recent work on autonomous scientific discovery (Lu et al., 2024; Romera-Paredes et al., 2024; Panfilov et al., 2026) has shown that LLM agents can autonomously discover novel algorithms that outperform human-designed baselines, provided the target domain admits well-defined, quantitative evaluation signals. We ask whether this paradigm extends to complex, multi-component AI systems and answer affirmatively. We deploy AUTORESEARCHCLAW (Liu et al., 2026b), a 23-stage autonomous research pipeline, to discover OMNI-SIMPLEMEM, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1 = 0.117 on LoCoMo), the pipeline autonomously executes ∼50 experiments across two benchmarks, iteratively diagnosing failure modes, proposing architectural modifications, repairing data pipeline bugs, and validating improvements, all without human intervention in the inner loop. The resulting system achieves state-of-the-art on both benchmarks, improving F1 by +411% on LoCoMo (0.117→0.598) and +214% on Mem-Gallery (0.254→0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML.

Among the pipeline’s most consequential discoveries are three architectural principles that define OMNI-SIMPLEMEM. First, selective ingestion: lightweight perceptual encoders measure the information novelty of each incoming signal and discard redundant content before storage, significantly reducing storage requirements. Second, unified representation: all memories, regardless of modality, are represented as Multimodal Atomic Units (MAUs) that separate lightweight metadata from heavy raw data, enabling fast search over compact metadata while preserving full-content access on demand. Third, progressive retrieval: a pyramid mechanism expands information in three stages (summaries, details, raw evidence), each gated by a token budget, backed by hybrid search that combines dense vector retrieval with sparse keyword matching via set-union merging, a combination the pipeline discovered autonomously. Our key observation is that multimodal memory is particularly well-suited for autonomous research pipelines due to four properties: immediate scalar evaluation metrics enabling tight optimization loops, a modular architecture allowing isolated component modification, fast iteration cycles (1–2 hours per experiment) supporting dozens of hypotheses within days, and version-controlled code modifications allowing failed experiments to be cleanly reverted.
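
To make the second and third principles concrete, the sketch below shows one way an MAU, the set-union hybrid retrieval, and the budget-gated pyramid expansion could fit together. All field names, helper functions, and budget values here are illustrative assumptions for exposition, not the authors' implementation.

# Minimal sketch of an MAU, set-union hybrid retrieval, and budget-gated expansion.
# Field names, helpers, and budgets are assumptions, not the paper's code.
import math
from dataclasses import dataclass

@dataclass
class MAU:
    """Multimodal Atomic Unit: lightweight metadata kept apart from heavy raw data."""
    uid: str
    modality: str        # e.g. "text", "image", "audio"
    summary: str         # compact metadata searched during retrieval
    keywords: frozenset  # sparse features for keyword matching
    embedding: list      # dense vector for similarity search
    raw_ref: str         # pointer to the full-content payload, loaded only on demand

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_embedding, query_terms, memory, k=5):
    """Dense and sparse candidate lists merged by set union, deduplicated by uid."""
    dense = sorted(memory, key=lambda m: -cosine(query_embedding, m.embedding))[:k]
    sparse = sorted(memory, key=lambda m: -len(query_terms & m.keywords))[:k]
    merged = {m.uid: m for m in dense + sparse}
    return list(merged.values())

def progressive_context(candidates, render_stages, budgets=(256, 1024, 4096)):
    """Pyramid expansion: each stage adds richer content until its token budget is spent."""
    context, used = [], 0
    for budget, render in zip(budgets, render_stages):
        for mau in candidates:
            piece = render(mau)
            cost = len(piece.split())  # crude token proxy, sufficient for the sketch
            if used + cost > budget:
                break
            context.append(piece)
            used += cost
    return context

A caller would supply the three render stages, for example a summary accessor plus two loaders that dereference raw_ref for details and raw evidence, so the later, more expensive stages are consulted only when the earlier budgets leave room.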