How can GUI agents adapt when software constantly changes?
Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This note explores how to avoid training obsolescence in open-world software environments.
The challenge Agent S targets is that GUI automation must work across a vast and constantly evolving universe of applications and websites. No fixed knowledge base survives — the agent must learn from open-world experience while still benefiting from domain-specific specialization. The proposed architecture answers this with a three-source planning method.
External: Online Web Knowledge provides up-to-date documentation about specific applications, allowing adaptation to software that has changed since training. This is the "look it up" channel — useful precisely because the open world drifts.
Internal-abstract: Narrative Memory stores high-level, abstractive task experiences from past interactions — the gestalt of how a kind of task plays out, used during top-level decomposition. Internal-concrete: Episodic Memory stores detailed, step-by-step subtask experience — retrieved during execution to refine specific actions in context.
The two-tier internal memory matters because complex desktop tasks span timescales: high-level decomposition needs abstract task patterns, but low-level execution needs concrete state-action sequences. Successful subtasks and full task experiences are evaluated by a self-evaluator and stored back, enabling continual improvement.
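The loop described above — three knowledge sources feeding hierarchical planning, with self-evaluated experiences written back to memory — can be sketched as follows. All names (`MemoryStore`, `plan_task`, `execute_subtask`) and the string-based stand-ins for LLM calls and similarity retrieval are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of experience-augmented hierarchical planning in the
# Agent S style. All names are illustrative stand-ins, not the paper's code.
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Keyed store; retrieval is exact-match here for simplicity
    (a real system would use embedding-similarity retrieval)."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, key: str) -> str:
        return self.entries.get(key, "")

    def save(self, key: str, value: str) -> None:
        self.entries[key] = value


def plan_task(task: str, web_knowledge: str, narrative: MemoryStore):
    """Top-level decomposition: fuse external web docs with abstract
    narrative experience (the planner LLM call is faked as string context)."""
    context = f"{web_knowledge} | {narrative.retrieve(task)}"
    return [f"{task}:step1", f"{task}:step2"], context


def execute_subtask(subtask: str, episodic: MemoryStore):
    """Low-level execution, refined by concrete step-by-step traces."""
    trace = episodic.retrieve(subtask) or "no prior trace"
    # A self-evaluator would judge success; we assume success here.
    return f"executed {subtask} using [{trace}]", True


def run(task: str, web_knowledge: str,
        narrative: MemoryStore, episodic: MemoryStore):
    subtasks, _ = plan_task(task, web_knowledge, narrative)
    results = []
    for st in subtasks:
        result, success = execute_subtask(st, episodic)
        if success:  # only self-evaluated successes are stored back
            episodic.save(st, f"trace-of-{st}")
        results.append(result)
    # Full-task experience summarized back into narrative memory.
    narrative.save(task, f"summary: {task} via {len(subtasks)} subtasks")
    return results
```

On a second run over the same task, `execute_subtask` retrieves the stored step trace instead of starting cold, and `plan_task` sees the narrative summary — the continual-improvement loop the note describes.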
The differentiation from prior RAG-for-agents work is precise: rather than retrieving exemplars or guidelines uniformly, this design uses task experience hierarchically — full task experience is summarized into an abstractive textual reward for subtask planning, while subtask experience is self-evaluated before storage. The implication is that GUI agents in open worlds need more than memory; they need stratified memory whose levels match the levels of the planning problem. The same paper introduces an Agent-Computer Interface (see the companion note "Can structured interfaces help language models control GUIs better?") as the perception-side counterpart to this memory architecture; together they illustrate that GUI agents need factoring at both the perception and memory layers.
Source: Tool Computer Use
Related concepts in this collection
- Can structured interfaces help language models control GUIs better?
  Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.
  complements: same paper, perception-side companion. ACI factors planning vs grounding; this note factors abstract vs concrete memory.
- How should multimodal agents organize their memory?
  Can organizing agent memory around entities and separating episodic events from semantic knowledge enable more natural, preference-aware assistance without constant clarification?
  extends: M3-Agent splits episodic vs semantic; Agent S splits narrative (gestalt patterns) vs episodic (step-level traces) — both argue memory must be stratified by abstraction level.
- Can reasoning systems maintain memory across multiple retrieval cycles?
  Does integrating evidence across iterative retrieval steps — rather than treating each step independently — help systems resolve contradictions and build coherent understanding in complex narratives?
  complements: ComoRAG's veridical/semantic stratification mirrors Agent S's narrative/episodic split — both target hierarchical memory for long-horizon problems.
- Does state-indexed memory outperform high-level workflow memory for web agents?
  Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.
  tension with: Agent S includes both narrative (high-level) and episodic (step-level) memory; PRAXIS argues only the state-action level matters and high-level workflow abstractions hurt. Agent S would predict its episodic layer suffices and the narrative layer is redundant for web execution — a testable disagreement.
- Why do planning and grounding pull against each other in agents?
  Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?
  extends: AutoGLM generalizes the planning-vs-grounding factoring; Agent S provides the memory-side instantiation matched to that factoring.
experience-augmented hierarchical planning combines external web knowledge with narrative and episodic memory — letting GUI agents adapt to open-world software change