AutoGLM: Autonomous Foundation Agents for GUIs
While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interaction, built by applying reinforcement learning on top of existing models. Focusing on the Web Browser and Phone as representative GUI scenarios, we have developed AUTOGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights. First, the design of an appropriate "intermediate interface" for GUI control is crucial: it separates planning from grounding behaviors, which require distinct optimization for flexibility and accuracy, respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning for AUTOGLM.
However, the development of foundation agents for GUIs faces a critical challenge: the scarcity of decision-making data in existing pre-training corpora. While the internet contains vast human knowledge, it consists primarily of static information that inadequately captures human decision-making and environmental interaction. Building capable foundation agents requires enriching them with dynamic knowledge, either through direct interaction with real-world environments or through learning from synthesized trajectories. Such foundation agents can then self-evolve in the digital world, iteratively improving toward genuine general intelligence.
Crucially, these systems must be developed with progressive user deployment in mind. Autonomous agents are designed to augment, not replace, human capabilities. User deployment serves the dual purpose of teaching agents effective human assistance while allowing humans to adapt to intelligent assistants. This approach also enables researchers to systematically understand, discover, and examine both the potential benefits and risks of autonomous foundation agents during development.
In response to these opportunities and challenges, we introduce AUTOGLM, a series of foundation agents built upon the ChatGLM [11] model family. AUTOGLM represents a pioneering attempt to develop foundation agent prototypes for two fundamental GUI scenarios: Web Browser and Android. To address the data scarcity challenge, we employ a comprehensive suite of training techniques and develop key infrastructures for user deployment. This process has yielded two crucial insights:
• Intermediate Interface Design: We find it essential to design an intermediate interface that disentangles planning and grounding behaviors in foundation GUI agents. The two present distinct requirements – planning demands flexibility and error recovery, while grounding emphasizes action accuracy. Separating them enables more agile development and enhanced performance.
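The separation above can be illustrated with a minimal sketch. The class and function names (`IntermediateAction`, `plan`, `ground`) are purely illustrative assumptions, not AUTOGLM's actual API: a planner model emits a language-level action over the intermediate interface, and a separately optimized grounder resolves it to an executable, coordinate-level action.

```python
from dataclasses import dataclass

@dataclass
class IntermediateAction:
    """Language-level action emitted by the planner, e.g. 'click the search box'."""
    verb: str          # e.g. "click", "type"
    target: str        # natural-language description of the UI element
    text: str = ""     # optional payload for typing actions

@dataclass
class GroundedAction:
    """Executable action resolved by the grounder against the current screen."""
    verb: str
    x: int             # pixel coordinates of the resolved element
    y: int
    text: str = ""

def plan(task: str, observation: str) -> IntermediateAction:
    # In a real system a planning model proposes the next step in
    # language terms; a fixed stub stands in for the model call here.
    return IntermediateAction(verb="click", target="search box")

def ground(action: IntermediateAction, screen_elements: dict) -> GroundedAction:
    # A grounding model maps the described target to on-screen coordinates;
    # here a lookup table stands in for visual grounding.
    x, y = screen_elements[action.target]
    return GroundedAction(verb=action.verb, x=x, y=y, text=action.text)

elements = {"search box": (320, 48)}
step = plan("check the weather", observation="<home screen>")
executable = ground(step, elements)
```

Because the planner never emits coordinates and the grounder never reasons about task progress, each side can be trained and evaluated on its own objective.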
• Self-Evolving Online Curriculum RL [30]: We recognize that error recovery [23] is crucial for robust and deployable agent applications, yet it remains difficult to acquire through offline training alone. Additionally, the shortage of instructions and trajectories impedes training progress. We address this challenge through self-evolving RL, implemented online with a progressive weak-to-strong curriculum schedule.
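The weak-to-strong schedule can be sketched as a toy loop. Everything here is an assumption for illustration – the success model, step sizes, and the scalar `skill` standing in for a policy update are stand-ins, not the paper's actual algorithm: the agent trains online on tasks it can currently solve, and the curriculum difficulty self-evolves upward as it succeeds.

```python
import random

def rollout(skill: float, difficulty: float) -> bool:
    """Attempt one task online; success is likelier when skill exceeds difficulty."""
    return random.random() < max(0.05, skill - difficulty + 0.5)

def train(num_rounds: int = 200, seed: int = 0) -> float:
    random.seed(seed)
    skill, difficulty = 0.1, 0.1          # start from weak (easy) tasks
    for _ in range(num_rounds):
        if rollout(skill, difficulty):
            skill += 0.01                 # reinforce on successful trajectories
            difficulty = min(1.0, difficulty + 0.005)  # self-evolve the curriculum
        # failed trajectories could additionally supervise error recovery;
        # that branch is omitted in this sketch
    return skill

final_skill = train()
```

The key property is the feedback loop: successes both update the policy and raise task difficulty, so the agent generates its own progressively harder training distribution rather than relying on a fixed offline dataset.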