Agentic and Multi-Agent Systems

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Note · 2026-05-03 · sourced from Tool Computer Use

Agent S's contribution is conceptual as much as engineering: it ports the Agent-Computer Interface (ACI) idea from coding agents to GUI agents. The motivating observation is that MLLMs handed raw screenshots are asked to do too much at once (identify icon semantics and predict the next action on a specific element, simultaneously), and empirically that is exactly where they fail.

The ACI is therefore designed to factor the problem. The dual-input strategy uses visual input to track environmental changes (what the screen looks like, what just happened) and pairs it with an image-augmented accessibility tree for precise element grounding (which element is which, and where). The action space is bounded to language-based primitives such as `click(element_id)`: narrow enough for an MLLM to apply common-sense reasoning reliably, broad enough to compose into complex tasks, and at a temporal resolution that lets the agent observe immediate task-relevant feedback after each action.
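A minimal sketch of what such a bounded, language-based action space might look like. All names here (`Element`, `Observation`, `click`, `type_text`, `hotkey`) are illustrative assumptions, not the paper's actual API; the point is that actions are grounded to IDs from the accessibility tree rather than raw pixels.

```python
# Hypothetical ACI-style interface: the agent observes a screenshot plus an
# ID-tagged accessibility tree, and may only emit a small set of
# language-based primitives over those element IDs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    element_id: int   # stable ID the MLLM grounds actions to
    role: str         # e.g. "button", "textbox" from the a11y tree
    name: str         # accessible label, e.g. "Save"

@dataclass
class Observation:
    screenshot_png: bytes      # visual input: what the screen looks like
    elements: list[Element]    # structured input: which element is which

# Bounded primitives: short language-like commands over element IDs,
# executed one at a time so feedback is observable after each step.
def click(element_id: int) -> str:
    return f"click({element_id})"

def type_text(element_id: int, text: str) -> str:
    return f"type({element_id}, {text!r})"

def hotkey(*keys: str) -> str:
    return "hotkey(" + "+".join(keys) + ")"

# Example: ground "press the Save button" against the tree, not raw pixels.
tree = [Element(0, "textbox", "File name"), Element(1, "button", "Save")]
save = next(e for e in tree if e.role == "button" and e.name == "Save")
print(click(save.element_id))  # → click(1)
```

The grounding step is a lookup over structured metadata, which is exactly the sub-task that raw-screenshot policies force the model to do implicitly.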

This factoring matches a deeper architectural choice: planning and grounding have distinct optimization requirements. Planning needs flexibility and error recovery. Grounding needs accuracy. Mixing them in a single end-to-end policy means each pulls against the other (see Why do planning and grounding pull against each other in agents?). The ACI's job is to be the abstraction layer that lets each concern be optimized separately.
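The factoring can be sketched as two components meeting at the interface. This is an illustrative assumption about the structure, not Agent S's actual classes: the planner reasons flexibly in natural language and can recover from errors, while the grounder only maps a named target to a concrete element ID, so its accuracy can be measured and improved independently.

```python
# Minimal sketch of planner/grounder separation behind an ACI abstraction
# layer (all class names are hypothetical).
from typing import Protocol

class Planner(Protocol):
    def next_step(self, goal: str, feedback: str) -> str: ...

class Grounder(Protocol):
    def resolve(self, step: str) -> int: ...

class ACI:
    """The abstraction layer: either side can be swapped or tuned separately."""
    def __init__(self, planner: Planner, grounder: Grounder):
        self.planner = planner
        self.grounder = grounder

    def act(self, goal: str, feedback: str) -> str:
        step = self.planner.next_step(goal, feedback)  # flexible, recoverable
        element_id = self.grounder.resolve(step)       # precise, measurable
        return f"click({element_id})"                  # bounded primitive

# Stubs standing in for an MLLM planner and an a11y-tree grounder.
class StubPlanner:
    def next_step(self, goal: str, feedback: str) -> str:
        return "click the Save button"

class StubGrounder:
    def resolve(self, step: str) -> int:
        return 1  # pretend the a11y tree maps "Save button" to ID 1

agent = ACI(StubPlanner(), StubGrounder())
print(agent.act("save the open file", ""))  # → click(1)
```

Because the two concerns meet only at the `act` boundary, an end-to-end policy's tug-of-war (flexibility versus accuracy) disappears: each component is optimized against its own objective.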

Empirically the design pays off — 9.37% absolute gain over the OSWorld baseline, plus generalization across operating systems on WindowsAgentArena. The transferable claim is that "look at the screen and act" is the wrong primitive for GUI agents at the current model frontier. The right primitive is a structured interface that hands the model what each cognitive sub-task actually needs.


Source: Tool Computer Use
