Should GUI agents use intermediate structured representations instead of raw pixels?
This explores whether agents that operate software interfaces should work from parsed, semantic descriptions of the screen rather than directly from raw screenshots — and the corpus suggests the more interesting answer is that the best systems don't choose, they layer.
The corpus leans clearly toward "yes, structure helps," while complicating *why*. The core finding is that asking a vision-language model to do two jobs at once (work out what each icon means *and* decide what to click) overloads it. OmniParser showed that GPT-4V fails on raw screenshots precisely because of this composite burden; pre-parsing the screen into labeled semantic elements lets the model spend all its effort on the actual decision (see "Why do vision-only GUI agents struggle with screen interpretation?"). So the case for structure isn't really about pixels versus text; it's about *separating tasks that fight each other when fused*.
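To make the pre-parsing idea concrete, here is a minimal Python sketch. The `ScreenElement` schema and both helper functions are hypothetical illustrations, not OmniParser's actual output format or API. The point is only that the model sees a short numbered list of labeled elements and answers with an id, which is mapped back to pixels outside the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenElement:
    """One parsed UI element (hypothetical schema for illustration)."""
    element_id: int
    kind: str                          # e.g. "icon", "button", "text"
    description: str                   # semantic label from the parsing step
    bbox: tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def render_for_prompt(elements: list[ScreenElement]) -> str:
    """Turn parsed elements into a numbered text list the model can pick from."""
    return "\n".join(
        f"[{e.element_id}] {e.kind}: {e.description}" for e in elements
    )


def click_point(elements: list[ScreenElement], chosen_id: int) -> tuple[int, int]:
    """Map the model's chosen element id back to a pixel (the bbox center)."""
    by_id = {e.element_id: e for e in elements}
    left, top, right, bottom = by_id[chosen_id].bbox
    return ((left + right) // 2, (top + bottom) // 2)


# Assumed example data: in a real system this would come from a screen parser.
elements = [
    ScreenElement(0, "icon", "settings gear", (10, 10, 42, 42)),
    ScreenElement(1, "button", "Save document", (100, 200, 180, 230)),
]
prompt_block = render_for_prompt(elements)
```

With this split, icon interpretation happens once in the parser, and the model's only remaining job is choosing an id such as `1`.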
That separation principle is where the corpus gets interesting, because multiple independent systems converged on it. Agent S, AutoGLM, and OmniParser all landed on splitting an agent into a *planning* layer and a *grounding* layer, mediated by a language-centric interface, because planning and grounding have opposing optimization needs and shouldn't be jammed into one end-to-end prediction (see "How should agents split planning from visual grounding?" and "Can structured interfaces help language models control GUIs better?"). The structured representation (accessibility trees, parsed elements) is essentially the seam that lets each layer be optimized on its own terms. The win isn't "text beats pixels," it's "give the model one thing to think about at a time."
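The planning/grounding split can be sketched as two functions joined by a language-level step. Everything here is an assumption for illustration (the `Step` and `GroundedAction` types, the toy planner and grounder, and the `LOCATIONS` lookup); none of it is taken from Agent S, AutoGLM, or OmniParser.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """Language-level instruction emitted by the planner (hypothetical)."""
    action: str      # e.g. "click", "type"
    target: str      # natural-language description of the element
    text: str = ""


@dataclass(frozen=True)
class GroundedAction:
    """The same step after grounding: tied to a concrete pixel location."""
    action: str
    x: int
    y: int
    text: str = ""


Planner = Callable[[str], list[Step]]
Grounder = Callable[[Step], GroundedAction]


def run(goal: str, plan: Planner, ground: Grounder) -> list[GroundedAction]:
    """Plan in language first, then ground each step independently."""
    return [ground(step) for step in plan(goal)]


# Toy stand-ins; a real planner and grounder would be separate models.
def toy_planner(goal: str) -> list[Step]:
    return [Step("click", "search box"), Step("type", "search box", goal)]


LOCATIONS = {"search box": (320, 48)}  # assumed output of a screen parser


def toy_grounder(step: Step) -> GroundedAction:
    x, y = LOCATIONS[step.target]
    return GroundedAction(step.action, x, y, step.text)


actions = run("weather", toy_planner, toy_grounder)
```

Because `run` only depends on the two callables, either layer can be swapped or retrained without touching the other, which is the practical meaning of the "seam."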
But there's a sharp counter-voice. ShowUI argues that text-based representations like HTML and accessibility trees *miss what humans actually see* on screen, and that real interface navigation needs purpose-built vision-language-action models, not general multimodal models bolted onto a parser (see "Do text-based GUI agents actually work in the real world?"). So structure can also throw away information. The reconciliation most of these systems reach is *both/and*: Agent S feeds visual input for environmental understanding *plus* image-augmented accessibility trees for grounding, rather than picking one (see "Can structured interfaces help language models control GUIs better?").
The most radical answer in the corpus is to skip the GUI entirely. The AXIS framework shows that when agents call APIs instead of clicking through interfaces, task completion time drops 65–70% while accuracy stays at 97–98%, and the system can auto-discover APIs hidden inside existing apps (see "Can API-first agents outperform UI-based agent interaction?"). A GUI is, after all, a representation designed for human eyes and hands. If an agent doesn't have those constraints, the screenshot itself may be the unnecessary intermediate layer. This reframes the whole question: "structured representation vs. pixels" is a debate that only matters once you've decided the agent must go through the GUI at all.
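A minimal sketch of the API-first idea, assuming a simple in-process registry: the agent tries a registered API for a task and only falls back to stepping through the UI when none exists. The names `API_REGISTRY`, `register_api`, `execute`, and the sample tasks are hypothetical and do not reflect how AXIS itself is implemented.

```python
from typing import Callable

# Hypothetical registry mapping task names to directly callable APIs.
API_REGISTRY: dict[str, Callable[..., str]] = {}


def register_api(task_name: str):
    """Decorator that records a function as the API for a named task."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        API_REGISTRY[task_name] = fn
        return fn
    return decorator


def execute(task: str, ui_fallback: Callable[..., str], **kwargs) -> tuple[str, str]:
    """Prefer an API call; fall back to UI interaction if none is registered."""
    api = API_REGISTRY.get(task)
    if api is not None:
        return ("api", api(**kwargs))
    return ("ui", ui_fallback(task, **kwargs))


@register_api("insert_table")
def insert_table(rows: int, cols: int) -> str:
    return f"table {rows}x{cols} inserted"


def click_through_ui(task: str, **kwargs) -> str:
    # Stand-in for a multi-step GUI routine (screenshots, clicks, typing).
    return f"performed '{task}' via UI steps"


via_api = execute("insert_table", click_through_ui, rows=2, cols=3)
via_ui = execute("change_theme", click_through_ui)
```

Keeping the UI path as a fallback means the GUI question from the earlier sections still applies, but only to tasks that no API covers.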
If you want to zoom out further, the GUI debate is one instance of a broader pattern in agent design: reliability tends to come from *externalizing* hard sub-problems into structured scaffolding rather than asking a bigger model to solve everything internally, with memory, skills, and interaction protocols pushed into a harness layer (see "Where does agent reliability actually come from?"). The parsed screen is exactly this move applied to perception: don't make the model re-derive the interface every step; hand it structure. That's the deeper reason intermediate representations keep winning.
Sources (6 notes)
OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
Multiple independent systems (Agent S, AutoGLM, OmniParser) converged on factoring agent reasoning into a planning layer and a grounding layer, with a language-centric Agent-Computer Interface mediating between them due to their opposing optimization requirements.
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
ShowUI demonstrates that GUI agents need end-to-end vision-language-action models with UI-aware token selection and interleaved streaming, not adapted general-purpose MLLMs. Standard multimodal models lack the grounding and action capabilities real interface navigation demands.
The AXIS framework shows that prioritizing API calls over sequential UI interactions cuts task completion time by 65–70% while maintaining 97–98% accuracy and reducing cognitive workload by 38–53%. A self-exploration mechanism automatically discovers and constructs APIs from existing applications, solving the bootstrapping problem.
Research shows reliable LLM agents externalize three cognitive burdens (memory for state persistence, skills as procedural components, and protocols for structured interaction) into a harness layer rather than relying on model scale alone. The harness unifies these externalized functions and eliminates the need for the model to solve the same problems repeatedly.