Do text-based GUI agents actually work in the real world?
Can language-only agents that rely on HTML or accessibility trees handle real user interfaces when no structured metadata is available? This matters because deployed systems face visual screenshots, not oracle data.
ShowUI's framing critique is that the dominant GUI agent paradigm, language-based agents calling closed-source APIs with text-rich meta-information such as HTML or accessibility trees, assumes oracle access that real-world deployment does not have. Users interact with interfaces visually through screenshots, without the underlying structural information that text-based agents depend on. The text-only approach is therefore architecturally limited regardless of model scale.
But GUI visual perception is not a problem that natural-image MLLMs solve well. UI tasks need specialized capabilities, such as element grounding and action execution, rather than the conversational abilities multimodal chatbots are tuned for. ShowUI proposes three innovations to address the resulting gaps.
UI-Guided Visual Token Selection treats a screenshot as a UI-connected graph and adaptively identifies redundant relationships between patches, using them as criteria for token selection during self-attention. This reduces compute by exploiting the fact that screenshots are not natural images: large portions are visually redundant (background, repeated elements), and the connectivity structure of UI components encodes which tokens actually carry information.
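To make the selection concrete, here is a minimal sketch in Python of the graph-building intuition: adjacent patches with near-identical mean color are merged into components via union-find, and one representative token per component is retained. The patch size, tolerance, and mean-color signature are illustrative assumptions, not ShowUI's exact construction, which applies the selection inside self-attention rather than as a hard pre-filter.

```python
import numpy as np

def select_ui_tokens(image: np.ndarray, patch: int = 28, tol: float = 2.0):
    """Sketch of UI-guided token selection: group visually near-identical
    adjacent patches into components, then keep one token per component.
    `patch` and `tol` are illustrative hyperparameters, not ShowUI's."""
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    # Mean color per patch as a cheap redundancy signature (assumption).
    feats = image[:gh * patch, :gw * patch].astype(np.float32)
    feats = feats.reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))

    # Union-find over the patch grid.
    parent = list(range(gh * gw))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Connect horizontally and vertically adjacent redundant patches.
    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(feats[i, j] - feats[i, j + 1]).max() < tol:
                union(idx, idx + 1)
            if i + 1 < gh and np.abs(feats[i, j] - feats[i + 1, j]).max() < tol:
                union(idx, idx + gw)

    # One representative visual token index per connected component.
    return sorted({find(k) for k in range(gh * gw)})
```

In the paper's variant the components serve as criteria for skipping or sampling tokens during self-attention across training steps; the sketch shows only how a UI-connected graph identifies redundancy.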
Interleaved Vision-Language-Action Streaming unifies the diverse needs of GUI tasks: it manages visual-action history during navigation, and it pairs multi-turn query-action sequences with a single screenshot to improve training efficiency. Treating vision, language, and action as a single interleaved stream is more flexible than the staged pipelines that dominate prior work.
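As a rough illustration, the stream can be modeled as a flat sequence over three element types. The type names, fields, and action vocabulary below are assumptions for the sketch, not ShowUI's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class Screenshot:
    image_path: str  # one screenshot, encoded into visual tokens once

@dataclass
class Query:
    text: str        # user instruction or sub-goal

@dataclass
class Action:
    op: str                                          # e.g. "CLICK", "TYPE"
    value: str = ""                                  # typed text, if any
    position: Optional[Tuple[float, float]] = None   # normalized (x, y)

# A training episode is one interleaved stream of all three modalities.
Stream = List[Union[Screenshot, Query, Action]]

# Grounding-style sample: several query-action pairs share one screenshot,
# so the expensive visual tokens pay for multiple supervision signals.
grounding: Stream = [
    Screenshot("settings_page.png"),
    Query("Open Wi-Fi settings"),   Action("CLICK", position=(0.21, 0.34)),
    Query("Enable airplane mode"),  Action("CLICK", position=(0.21, 0.48)),
]

# Navigation-style sample: screenshots and actions alternate as history.
navigation: Stream = [
    Query("Buy the cheapest USB-C cable"),
    Screenshot("step_1.png"), Action("CLICK", position=(0.50, 0.08)),
    Screenshot("step_2.png"), Action("TYPE", value="USB-C cable"),
]
```

The same stream format covers both regimes, which is what lets one model train on grounding and navigation data without separate pipelines.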
Small-scale, high-quality GUI instruction datasets result from careful curation and resampling against type imbalance; this is the data-side intervention that lets the architectural innovations actually train.
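A common recipe for the resampling step is inverse-frequency weighting over annotation types, sketched below; the field name `element_type` is an assumption, and ShowUI's exact curation pipeline may differ:

```python
import random
from collections import Counter

def resample_balanced(samples, type_key="element_type", seed=0):
    """Inverse-frequency resampling: frequent types (e.g. plain text
    elements) are down-weighted so rare ones (e.g. icons) are not
    drowned out. A generic sketch, not ShowUI's exact recipe."""
    rng = random.Random(seed)
    counts = Counter(s[type_key] for s in samples)
    weights = [1.0 / counts[s[type_key]] for s in samples]
    return rng.choices(samples, weights=weights, k=len(samples))
```

Sampling with replacement keeps the dataset size fixed while flattening the type distribution, matching the small-but-balanced goal.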
The implication is that GUI visual agents are not a special case of multimodal models: they are a domain where the visual prior, the action vocabulary, and the data distribution all need to be UI-shaped from the start. This is the strong end-to-end position, and it creates a tension with the perception-factoring camp: "Why do vision-only GUI agents struggle with screen interpretation?" and "Can structured interfaces help language models control GUIs better?" both keep the foundation MLLM general-purpose and add a structured perception layer. ShowUI argues the perception layer should be inside the model, made UI-shaped end-to-end. The two camps disagree on whether to factor perception out or train it in.
Source: Visual GUI Agents
Related concepts in this collection
- Why do vision-only GUI agents struggle with screen interpretation?
  Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks apart improves reliability.
  contradicts: OmniParser argues for factoring perception OUT of the foundation model with a pre-processing parser; ShowUI argues for building perception IN with UI-specialized VLA models. Same problem, opposite architectural answer.
- Can structured interfaces help language models control GUIs better?
  Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. This matters because current end-to-end approaches ask models to do too much at once.
  tension with: Agent S uses the accessibility tree as a grounding aid alongside vision; ShowUI argues accessibility-tree dependence is the architectural ceiling that must be removed for real-world deployment.
- Can unlabeled UI video teach models what users intend?
  Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.
  complements: UI-JEPA is the self-supervised pretraining recipe that ShowUI's UI-specialized VLA approach depends on; UI-shaped perception needs UI-shaped pretraining.
- Why do planning and grounding pull against each other in agents?
  Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?
  complicates: AutoGLM's intermediate-interface argument depends on factoring; ShowUI suggests the factoring may be a sub-optimal compromise that better UI-shaped models will eventually obviate.
- Do generated interfaces outperform text-based chat for most tasks?
  Explores whether LLMs should create interactive UIs instead of text responses, and under what conditions users prefer dynamic interfaces to traditional conversational chat.
  connects: if interfaces become generative and dynamic, the case for UI-shaped end-to-end vision strengthens, since accessibility trees won't exist for novel generated interfaces.
Original note title: text-based GUI agents using HTML or accessibility trees miss what humans actually see — visual perception is required for real-world deployment but demands UI-specialized vision-language-action models