ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying their redundant relationships, and using these as the criterion for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances.
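To make the token-selection idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes screenshot patches are linked into connected components when neighbouring patches share a near-identical mean colour (via union-find), and that only a few tokens per redundant component are kept for self-attention. Patch size, the colour threshold, and all function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of UI-guided visual token selection: patches with
# near-identical colour are treated as redundant, linked into connected
# components, and only a subset of tokens per component is retained.

def build_ui_components(image, patch=28, tol=1.0):
    """Split the screenshot into patch tokens and union neighbouring patches
    whose mean RGB differs by less than `tol` (assumed threshold)."""
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    means = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))

    parent = np.arange(gh * gw)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(means[i, j] - means[i, j + 1]).max() < tol:
                union(idx, i * gw + j + 1)       # link right neighbour
            if i + 1 < gh and np.abs(means[i, j] - means[i + 1, j]).max() < tol:
                union(idx, (i + 1) * gw + j)     # link bottom neighbour

    return np.array([find(k) for k in range(gh * gw)])  # component id per token


def select_tokens(component_ids, keep_per_component=1, rng=None):
    """Randomly keep `keep_per_component` tokens from each component; the
    surviving indices can then mask redundant tokens in self-attention."""
    rng = rng or np.random.default_rng(0)
    keep = []
    for cid in np.unique(component_ids):
        members = np.flatnonzero(component_ids == cid)
        if len(members) <= keep_per_component:
            keep.extend(members.tolist())
        else:
            keep.extend(rng.choice(members, keep_per_component, replace=False).tolist())
    return sorted(keep)


if __name__ == "__main__":
    screenshot = np.zeros((1120, 1120, 3), dtype=np.float32)  # toy blank screenshot
    comp = build_ui_components(screenshot)
    kept = select_tokens(comp)
    print(f"{len(comp)} patch tokens -> {len(kept)} kept after selection")
```

On a largely uniform screenshot region, most patches collapse into a single component, so the number of visual tokens passed to attention drops sharply, which is the motivation for the selection step.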
Graphical User Interfaces (GUIs) are central to how individuals engage with the digital world, serving as a virtual embodied interface for a range of daily activities. Meanwhile, Large Language Models (LLMs) [32], with their ability to comprehend complex language instructions and seamlessly integrate tools, have shown significant potential in performing complex tasks through building agents [1, 13, 16, 56]. This progress inspires the development of intelligent GUI agents that can significantly streamline human workflows based on user intentions.
However, the text-only approach is limited in real-world applications, where users typically interact with user interfaces visually, through screenshots, without access to the underlying structural oracle information.
However, GUI visual perception presents unique challenges compared to natural image processing, requiring specialized skills such as UI element grounding or action execution rather than the conversational abilities typical of multi-modal chatbots. Recognizing this gap, researchers have begun training vision-language models to acquire these new abilities.
We found many cases where providing only the UI screenshot overlaid with bounding boxes and associated IDs can be misleading to GPT-4V. We argue this limitation stems from GPT-4V's constrained ability to simultaneously perform the composite tasks of identifying each icon's semantic information and predicting the next action on a specific icon box. This has also been observed by several other works [YYZ+23, ZGK+24]. To address this issue, we incorporate the local semantics of functionality into the prompt: for each icon detected by the interactable region detection model, we use a fine-tuned model to generate a description of its functionality, and for each text box, we use the detected text and its label.
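As an illustration of how such local semantics might be assembled into the prompt, the sketch below emits one descriptive line per detected icon and text box; the class and function names are hypothetical and do not reflect the actual pipeline API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: turn every detected interactable icon and OCR text box
# into a short text line, so the planner prompt carries each element's
# semantics instead of only an overlaid numeric ID.

@dataclass
class Icon:
    box: tuple         # (x1, y1, x2, y2) from the interactable-region detector
    description: str   # produced by the fine-tuned icon-captioning model

@dataclass
class TextBox:
    box: tuple         # (x1, y1, x2, y2) from OCR
    content: str       # the detected text itself

def build_local_semantics(icons: List[Icon], texts: List[TextBox]) -> str:
    """Compose the 'local semantics' section appended to the planner prompt."""
    lines = []
    for i, icon in enumerate(icons):
        lines.append(f"Icon ID {i}: {icon.description}, box={icon.box}")
    for j, t in enumerate(texts, start=len(icons)):
        lines.append(f"Text ID {j}: '{t.content}', box={t.box}")
    return "\n".join(lines)

if __name__ == "__main__":
    icons = [Icon((10, 10, 42, 42), "gear icon that opens the settings menu")]
    texts = [TextBox((60, 12, 180, 40), "Sign in")]
    print(build_local_semantics(icons, texts))
```

The planner then selects the next action by referring to these IDs, rather than having to infer each element's meaning from pixels and box overlays alone.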