UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Paper · arXiv 2409.04081 · Published September 6, 2024

Inspired by the success of self-supervised learning (SSL) techniques such as Joint Embedding Predictive Architectures (JEPA) [10] and its variants [3, 5], we propose UI-JEPA, a lightweight video-to-text model specialized for UI activities. UI-JEPA comprises a JEPA-based encoder, trained with novel temporal masking strategies on unlabeled UI video data to learn abstract feature representations, and an LLM decoder that predicts user intent from these features. Our key insight, inspired by the Predictive Feature Principle [15], is that predicting fully masked frames from unmasked frames forces the model to capture temporal relationships and grasp the meaning of a task. We demonstrate that fine-tuning an LLM decoder conditioned on UI-JEPA representations requires only a fraction of the paired video-text data and compute required by state-of-the-art MLLMs. This makes the framework particularly valuable when high-quality labeled data are scarce.
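
To make the masked-prediction objective concrete, below is a minimal PyTorch sketch of JEPA-style temporal masking: a context encoder sees only the unmasked frames, a predictor regresses the representations of the fully masked frames, and the regression targets come from a separate target encoder (an EMA copy of the context encoder in I-JEPA/V-JEPA). The module names (`context_encoder`, `predictor`, `target_encoder`) and their signatures are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: context_encoder, target_encoder (EMA copy of the
# context encoder), and predictor are illustrative stand-ins, not the
# paper's exact architecture.

def jepa_temporal_masking_loss(frames, context_encoder, target_encoder,
                               predictor, mask_ratio=0.5):
    """Predict representations of fully masked frames from unmasked ones.

    frames: (B, T, C, H, W) clip of UI screen recordings.
    """
    B, T = frames.shape[:2]

    # Sample a temporal mask: True marks frames hidden from the context encoder.
    mask = torch.rand(B, T, device=frames.device) < mask_ratio

    # Encode only the visible (unmasked) frames into a context.
    # Assumed signature: takes frames plus a visibility mask and returns
    # one embedding per frame position, shape (B, T, D).
    context = context_encoder(frames, visible=~mask)

    # Predict embeddings at the masked positions from the visible context.
    predicted = predictor(context, mask)                # (B, T, D)

    # Targets come from the EMA encoder seeing all frames; stop-gradient
    # keeps the target branch fixed, as in I-JEPA/V-JEPA.
    with torch.no_grad():
        targets = target_encoder(frames)                # (B, T, D)

    # Regress predicted embeddings onto targets at the masked positions only,
    # so the loss lives in representation space rather than pixel space.
    loss = F.smooth_l1_loss(predicted[mask], targets[mask])
    return loss
```

Because the loss is computed in feature space rather than pixel space, the encoder is free to discard rendering details of the UI and keep the abstract, task-relevant structure of the interaction.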

Various machine learning models have been proposed to enhance UI understanding. Early efforts [4, 8] focused primarily on pretraining transformer-based models on large-scale unlabeled UI data to learn generic feature representations at the UI-component level. Other approaches [24] augment model training with semantic information and heuristics to improve UI detection. However, because these methods are limited to individual UI components, they often fail to grasp the concept of a task or to learn comprehensive visual representations. Some approaches [22] use crawler models tailored to specific tasks, but such models scale poorly across large numbers of tasks and generalize poorly to unseen ones. Additionally, methods [23] that integrate image encoders with LLMs are generally confined to basic UI tasks, such as icon recognition and widget listing, and operate on static images, which prevents them from learning temporal relationships or the concept of a task.

In contrast, our approach processes videos that capture sequences of UI actions during task execution. A JEPA-based encoder learns video representations through self-supervised learning, and an LLM decoder maps these representations to textual descriptions of user intent. This method not only captures the temporal dynamics of UI interactions but also offers a more holistic understanding of user tasks.
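
As a rough illustration of this encoder-to-decoder pipeline, the sketch below conditions an LLM on the video encoder's features by projecting them into the LLM's token-embedding space and prepending them to a text prompt as a soft prefix. It assumes a Hugging Face-style decoder-only model and tokenizer; `video_encoder`, `projection`, and the prompt string are hypothetical stand-ins rather than the paper's implementation.

```python
import torch

# Illustrative inference pipeline: video_encoder is the (frozen) JEPA-based
# encoder, projection maps its features into the LLM's embedding space, and
# llm is a decoder-only language model. All names are assumptions.

@torch.no_grad()
def predict_user_intent(frames, video_encoder, projection, llm, tokenizer,
                        prompt="Describe the user's intent:"):
    """frames: (1, T, C, H, W) screen-recording clip of one UI task."""
    # Abstract per-frame representations from the self-supervised encoder.
    features = video_encoder(frames)                    # (1, T, D_enc)

    # Project visual features into the LLM's token-embedding space so they
    # can be consumed as a soft prefix ahead of the text prompt.
    visual_prefix = projection(features)                # (1, T, D_llm)

    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = llm.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([visual_prefix, prompt_embeds], dim=1)

    # Autoregressively decode a natural-language intent description.
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In this setup only the projection layer and (optionally) the LLM need paired video-text supervision, which is consistent with the claim above that fine-tuning requires far less labeled data than training an MLLM end to end.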