Design & LLM Interaction · LLM Reasoning and Architecture · Conversational AI Systems

Can unlabeled UI video teach models what users intend?

Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.

Note · 2026-05-03 · sourced from Tool Computer Use

UI-JEPA argues that prior UI-understanding approaches misframe the problem. Pretrained UI transformers operate at the component level and miss the concept of a task. Image-encoder-plus-LLM systems handle static screenshots and miss temporal structure — they can list widgets but cannot understand what a sequence of UI actions accomplishes. Crawler-based systems handle specific tasks but generalize poorly to unseen ones.

The hypothesis is that user intent is a temporal property of UI activity, not a spatial property of any frame. UI-JEPA therefore processes video sequences of UI actions during task execution, training a JEPA-based encoder with temporal masking on unlabeled UI video — predicting fully masked frames from unmasked frames. Because predicting masked frames forces the encoder to capture temporal relationships and task structure, the resulting representations encode what the user is trying to do, not just what is on the screen.
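A minimal sketch of that objective, assuming PyTorch; the module sizes, the toy transformer predictor, and the frame-feature input are illustrative stand-ins rather than UI-JEPA's actual architecture. The point it shows is that the loss lives in representation space: a predictor sees only the unmasked frames and must match a target encoder's embeddings of the masked ones.

```python
# Sketch of JEPA-style temporal masking on UI video (illustrative, not UI-JEPA's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Per-frame encoder: stands in for a ViT-style backbone over screen frames."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    def forward(self, frames):            # frames: (B, T, in_dim) pre-extracted frame features
        return self.proj(frames)          # (B, T, dim)


class TemporalPredictor(nn.Module):
    """Predicts embeddings of masked frames from the visible context frames."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, layers)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, context, mask):     # mask: (B, T) bool, True where a frame is hidden
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(context), context)
        return self.transformer(x)        # (B, T, dim)


def jepa_loss(frames, mask, encoder, target_encoder, predictor):
    """Predict the target encoder's embeddings of masked frames; loss is in latent space."""
    with torch.no_grad():                                  # target branch (e.g. an EMA copy) is frozen
        targets = target_encoder(frames)                   # (B, T, dim)
    context = encoder(frames) * (~mask).unsqueeze(-1)      # hide masked frames from the context branch
    preds = predictor(context, mask)
    return F.smooth_l1_loss(preds[mask], targets[mask])
```

Because the targets are embeddings rather than pixels, the encoder is pushed toward features that explain how one UI state leads to the next, which is where task structure lives.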

The decoder side is an LLM conditioned on these representations to produce textual user-intent descriptions. The empirical claim that earns its keep is data efficiency: fine-tuning the decoder requires a fraction of the paired video-text data and compute that SOTA MLLMs need. Because labeled UI video is scarce and expensive, the architecture trades the bottleneck of paired labels for the abundance of unlabeled screen recordings.
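One plausible wiring of that decoder, sketched with a small Hugging Face causal LM as a stand-in: the frozen encoder's per-frame representations are projected into the LLM's embedding space as prefix tokens, and only the thin projection (plus whatever decoder fine-tuning budget one allows) is trained on paired video-text examples. The model choice, the prefix-projection scheme, and the dimensions are assumptions for illustration, not UI-JEPA's exact decoder.

```python
# Sketch: condition a frozen causal LM on JEPA video representations via prefix embeddings.
# GPT-2 and the 256-dim encoder output are placeholder choices, not UI-JEPA's actual decoder.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
llm = AutoModelForCausalLM.from_pretrained("gpt2")
for p in llm.parameters():                         # freeze the LLM; only the projector trains
    p.requires_grad = False

proj = nn.Linear(256, llm.config.n_embd)           # map JEPA encoder dim -> LLM embedding dim


def intent_loss(video_repr, intent_text):
    """video_repr: (B, T, 256) frozen encoder outputs; intent_text: list of B target strings."""
    prefix = proj(video_repr)                                        # (B, T, n_embd) prefix tokens
    tokens = tokenizer(intent_text, return_tensors="pt", padding=True)
    text_emb = llm.get_input_embeddings()(tokens.input_ids)          # (B, L, n_embd)
    inputs = torch.cat([prefix, text_emb], dim=1)
    attn = torch.cat(
        [torch.ones(prefix.shape[:2], dtype=torch.long), tokens.attention_mask], dim=1
    )
    # Supervise only the text positions; prefix and padding positions get the ignore index.
    text_labels = tokens.input_ids.masked_fill(tokens.attention_mask == 0, -100)
    prefix_labels = torch.full(prefix.shape[:2], -100, dtype=torch.long)
    labels = torch.cat([prefix_labels, text_labels], dim=1)
    return llm(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss
```

Keeping the LLM frozen and training only the projection is one way to make the supervised layer thin, which is the data-efficiency argument in concrete form.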

The broader implication is a separation of concerns: temporal/structural understanding learned self-supervised on unlabeled streams, semantic intent inference layered via a small LLM decoder on top. When labeled data is scarce, the right move is to push the learning into self-supervision and keep the supervised layer thin. This is the same architectural move as in "Why do vision-only GUI agents struggle with screen interpretation?": factor the perception sub-problem out of the foundation model and hand it the structured signal it can actually use.


Source: Tool Computer Use

Related concepts in this collection

Concept map
14 direct connections · 101 in 2-hop network · medium cluster


Original note title

predictive video masking on UI activity learns user intent without paired text — JEPA-style self-supervision turns unlabeled screen recordings into a usable signal