SYNTHESIS NOTE

Can reference resolution work as a language modeling problem?

Can conversational, background, and on-screen references be resolved by converting them into text and using language models instead of specialized multimodal systems? This matters because it could enable efficient, on-device reference understanding.

Synthesis note · 2026-06-03 · sourced from Question Answer Search

Understanding ambiguous references ("they", "that", "the second one") is essential for a natural assistant, and the context that resolves them includes not just prior turns but non-conversational entities — things on the user's screen or running in the background. ReALM's move is to convert reference resolution of all these types into a language-modeling problem: encode entity candidates as natural text, and — critically — represent on-screen entities with a novel textual encoding that summarizes the screen while preserving the relative spatial positions of elements. Reduced to text this way, even on-screen references become tractable for an LM. The payoff is efficiency: the smallest ReALM model achieves over 5% absolute gains on on-screen references versus a prior system, matches GPT-4 overall, and outperforms GPT-4 on domain-specific utterances — all small enough to run on-device.
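One way to picture such a position-preserving encoding is the minimal sketch below. It is not ReALM's actual implementation: the `ScreenEntity` type, the `encode_screen` function, the coordinate convention (bounding-box centers, with y growing downward), and the `margin` value are all hypothetical names and assumptions introduced for illustration. Entities are ordered top to bottom and then left to right; entities whose vertical centers fall within the margin share a line and are separated by tabs, and each new visual row starts a new line.

```python
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    """Hypothetical on-screen element: its text plus bounding-box center."""
    text: str
    x: float  # horizontal center
    y: float  # vertical center (assumed to grow downward)


def encode_screen(entities, margin=10.0):
    """Serialize entities to text, top-to-bottom then left-to-right.

    Entities whose vertical centers lie within `margin` of the first
    entity in the current row are treated as one visual row.
    """
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    rows = []
    for ent in ordered:
        if rows and abs(ent.y - rows[-1][0].y) <= margin:
            rows[-1].append(ent)   # same visual row
        else:
            rows.append([ent])     # start a new row
    lines = []
    for row in rows:
        row.sort(key=lambda e: e.x)  # left-to-right within the row
        lines.append("\t".join(e.text for e in row))
    return "\n".join(lines)


screen = [
    ScreenEntity("Email", 20, 200),
    ScreenEntity("555-0100", 120, 98),
    ScreenEntity("Contact Us", 50, 20),
    ScreenEntity("Call", 20, 100),
]
encoded = encode_screen(screen)
print(encoded)
```

The resulting string can be placed in a prompt alongside the user's utterance, so that a text-only model sees "Call" and "555-0100" on the same line and can relate a reference like "call that number" to the adjacent entity.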

The key takeaway is the representational trick: a hard, partly spatial problem is solved by finding the right textual encoding (screen layout serialized to text with relative positions preserved), letting a small model do what seemed to require a frontier multimodal one.

This connects the conversation and screen-understanding threads. It is the text-encoding counterpart to the pixel-based screen work: where ScreenAI (see the note "Can one model understand both UIs and infographics equally well?") reads screens visually, ReALM serializes screen entities to text. Both show small specialized models matching frontier models on screen tasks.

Original note title

reference resolution including on-screen entities can be cast as language modeling and a small model matches GPT-4