Can reference resolution work as a language modeling problem?

Can conversational, background, and on-screen references be resolved by converting them into text and using language models instead of specialized multimodal systems? This matters because it could enable efficient, on-device reference understanding.

Synthesis note · 2026-06-03 · sourced from Question Answer Search

Understanding ambiguous references ("they", "that", "the second one") is essential for a natural assistant. The context that resolves them includes not just prior turns but also non-conversational entities: things on the user's screen or running in the background. ReALM's move is to convert reference resolution of all these types into a language-modeling problem. Entity candidates are encoded as natural text and, critically, on-screen entities are represented with a novel textual encoding that summarizes the screen while preserving the relative spatial positions of elements. Reduced to text this way, even on-screen references become tractable for an LM. The payoff is efficiency: the smallest ReALM model achieves absolute gains of over 5% on on-screen references versus a prior system, matches GPT-4 overall, and outperforms GPT-4 on domain-specific utterances, all while being small enough to run on-device.
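To make "encode entity candidates as natural text" concrete, here is a minimal sketch of how candidates and a request might be flattened into a single prompt for an LM. The prompt layout, the `Entity` fields, and the function names `build_prompt` and `parse_selection` are all hypothetical illustrations; this note does not specify ReALM's actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    kind: str  # hypothetical type label, e.g. "phone_number"
    text: str  # the entity as it would be shown to the model


def build_prompt(turns, entities, query):
    """Flatten prior turns, candidate entities, and the request into text."""
    lines = ["Conversation:"]
    lines += [f"  {turn}" for turn in turns]
    lines.append("Candidate entities:")
    # Number the candidates so the model can answer with indices.
    lines += [f"  {i}. [{e.kind}] {e.text}" for i, e in enumerate(entities, start=1)]
    lines.append(f"Request: {query}")
    lines.append("Relevant entity numbers:")
    return "\n".join(lines)


def parse_selection(reply, entities):
    """Map a reply such as '1, 3' back to the chosen entities."""
    picked = []
    for token in reply.replace(",", " ").split():
        if token.isdecimal() and 1 <= int(token) <= len(entities):
            picked.append(entities[int(token) - 1])
    return picked


entities = [
    Entity("business", "Main Street Pharmacy"),
    Entity("phone_number", "555-0100"),
]
prompt = build_prompt(["User: find a pharmacy near me"], entities, "call that number")
print(prompt)
```

The model's completion is then just a short string of indices, which `parse_selection` maps back to entities. The same text interface serves conversational, background, and on-screen candidates.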

The keeper is the representational trick: a hard, partly-spatial problem is solved by finding the right textual encoding (spatial layout serialized to text with preserved positions), letting a small model do what seemed to need a frontier multimodal one.
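One way such a position-preserving encoding could look is sketched below: entities are sorted top-to-bottom, those at roughly the same height are grouped into one row, and rows are emitted left-to-right. The `ScreenEntity` fields, the `line_margin` threshold, and the tab/newline layout are assumptions made for illustration, not details confirmed by this note.

```python
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    text: str
    x: float  # horizontal centre of the element's bounding box (assumed input)
    y: float  # vertical centre of the element's bounding box (assumed input)


def serialize_screen(entities, line_margin=10.0):
    """Render screen entities as text, keeping their relative layout.

    Entities whose vertical centres lie within `line_margin` of the first
    entity in a row are treated as sharing that row (an assumed heuristic).
    """
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    rows = []
    for entity in ordered:
        if rows and abs(entity.y - rows[-1][0].y) <= line_margin:
            rows[-1].append(entity)
        else:
            rows.append([entity])
    # Within a row, order left-to-right; separate columns with tabs.
    return "\n".join(
        "\t".join(e.text for e in sorted(row, key=lambda e: e.x)) for row in rows
    )


screen = [
    ScreenEntity("555-0100", x=120, y=53),
    ScreenEntity("Contact us", x=50, y=10),
    ScreenEntity("Email", x=20, y=90),
    ScreenEntity("Call", x=20, y=50),
    ScreenEntity("a@b.example", x=120, y=88),
]
print(serialize_screen(screen))
```

In the resulting text, "Call" sits on the same line as "555-0100" and above the "Email" row. A text-only model can therefore read "the number next to Call" or "the second one" off the layout without seeing pixels.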

This connects the conversation and screen-understanding threads. It is the text-encoding counterpart to the pixel-based screen work: where ScreenAI (see "Can one model understand both UIs and infographics equally well?") reads screens visually, ReALM serializes screen entities to text. Both show small specialized models matching frontier models on screen tasks.

Related concepts in this collection 2

Can one model understand both UIs and infographics equally well? Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?
pixel-based vs text-serialized routes to screen understanding; both let small models match frontier ones
Can bounding boxes replace image encoders for document understanding? Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.
both recover spatial structure cheaply (bounding boxes / serialized positions) instead of full visual encoding


Original note title

reference resolution including on-screen entities can be cast as language modeling and a small model matches GPT-4
