Can reference resolution work as a language modeling problem?
Can conversational, background, and on-screen references be resolved by converting them into text and using language models instead of specialized multimodal systems? This matters because it could enable efficient, on-device reference understanding.
Understanding ambiguous references ("they", "that", "the second one") is essential for a natural assistant, and the context that resolves them includes not just prior turns but also non-conversational entities: things on the user's screen or running in the background. ReALM's move is to convert reference resolution of all these types into a language-modeling problem: encode entity candidates as natural text and, critically, represent on-screen entities with a novel textual encoding that summarizes the screen while preserving the relative spatial positions of elements. Reduced to text this way, even on-screen references become tractable for an LM. The payoff is efficiency: the smallest ReALM model achieves over 5% absolute gains on on-screen references versus a prior system, performs comparably to GPT-4 overall, and outperforms GPT-4 on domain-specific utterances, all while being small enough to run on-device.
The keeper is the representational trick: a hard, partly-spatial problem is solved by finding the right textual encoding (spatial layout serialized to text with preserved positions), letting a small model do what seemed to need a frontier multimodal one.
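To make the idea of a position-preserving textual encoding concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not ReALM's actual implementation: the `ScreenEntity` type, the `encode_screen` function, and the `line_margin` threshold are all hypothetical names and values. The sketch assumes each on-screen entity arrives with its text and a bounding box, orders entities top-to-bottom and left-to-right by box center, puts entities whose vertical centers are close on the same text line (tab-separated), and starts a new line otherwise.

```python
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    """Hypothetical on-screen entity: its text plus a bounding box in pixels."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self) -> tuple[float, float]:
        # (x, y) of the middle of the bounding box
        return (self.left + self.width / 2, self.top + self.height / 2)


def encode_screen(entities: list[ScreenEntity], line_margin: float = 10.0) -> str:
    """Serialize entities to text, roughly preserving their relative layout.

    Assumption: entities whose vertical centers lie within `line_margin`
    of the first entity in a row are treated as being on the same line.
    """
    # Order top-to-bottom, then left-to-right.
    ordered = sorted(entities, key=lambda e: (e.center[1], e.center[0]))

    rows: list[tuple[float, list[ScreenEntity]]] = []
    for entity in ordered:
        y = entity.center[1]
        if rows and abs(y - rows[-1][0]) <= line_margin:
            rows[-1][1].append(entity)   # same visual line
        else:
            rows.append((y, [entity]))   # start a new visual line

    lines = []
    for _, row in rows:
        row.sort(key=lambda e: e.center[0])            # left-to-right in a line
        lines.append("\t".join(e.text for e in row))   # tab = horizontal neighbor
    return "\n".join(lines)


if __name__ == "__main__":
    screen = [
        ScreenEntity("Open hours", left=10, top=60, width=100, height=20),
        ScreenEntity("555-0100", left=150, top=12, width=80, height=20),
        ScreenEntity("Contact us", left=10, top=10, width=100, height=20),
    ]
    print(encode_screen(screen))
```

On the toy screen above, "Contact us" and "555-0100" end up on one tab-separated line and "Open hours" on the next, so a text-only model can still tell which items sit beside or below each other. The resulting string can then be placed in the model's input alongside the conversation; how candidates are tagged and how the prompt is laid out are design choices this sketch leaves open.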
This connects the conversation and screen-understanding threads. It is the text-encoding counterpart to the pixel-based screen work: where ScreenAI (see "Can one model understand both UIs and infographics equally well?") reads screens visually, ReALM serializes screen entities to text. Both show small specialized models matching frontier models on screen tasks.
Related concepts in this collection 2
- Can one model understand both UIs and infographics equally well?
Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?
pixel-based vs text-serialized routes to screen understanding; both let small models match frontier ones
- Can bounding boxes replace image encoders for document understanding?
Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.
both recover spatial structure cheaply (bounding boxes / serialized positions) instead of full visual encoding
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- ReALM: Reference Resolution As Language Modeling
- Can Large Language Models Understand Context?
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
- OmniParser for Pure Vision Based GUI Agent
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search
- UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
Original note title
reference resolution including on-screen entities can be cast as language modeling and a small model matches GPT-4