Can bounding boxes replace image encoders for document understanding?
Explores whether spatial layout information, encoded as bounding boxes alongside the text, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.
Enterprise documents — forms, invoices, contracts, receipts — carry meaning at the intersection of text and spatial layout, and most multimodal LLMs handle this with heavy image encoders. DocLLM's design choice is to drop the image encoder entirely and use only bounding-box information to incorporate spatial structure. It captures the cross-alignment between text and spatial modalities by decomposing the classical transformer attention into a set of disentangled matrices (separating textual and spatial contributions), and pretrains with a text-segment infilling objective suited to the irregular layouts and heterogeneous content of real documents.
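To make the disentangled-attention idea concrete, here is a minimal NumPy sketch rather than DocLLM's actual implementation. It assumes the attention score is a sum of four terms (text-to-text, text-to-spatial, spatial-to-text, spatial-to-spatial), with the three cross and spatial terms scaled by scalar weights. The parameter names (`Wq_t`, `Wk_s`, `W_box`), the linear box embedding, the causal mask, and the choice to take values from the text stream only are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked positions.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, rng):
    # Hypothetical parameter names: separate query/key projections per modality.
    names = ["Wq_t", "Wk_t", "Wv_t", "Wq_s", "Wk_s"]
    params = {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in names}
    # Assumed linear embedding from (x0, y0, x1, y1) to d dimensions.
    params["W_box"] = rng.normal(scale=0.5, size=(4, d))
    return params


def embed_boxes(boxes, w_box):
    # boxes: (n, 4) array of page-normalized (x0, y0, x1, y1) coordinates in [0, 1].
    return boxes @ w_box


def disentangled_attention(text_h, box_h, params, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Single-head attention whose scores sum separate text and spatial terms.

    text_h: (n, d) text hidden states; box_h: (n, d) embedded bounding boxes.
    Returns the (n, d) output and the (n, n) attention weights.
    """
    d = text_h.shape[-1]
    q_t, k_t = text_h @ params["Wq_t"], text_h @ params["Wk_t"]
    q_s, k_s = box_h @ params["Wq_s"], box_h @ params["Wk_s"]
    v_t = text_h @ params["Wv_t"]  # assumption: values come from the text stream only

    scores = (
        q_t @ k_t.T                 # text -> text
        + lam_ts * (q_t @ k_s.T)    # text -> spatial
        + lam_st * (q_s @ k_t.T)    # spatial -> text
        + lam_ss * (q_s @ k_s.T)    # spatial -> spatial
    ) / np.sqrt(d)

    # Causal mask: each token attends only to itself and earlier tokens.
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    return weights @ v_t, weights


# Usage: four tokens laid out as a 2x2 grid of fields on a page.
rng = np.random.default_rng(0)
n, d = 4, 8
params = init_params(d, rng)
text_h = rng.normal(size=(n, d))
boxes = np.array([
    [0.10, 0.10, 0.30, 0.14],
    [0.35, 0.10, 0.60, 0.14],
    [0.10, 0.20, 0.30, 0.24],
    [0.35, 0.20, 0.60, 0.24],
])
out, weights = disentangled_attention(text_h, embed_boxes(boxes, params["W_box"]), params)
print(out.shape)  # (4, 8)
```

Setting all three weights to zero reduces the sketch to ordinary causal text attention, which is one way to see that the layout signal enters only through extra score terms and not through a separate pixel pathway.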
The keeper is the cheap-spatial-signal move: bounding boxes are a lightweight, structured stand-in for full visual encoding, and disentangled attention lets the model reason over layout without the cost and brittleness of pixel encoders. DocLLM also gestures at a broader claim: layout-aware pretraining lets language models go beyond plain-text next-token prediction and treat documents as inherently structured knowledge. That suggests e-books and other richly formatted corpora could be incorporated into pretraining without heavy preprocessing.
This sits in the multimodal/document corner of the vault as the spatial-but-not-visual design point. It contrasts with the strong-vision GUI position in the note "Do text-based GUI agents actually work in the real world?". Where that note argues that real GUI deployment needs pixels, DocLLM argues that for layout-structured documents, bounding boxes recover most of the spatial signal at far lower cost.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- What visual patterns transfer between infographic and UI tasks when trained jointly?
- How does annotation-based pretraining compare to self-supervised video masking for screen understanding?
- What document layouts benefit most from bounding box representations?
- Why do GUI agents need pixels while document systems can use bounding boxes?
- How does serializing screen layout to text preserve spatial relationships?
- Can text-based and vision-based screen understanding achieve similar performance?
Related concepts in this collection
- Do text-based GUI agents actually work in the real world?
  Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
  contrast: pixels-required for GUIs vs bounding-boxes-suffice for layout-structured documents
- Can one model understand both UIs and infographics equally well?
  Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?
  adjacent document/UI understanding via a different (annotation-schema) route
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- DocLLM: A layout-aware generative language model for multimodal document understanding
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
- Nested Attention: Semantic-aware Attention Values for Concept Personalization
- Pixels, Patterns, but No Poetry: To See The World like Humans
- Training for Compositional Sensitivity Reduces Dense Retrieval Generalization
- Searching for Best Practices in Retrieval-Augmented Generation
- Emerging Properties in Unified Multimodal Pretraining
Original note title
layout-aware document understanding via bounding-box spatial signal and disentangled attention avoids expensive image encoders