Can bounding boxes replace image encoders for document understanding?

Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.

Synthesis note · 2026-06-03 · sourced from Multimodal

Enterprise documents — forms, invoices, contracts, receipts — carry meaning at the intersection of text and spatial layout, and most multimodal LLMs handle this with heavy image encoders. DocLLM's design choice is to drop the image encoder entirely and use only bounding-box information to incorporate spatial structure. It captures the cross-alignment between text and spatial modalities by decomposing the classical transformer attention into a set of disentangled matrices (separating textual and spatial contributions), and pretrains with a text-segment infilling objective suited to the irregular layouts and heterogeneous content of real documents.
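
The decomposition described above can be sketched in a few lines. This is an illustrative reading, not DocLLM's implementation: it assumes the attention score is a sum of four query-key products (text-text, text-spatial, spatial-text, spatial-spatial), that the three cross terms carry scalar weights, and that values come from the text stream only. The function name, the `lambdas` parameter, and the toy embeddings are all hypothetical, and causal masking and multiple heads are left out to keep the sketch short.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def disentangled_attention(text, boxes, w, lambdas=(1.0, 1.0, 1.0)):
    """Single-head attention over separate text and box embeddings.

    text, boxes: n x d matrices (one row per token).
    w: projection matrices keyed q_t, k_t, q_s, k_s, v_t (all assumed d x d).
    lambdas: assumed scalar weights for the three terms that involve boxes.
    """
    q_t, k_t = matmul(text, w["q_t"]), matmul(text, w["k_t"])
    q_s, k_s = matmul(boxes, w["q_s"]), matmul(boxes, w["k_s"])
    values = matmul(text, w["v_t"])  # assumption: values are text-only

    tt = matmul(q_t, transpose(k_t))  # text query, text key
    ts = matmul(q_t, transpose(k_s))  # text query, spatial key
    st = matmul(q_s, transpose(k_t))  # spatial query, text key
    ss = matmul(q_s, transpose(k_s))  # spatial query, spatial key

    lam_ts, lam_st, lam_ss = lambdas
    scale = math.sqrt(len(text[0]))
    n = len(text)
    scores = [
        [
            (tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j] + lam_ss * ss[i][j]) / scale
            for j in range(n)
        ]
        for i in range(n)
    ]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, values)


# Toy inputs: three tokens, two dimensions, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
w = {name: identity for name in ("q_t", "k_t", "q_s", "k_s", "v_t")}
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
boxes = [[0.1, 0.1], [0.9, 0.1], [0.1, 0.2]]
weights, out = disentangled_attention(text, boxes, w)
```

Setting all three weights to zero reduces this to ordinary text-only attention, which makes the separation concrete: layout enters only through the extra score terms, and no pixel features are involved at any point.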

The keeper is the cheap-spatial-signal move: bounding boxes are a lightweight, structured stand-in for full visual encoding, and disentangled attention lets the model reason over layout without the cost and brittleness of pixel encoders. The broader claim DocLLM gestures at — that layout-aware pretraining lets language models go beyond plain-text next-token prediction to treat documents as inherently structured knowledge — points at incorporating e-books and richly-formatted corpora into pretraining without heavy preprocessing.
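
To make the "lightweight stand-in" point concrete, here is a minimal sketch of what the spatial input could look like: each OCR word becomes a token plus four integers. The 0 to 1000 coordinate grid is a convention used by some layout-aware models and is assumed here, not confirmed for DocLLM; `LayoutToken`, `normalize_box`, and `to_layout_tokens` are hypothetical names, and the OCR words are made up.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the normalized grid


@dataclass(frozen=True)
class LayoutToken:
    text: str
    box: Box


def normalize_box(box: Sequence[float], page_width: float, page_height: float,
                  grid: int = 1000) -> Box:
    """Map page coordinates onto a fixed integer grid, clamping to its bounds."""
    def scale(value: float, extent: float) -> int:
        return max(0, min(grid, round(grid * value / extent)))

    x0, y0, x1, y1 = box
    return (scale(x0, page_width), scale(y0, page_height),
            scale(x1, page_width), scale(y1, page_height))


def to_layout_tokens(ocr_words, page_width: float, page_height: float) -> List[LayoutToken]:
    """Turn (text, box) pairs from an OCR pass into layout tokens."""
    return [LayoutToken(text, normalize_box(box, page_width, page_height))
            for text, box in ocr_words]


# Hypothetical OCR output for an 850 x 1100 pixel page.
ocr_words = [
    ("Invoice", (85, 110, 255, 132)),
    ("Total", (85, 990, 170, 1012)),
]
tokens = to_layout_tokens(ocr_words, page_width=850, page_height=1100)
```

Under these assumptions the entire spatial signal for a word is four small integers, which is where the cost gap to a pixel encoder comes from.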

This sits in the multimodal/document corner of the vault as the spatial-but-not-visual design point. It contrasts with the strong-vision GUI position of the note "Do text-based GUI agents actually work in the real world?". Where GUI agents argue that real deployment needs pixels, DocLLM argues that for layout-structured documents, bounding boxes recover most of the spatial signal at far lower cost.

Related concepts in this collection

Do text-based GUI agents actually work in the real world? Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.
contrast: pixels-required for GUIs vs bounding-boxes-suffice for layout-structured documents
Can one model understand both UIs and infographics equally well? Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?
adjacent document/UI understanding via a different (annotation-schema) route
