Can hierarchical entity extraction from books enable both textual and visual reasoning?
This explores whether building a tiered map of who-and-what appears in a book — and treating its images as real nodes, not afterthoughts — lets a system reason over both the words and the pictures together.
This question is really asking whether structure beats brute force: instead of dumping a book into a model and hoping, can you extract its entities into a hierarchy and treat text and images as one connected fabric? The corpus has a direct answer in MegaRAG, which builds hierarchical multimodal knowledge graphs from both text and visuals and uses them to answer cross-chapter, global questions that flat chunk retrieval simply can't reach (see "Can multimodal knowledge graphs answer questions that flat retrieval cannot?"). The key move is two-fold: the *hierarchy* lets the system zoom between high-level summaries and page-specific details, while making images *first-class graph nodes* means a figure can be reasoned about alongside the sentences around it, not bolted on separately.
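To make the two-fold move concrete, here is a minimal sketch of a tiered graph in which an image is a node like any other. This is not MegaRAG's implementation: the `Node` and `BookGraph` classes, the node kinds, and the sample book content are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    node_id: str
    kind: str                     # "book", "chapter", "page", "entity", or "image"
    summary: str                  # text summary or image description
    parent: Optional[str] = None  # id of the containing node, if any


@dataclass
class BookGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def link(self, a: str, b: str) -> None:
        # Undirected relation between any two nodes, text or image alike.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def zoom_out(self, node_id: str) -> List[str]:
        # Walk up the hierarchy from a detail node to the book-level summary.
        path = []
        current = self.nodes[node_id].parent
        while current is not None:
            path.append(current)
            current = self.nodes[current].parent
        return path

    def neighbors(self, node_id: str, kind: Optional[str] = None) -> List[str]:
        # Related nodes, optionally filtered by kind (e.g. only images).
        return sorted(
            n for n in self.edges[node_id]
            if kind is None or self.nodes[n].kind == kind
        )


# Hypothetical sample data, invented for this sketch.
graph = BookGraph()
graph.add(Node("book", "book", "A voyage narrative"))
graph.add(Node("ch_3", "chapter", "The storm", parent="book"))
graph.add(Node("page_42", "page", "The mast breaks", parent="ch_3"))
graph.add(Node("fig_3_1", "image", "Engraving of the broken mast", parent="page_42"))
graph.add(Node("captain", "entity", "The ship's captain"))
graph.link("captain", "fig_3_1")
graph.link("captain", "page_42")

print(graph.zoom_out("fig_3_1"))
print(graph.neighbors("captain", kind="image"))
```

Because the figure sits in the same structure as pages and entities, the same two operations (walking up the hierarchy, following relations) work for it without a separate code path.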
Why not just feed the whole book to a long-context model? Because length quietly breaks reasoning. One study shows accuracy collapsing from 92% to 68% with only 3,000 tokens of padding, far below the context limit, and chain-of-thought prompting doesn't rescue it (see "Does reasoning ability actually degrade with longer inputs?"). And long-context models, while they can match retrieval on semantic lookups, fail on structured, relational queries that require joining facts across entities (see "Can long-context LLMs replace retrieval-augmented generation systems?"). That's exactly the gap a hierarchical entity graph fills: it pre-computes the joins the model can't reliably do on the fly.
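What "pre-computing the joins" can look like is sketched below under simple assumptions: an inverted index from entity to the chapters that mention it, so that a relational question becomes a set intersection. The mention data and the function names `build_index` and `chapters_with_all` are hypothetical, and a real extraction pipeline would produce far richer records.

```python
from collections import defaultdict

# Hypothetical (chapter, entity) mentions, as an extraction pass might emit them.
mentions = [
    ("ch1", "Elena"), ("ch1", "lighthouse"),
    ("ch4", "Elena"), ("ch4", "Marcus"),
    ("ch9", "Marcus"), ("ch9", "lighthouse"), ("ch9", "Elena"),
]


def build_index(pairs):
    # Map each entity to the set of chapters where it appears.
    index = defaultdict(set)
    for chapter, entity in pairs:
        index[entity].add(chapter)
    return index


def chapters_with_all(index, *entities):
    # Join across entities: chapters in which every named entity appears.
    sets = [index.get(e, set()) for e in entities]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(mentions)
print(chapters_with_all(index, "Elena", "Marcus"))
print(chapters_with_all(index, "Elena", "Marcus", "lighthouse"))
```

The join is done once at indexing time and answered by lookup, so the model never has to hold three scattered passages in its context and connect them itself.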
The "visual" half of the question has a quietly surprising answer elsewhere in the corpus. SignRAG shows that the most reliable bridge between an image and a knowledge base isn't raw embedding similarity; it's *describing the image in natural language* with a vision-language model, then retrieving against text (see "Can describing images in text improve zero-shot recognition?"). So "visual reasoning" over a book may not require a separate visual pipeline at all; turning figures into rich text descriptions lets them live in the same entity graph as everything else, and reasoning happens in one shared space.
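The describe-then-retrieve pattern can be sketched as follows. This is an illustration of the idea, not SignRAG's pipeline: `describe_image` is a hypothetical stub that returns a canned caption where a real system would call a vision-language model, and bag-of-words cosine similarity stands in for whatever text retriever is actually used. The passages are invented.

```python
import math
import re
from collections import Counter


def describe_image(image_id):
    # Hypothetical stand-in for a vision-language model call.
    captions = {
        "fig_7": "A hand-drawn map of a river delta with trade routes "
                 "marked along the coast.",
    }
    return captions[image_id]


def tokenize(text):
    # Lowercased word counts; a deliberately crude text representation.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Counter returns 0 for missing keys, so absent tokens contribute nothing.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_text, passages):
    # Return the id of the passage most similar to the query text.
    query = tokenize(query_text)
    return max(passages, key=lambda pid: cosine(query, tokenize(passages[pid])))


# Hypothetical text passages from the same book.
passages = {
    "p12": "The map shows the river delta and the old trade routes along the coast.",
    "p57": "A bar chart compares annual rainfall across the three valleys.",
    "p90": "The portrait of the governor hangs in the council hall.",
}

best = retrieve(describe_image("fig_7"), passages)
print(best)
```

Once the figure is a caption, it is retrieved with the same machinery as any sentence, which is what lets images and text share one index.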
The structural choices matter too. Hierarchical architectures that separate query planning from answer synthesis outperform flat ones on exactly the multi-hop questions books demand (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"). When a single fact involves three or more entities bound together (a character, a place, and an event), hypergraph memory preserves that joint constraint instead of shattering it into pairwise edges (see "Can hypergraphs capture multi-hop reasoning better than graphs?"). And there's a live design tension worth knowing about: pre-building a giant graph over a whole corpus is costly and goes stale, which is why some systems construct query-specific logic graphs at inference time instead (see "Can query-time graph construction replace pre-built knowledge graphs?").
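The cost of shattering a three-way fact into pairwise edges shows up in a small example. The sketch below is not HGMem's data structure; it only assumes that a hyperedge can be modeled as a frozenset of entity names, and the facts are invented. Three true facts, once decomposed into pairs, appear to support a fourth fact that was never stated.

```python
from itertools import combinations

# Hypothetical (character, place, event) facts stored as hyperedges.
facts = [
    frozenset({"Alice", "Paris", "wedding"}),
    frozenset({"Alice", "Rome", "funeral"}),
    frozenset({"Bob", "Paris", "funeral"}),
]


def pairwise_edges(hyperedges):
    # Decompose every hyperedge into its binary edges, losing the joint binding.
    edges = set()
    for hyperedge in hyperedges:
        for a, b in combinations(sorted(hyperedge), 2):
            edges.add(frozenset({a, b}))
    return edges


def supported_pairwise(edges, candidate):
    # A binary graph can only check that each pair co-occurs somewhere.
    return all(
        frozenset(pair) in edges
        for pair in combinations(sorted(candidate), 2)
    )


def supported_hyper(hyperedges, candidate):
    # A hypergraph checks that the entities were bound together in one fact.
    return frozenset(candidate) in set(hyperedges)


edges = pairwise_edges(facts)
never_stated = {"Alice", "Paris", "funeral"}

print(supported_pairwise(edges, never_stated))  # True: each pair exists somewhere
print(supported_hyper(facts, never_stated))     # False: no single fact binds all three
```

Alice was in Paris, Alice attended a funeral, and a funeral happened in Paris, but those pairs come from three different facts. Only the hyperedge representation can tell the difference.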
So the honest answer is yes, with a twist: hierarchical entity extraction is what makes both textual and visual reasoning possible over something as large as a book — but the visual reasoning may be smuggled in through language, and the hierarchy you build (pre-baked vs. query-time, graph vs. hypergraph) is itself a design decision with real trade-offs.
Sources (7 notes)
MegaRAG builds hierarchical multimodal knowledge graphs from text and visuals to answer cross-chapter, global questions that flat chunk retrieval cannot reach. The hierarchy supports abstraction levels from high-level summaries to page-specific details while treating images as first-class graph nodes.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
The LOFT benchmark shows that long-context language models (LCLMs) match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
HGMem organizes retrieved evidence as hyperedges rather than flat lists or binary graphs, allowing three or more entities to bind into single relations without decomposition. This structure accumulates coherent knowledge across retrieval steps, trading representational complexity for constraint expressiveness.
LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.