How do logic units preserve document structure better than fixed-size chunking?

This explores why breaking documents into four-part 'logic units' (prerequisite, header, body, linker) keeps how-to instructions intact in ways that slicing text into equal-length chunks does not.

The core problem with fixed-size chunking is that it cuts text at arbitrary boundaries, every N tokens, with no regard for where one step ends and the next begins. For how-to content, that's fatal: a chunk might contain step three of a process while the prerequisite that makes step three make sense sits in a different chunk entirely, and nothing tells the retriever they belong together. THREAD's logic units fix this by making structure explicit rather than incidental (see 'How do logic units preserve procedural coherence better than chunks?'). Each unit carries its own prerequisite (what must be true first), a header (what this step is), a body (the actual content), and, crucially, a linker that points to the next step or branch. The linker is the part fixed-size chunks lack: it encodes the sequential dependency between pieces, so retrieval can walk a multi-step procedure instead of returning a disconnected fragment.
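The four-part unit and the linker traversal described above can be sketched roughly as follows. This is a minimal illustration, not THREAD's actual implementation: the `LogicUnit` class, the `walk` and `fixed_size_chunks` helpers, and the sample procedure are all hypothetical, and modelling the linker as a plain ID lookup is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogicUnit:
    """Hypothetical four-part unit: prerequisite, header, body, linker."""
    unit_id: str
    prerequisite: str        # what must be true before this step
    header: str              # what this step is
    body: str                # the actual instructions
    linker: Optional[str]    # id of the next unit, or None at the end


def walk(units: Dict[str, LogicUnit], start_id: str) -> List[LogicUnit]:
    """Follow linkers from start_id until the procedure ends."""
    path: List[LogicUnit] = []
    seen = set()
    current: Optional[str] = start_id
    while current is not None and current not in seen:
        seen.add(current)            # guard against accidental cycles
        unit = units[current]
        path.append(unit)
        current = unit.linker
    return path


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Baseline: cut every `size` characters, ignoring step boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative three-step procedure (made-up content).
units = {
    "s1": LogicUnit("s1", "Python is installed",
                    "Create a virtual environment",
                    "Run `python -m venv .venv`.", "s2"),
    "s2": LogicUnit("s2", "A virtual environment exists",
                    "Activate it",
                    "Run `source .venv/bin/activate`.", "s3"),
    "s3": LogicUnit("s3", "The environment is active",
                    "Install dependencies",
                    "Run `pip install -r requirements.txt`.", None),
}

procedure = walk(units, "s1")
headers = [u.header for u in procedure]

for unit in procedure:
    print(f"[needs: {unit.prerequisite}] {unit.header}: {unit.body}")
```

Retrieving any one unit gives the retriever a handle on the rest of the procedure through `linker`, whereas the strings returned by `fixed_size_chunks` carry no pointer to their neighbours.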

The deeper insight is that this is one instance of a broader pattern: matching the *shape* of stored knowledge to the *shape* of the question. StructRAG makes this explicit, showing that a router trained to pick the right structure (table, graph, algorithm, catalogue, or plain chunk) for what the query demands improves knowledge-intensive reasoning over standard retrieval (see 'Can routing queries to task-matched structures improve RAG reasoning?'). The authors ground it in 'cognitive fit' theory from cognitive science: reasoning is easier when the representation matches the task. Logic units are essentially the cognitive-fit answer for procedural how-to questions, the same way a table is the right fit for relational lookups.
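The routing idea can be sketched in a few lines. Note the large assumption here: StructRAG's router is a trained model, while the sketch below substitutes hand-written keyword rules purely to show the shape of the decision. The `RULES` table, its cue words, and the `route` function are all hypothetical.

```python
from typing import List, Tuple

# The five structure types named above.
STRUCTURES = ("table", "graph", "algorithm", "catalogue", "chunk")

# Hypothetical cue words standing in for a trained router.
RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("table", ("compare", "average", "how many")),
    ("graph", ("related to", "connected", "path between")),
    ("algorithm", ("how do i", "steps to", "procedure")),
    ("catalogue", ("summarize", "overview", "outline")),
]


def route(query: str) -> str:
    """Pick a structure type for the query; fall back to plain chunks."""
    q = query.lower()
    for structure, cues in RULES:
        if any(cue in q for cue in cues):
            return structure
    return "chunk"


for example in ("How do I rotate an API key?",
                "Compare latency across regions",
                "What is THREAD?"):
    print(example, "->", route(example))
```

The point of the sketch is the interface rather than the rules: the query is inspected first, and the choice of knowledge structure follows from it instead of being fixed at indexing time.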

What unites several notes here is that the thing chunking destroys is *discourse structure*: the document's sense of how its parts relate. MiA-RAG attacks the same loss from a different angle. Instead of restructuring the units, it summarizes the whole document first and conditions retrieval on that global map, so scattered evidence becomes findable by its role in the document rather than by surface word similarity (see 'Can building a document map first improve retrieval over long texts?'). Logic units preserve local sequential structure; global-summary-first retrieval preserves the document's overall architecture. Both are reacting to the same failure of 'bag-of-chunks' retrieval, just at different scales.
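A toy sketch can show what conditioning retrieval on a global summary buys. It is not MiA-RAG's method: word overlap stands in for a real retriever, the summary is written by hand rather than generated, and the `tokens` and `retrieve` helpers and the sample texts are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: List[str], summary: str = "") -> str:
    """Return the passage sharing the most words with query plus summary."""
    conditioned = tokens(query) | tokens(summary)
    return max(passages, key=lambda p: len(conditioned & tokens(p)))


# Made-up document: the summary records the role each part plays.
summary = ("The report argues that the outage was caused by an expired "
           "certificate, then lists remediation.")
passages = [
    "The outage page shows a banner during the outage.",
    "An expired certificate broke the handshake between services.",
]
query = "What caused the outage?"

plain = retrieve(query, passages)
mapped = retrieve(query, passages, summary)
print("query only:   ", plain)
print("query + map:  ", mapped)
```

With the query alone, the passage that merely repeats the word "outage" wins on surface overlap. Adding the summary pulls in the passage that actually plays the explanatory role in the document.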

It's worth noticing why you can't just dodge the whole problem by throwing the document into a long-context model. The LOFT benchmark shows long-context LLMs match RAG on semantic retrieval but cannot execute relational queries that require joins across structured tables, so context length alone doesn't recover relational structure (see 'Can long-context LLMs replace retrieval-augmented generation systems?'). And reasoning quality decays as inputs get longer: FLenQA reports accuracy dropping from 92% to 68% with just 3000 tokens of padding, far below context window capacity (see 'Does reasoning ability actually degrade with longer inputs?'). So preserving structure at indexing time isn't a nicety; it's doing work the model can't reliably do for itself at read time.

The thing you might not have expected to learn: 'better chunking' is the wrong frame. The papers converge on a different idea — that retrieval units should be designed around how a question will be *reasoned through*, not how text happens to be sliced. A logic unit's linker exists because answering a how-to question is a traversal, not a lookup. Once you see retrieval as matching structure to task, fixed-size chunking looks less like a baseline and more like the one structure that fits nothing in particular.

Sources 5 notes

How do logic units preserve procedural coherence better than chunks?

THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Can building a document map first improve retrieval over long texts?

MiA-RAG inverts standard RAG by summarizing documents first, then conditioning retrieval on that global view. This approach recovers discourse structure that bag-of-chunks retrieval destroys, making scattered evidence findable by its role in the document rather than by surface similarity alone.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
