Can we measure reading efficiency as a quality metric?
How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well-written but informationally sparse.
OmniThink defines Knowledge Density as KD = Σᵢ 𝟙(kᵢ is unique) / L, where the kᵢ are the atomic knowledge units extracted from the text, 𝟙 is an indicator that counts a unit only the first time it appears, and L is the text length in tokens. A high-KD text delivers novel atomic facts efficiently; a low-KD text repeats and elaborates the same points across more tokens. Low-KD content produces reader fatigue and disengagement; high-KD content enables efficient knowledge transfer.
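As a concrete illustration, here is a minimal sketch of the KD computation in Python. The paper's atomic knowledge units come from an extraction step (typically LLM-based); this sketch substitutes naive sentence segmentation as the unit proxy, exact string matching after normalization as the uniqueness indicator, and whitespace splitting as the tokenizer. All three are simplifying assumptions, not OmniThink's actual pipeline.

```python
import re

def knowledge_density(text: str) -> float:
    """Approximate KD: unique knowledge units divided by token count."""
    # Proxy for atomic knowledge units: naive sentence segmentation.
    units = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Uniqueness indicator: exact match after lowercasing and whitespace collapse.
    unique = {" ".join(u.lower().split()) for u in units}
    tokens = text.split()  # whitespace split as a stand-in for a real tokenizer
    return len(unique) / max(len(tokens), 1)

padded = "The sky is blue. The sky is blue. The sky is blue."
dense = "The sky is blue. Grass is green. Snow is white."
print(knowledge_density(padded))  # lower: one unique unit spread across 12 tokens
print(knowledge_density(dense))   # higher: three unique units in the same span
```

The exact-match indicator misses paraphrased repetition, which is the common failure mode in generated text; a faithful implementation would need semantic matching between extracted units.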
The metric addresses a gap in standard LLM text evaluation. Coherence scores (does each sentence follow from the previous?) and fluency scores (is the grammar correct?) capture structural properties that can coexist with deep redundancy. A perfectly coherent, fluent article can spend 2000 words elaborating three facts that could be stated in 400 words. KD detects this failure where coherence and fluency scores do not.
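To make that gap concrete with the hypothetical counts above: treating each fact as one unique knowledge unit, the padded article scores KD ≈ 3/2000 = 0.0015 units per word, while the compact version scores 3/400 = 0.0075, a fivefold difference that coherence and fluency metrics cannot register.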
Standard LLM-generated articles score lower on KD than human-written articles for two reasons: RAG retrieves topically redundant documents (similar queries return similar content), and language models trained to maximize next-token probability tend to elaborate and expand rather than compress and advance. Both patterns inflate text length while holding unique knowledge content roughly constant.
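One way to see the retrieval half of this problem is to measure how much of a retrieved set survives near-duplicate filtering. The sketch below is illustrative rather than from the paper: bag-of-words cosine similarity stands in for the embedding similarity a real pipeline would use, and the 0.8 threshold is an arbitrary assumption.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def novel_passages(passages: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep passages that are not near-duplicates of ones already kept."""
    kept_vecs: list[Counter] = []
    kept = []
    for p in passages:
        vec = Counter(p.lower().split())
        if all(cosine(vec, k) < threshold for k in kept_vecs):
            kept_vecs.append(vec)
            kept.append(p)
    return kept

retrieved = [
    "Knowledge density measures unique facts per token.",
    "Knowledge density measures the unique facts per token.",  # near-duplicate
    "Reading cost scales with total text length.",
]
print(len(novel_passages(retrieved)), "of", len(retrieved), "passages are novel")
```

A low survival rate signals that generation will be fed redundant evidence, capping the KD of the output before a single token is written.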
The cognitive science grounding: Bovair and Kieras (1991) established that reading cost scales with total text length while value scales with unique knowledge units. KD makes this ratio explicit and measurable. Readers don't consciously compute KD, but they experience its consequences as engagement vs. fatigue.
Connects to "Why do ChatGPT essays lack evaluative depth despite grammatical strength?": the evaluative dimension missing from LLM academic writing, the ability to judge when an argument has been made and move on, is precisely what KD would detect as a quality failure. Also connects to "Why does AI writing sound generic despite being grammatically correct?": structural coherence (grammar) can coexist with low KD (a rhetorical failure: not advancing information efficiently).
Source: Reasoning by Reflection
Related concepts in this collection
- **Why do ChatGPT essays lack evaluative depth despite grammatical strength?**
  ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap (favoring description over argument) reflects a fundamental limitation in how LLMs approach academic writing.
  Link: KD operationalizes the missing dimension, the ratio of novel information to total content; low KD is the measurable instance of evaluative absence.
- **Why does AI writing sound generic despite being grammatically correct?**
  Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.
  Link: fluency ≠ informational density; KD is the metric that captures the rhetorical side of the gap.
- **Can human judges detect AI writing through lexical patterns?**
  While AI text shows measurable differences from human writing across six lexical dimensions, judges, including experts, fail to identify AI authorship reliably. Why does perceptible quality diverge from measurable reality?
  Link: complementary measurement; lexical diversity tracks vocabulary variety across six dimensions, KD tracks information density per token, and both reveal measurable human-AI gaps invisible to surface evaluation.
- **Why does vanilla RAG produce shallow and redundant results?**
  Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration.
  Link: application; the KD metric was developed to diagnose RAG's redundancy problem, and this note shows the systemic cause of low-KD RAG output.
- **Do LLMs compress concepts more aggressively than humans do?**
  Do language models prioritize statistical compression over semantic nuance when forming conceptual representations, and how does this differ from human category formation? This matters because it may explain why LLMs fail at tasks requiring fine-grained distinctions.
  Link: compression explains the mechanism behind low KD; aggressive statistical compression eliminates the nuanced distinctions that create unique atomic knowledge units.
Original note title: knowledge density — unique atomic knowledge units per token — is a measurable quality metric for generated text that reflects the cognitive cost of reading