INQUIRING LINE

How should temporal metadata indexing differ from semantic indexing?

This explores why retrieving by *when* something happened is a fundamentally different operation than retrieving by *what it's about* — and why systems that treat them the same break.


The corpus is surprisingly unanimous that conflating the two is an architectural mistake, not a tuning problem. The cleanest statement comes from conversational memory, which faces retrieval challenges that static databases never do, among them time-based queries: "what did we discuss Tuesday?" needs explicit metadata indexing, while semantic search answers "what did we say about pricing?" These aren't the same retrieval with different inputs: a date is a structured key you filter on, while a topic is a fuzzy similarity match in embedding space (Why do time-based queries fail in conversational retrieval systems?).
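To make the distinction concrete, here is a minimal sketch of the two operations side by side. The messages, the hand-made two-dimensional "embeddings," and the helper names `by_time` and `by_topic` are all hypothetical; a real system would use a learned embedding model and a proper metadata store.

```python
from bisect import bisect_left
from datetime import datetime
import math

# Hypothetical toy log: (timestamp, text, hand-made 2-d "embedding").
# Axis 0 loosely means "pricing", axis 1 "everything else" (an assumption).
MESSAGES = [
    (datetime(2024, 3, 4, 10, 0), "Kickoff and roadmap", (0.1, 0.9)),
    (datetime(2024, 3, 5, 14, 30), "Pricing tiers for the new plan", (0.9, 0.2)),
    (datetime(2024, 3, 5, 16, 0), "Hiring update", (0.2, 0.8)),
    (datetime(2024, 3, 7, 9, 15), "Discount policy and pricing page", (0.8, 0.3)),
]

# Temporal index: timestamps kept sorted so a range query is an exact lookup.
TIMESTAMPS = [m[0] for m in MESSAGES]


def by_time(start, end):
    """Exact filter: messages with start <= timestamp < end."""
    lo = bisect_left(TIMESTAMPS, start)
    hi = bisect_left(TIMESTAMPS, end)
    return [MESSAGES[i][1] for i in range(lo, hi)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def by_topic(query_vec, k=2):
    """Fuzzy match: top-k messages by cosine similarity to the query."""
    ranked = sorted(MESSAGES, key=lambda m: cosine(query_vec, m[2]), reverse=True)
    return [m[1] for m in ranked[:k]]


# "What did we discuss Tuesday?" (5 March 2024 was a Tuesday)
tuesday = by_time(datetime(2024, 3, 5), datetime(2024, 3, 6))
# "What did we say about pricing?"
pricing = by_topic((1.0, 0.0))
```

The time query returns everything from that day regardless of topic, and the topic query returns pricing messages regardless of day. Neither result can be derived from the other index.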

Why can't semantic indexing just absorb the temporal case? Because embeddings measure association, not the kind of exact relational filtering that "Tuesday" demands. The LOFT benchmark makes this concrete: long-context LLMs can match RAG on semantic retrieval with no special training, but they fail on relational queries that require joins across structured tables, and a temporal lookup is much closer to that kind of structured query than to a similarity match (Can long-context LLMs replace retrieval-augmented generation systems?). The broader diagnosis of RAG failure points the same direction: embeddings measure semantic association rather than task relevance, and a query's *time* dimension is largely independent of its *meaning* dimension. Fold both into one similarity score and the temporal signal tends to get washed out (Where do retrieval systems fail and why?).
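A toy illustration of that washout, under stated assumptions: the candidates, their similarity values, and the 0.2 time weight are invented for the example, and `blended_rank` and `filter_then_rank` are hypothetical helpers rather than any library's API.

```python
# Hypothetical candidates: (label, day_of_week, semantic_similarity_to_query).
# Days are numbered with Monday = 1, so Tuesday = 2.
CANDIDATES = [
    ("A: Tuesday, budget recap", 2, 0.30),
    ("B: Friday, detailed pricing debate", 5, 0.95),
    ("C: Tuesday, short pricing aside", 2, 0.40),
]
QUERY_DAY = 2  # the query asks what was said about pricing on Tuesday


def blended_rank(cands, w_time=0.2):
    """One score for everything: time becomes a soft bonus, not a constraint."""
    def score(c):
        _, day, sim = c
        time_match = 1.0 if day == QUERY_DAY else 0.0
        return (1 - w_time) * sim + w_time * time_match
    return [c[0] for c in sorted(cands, key=score, reverse=True)]


def filter_then_rank(cands):
    """Two tracks: hard filter on the date key, then rank by similarity."""
    on_day = [c for c in cands if c[1] == QUERY_DAY]
    return [c[0] for c in sorted(on_day, key=lambda c: c[2], reverse=True)]


blended = blended_rank(CANDIDATES)
filtered = filter_then_rank(CANDIDATES)
```

With these numbers the blended ranking puts the Friday message first, because a strong topical match outweighs the time bonus. Raising the weight only moves the failure elsewhere; the filter-first version cannot return an off-day result at all.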

Here's the part you might not expect: the difficulty isn't just in the index design, it's in the model itself. In the cited evaluation, ChatGPT handles causal relations well but struggles with temporal ordering, because causal connectives appear explicitly and often in training text, while temporal order is usually implicit and must be inferred (Why do LLMs handle causal reasoning better than temporal reasoning?). So you can't lean on the model to recover time from context the way it recovers meaning, which is the strongest argument for keeping time as *explicit external metadata* rather than hoping the embedding captures it. This compounds with a corpus-level bias: models show "era sensitivity," performing worse on older material because recent data dominates training, so chronology is unevenly represented even before you query (Why do language models struggle with historical legal cases?).

The practical pattern that emerges across the notes is a *hybrid two-track* design: semantic search for topical relevance, plus a separate metadata layer for time, with the two kept in sync. TV-RAG, a long-video RAG system, does exactly this: it ranks retrieved text by temporal proximity and selects key frames by entropy-based sampling rather than uniform stride, keeping visual, audio, and subtitle evidence aligned to the same moments (How can video retrieval handle multiple modalities at different times?). Temporal awareness is treated as a first-class ranking dimension layered *on top of* semantic retrieval, not folded into it.
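The layering can be sketched as two stages: a semantic top-k pass, then a re-rank of those candidates by distance to an anchor timestamp. This is an illustration of the general idea only, not the cited video system's actual implementation; the `Segment` type, the precomputed `sim` values, and both function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float  # segment start time in seconds (explicit time metadata)
    text: str
    sim: float      # assumed precomputed semantic similarity to the query


# Hypothetical transcript segments from one long video.
SEGMENTS = [
    Segment(12.0, "intro", 0.20),
    Segment(95.0, "goal replay commentary", 0.90),
    Segment(305.0, "referee whistle", 0.60),
    Segment(310.0, "goal celebration", 0.85),
    Segment(600.0, "post-match interview about the goal", 0.88),
]


def retrieve(segments, k):
    """Track 1: semantic top-k, ignoring time entirely."""
    return sorted(segments, key=lambda s: s.sim, reverse=True)[:k]


def rerank_by_proximity(candidates, anchor_s):
    """Track 2: order the semantic candidates by distance to the anchor time."""
    return sorted(candidates, key=lambda s: abs(s.start_s - anchor_s))


# Query about the moment around 300 seconds into the video.
top = retrieve(SEGMENTS, k=3)
ordered = [s.text for s in rerank_by_proximity(top, anchor_s=300.0)]
```

Semantic retrieval decides which segments are candidates; the time metadata decides their order. The segment nearest the anchor moves to the front even though it had the lowest similarity of the three.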

The deeper point worth taking away: time isn't just another attribute to index, because AI's relationship to time is genuinely shallow. Token generation is sequential but atemporal: there's no duration, no revision, no felt before-and-after (Does AI text generation unfold through temporal reflection?). That's the real reason temporal indexing has to be structural and external: the model has no intrinsic sense of when, so the index must carry what the model cannot. Semantic indexing leans into what the model is good at; temporal indexing exists to compensate for what it isn't.


Sources (7 notes)

Why do time-based queries fail in conversational retrieval systems?

Conversational memory faces two distinct retrieval challenges absent from static databases: time-based queries ("what did we discuss Tuesday?") requiring metadata indexing, and ambiguous references ("tell me more about that") requiring contextual disambiguation before retrieval.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is over-representation of recent cases in the training corpus, which creates shallower representations of older precedent.

How can video retrieval handle multiple modalities at different times?

TV-RAG ranks retrieved text by temporal proximity and selects key frames via entropy-based sampling, not uniform stride. This keeps visual, audio, and subtitle evidence synchronized at the same moments, enabling video LLMs to reason across modalities without retraining.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
