Why do pretrained LLM representations fail at task-specific relevance ranking?

This explores why an LLM's general-purpose representations — the embeddings it learned during pretraining — don't reliably rank items by how well they fit a specific task, and what the corpus says about closing that gap.

The corpus points to one root cause again and again: pretrained representations measure *semantic association*, not *fitness for the task*. The clearest statement of this is that embeddings encode co-occurrence patterns, so concepts that are semantically close but play different roles look nearly identical to the model. That is fine in a demo, but in production an underspecified query surfaces many wrong-but-associated candidates (see "Do vector embeddings actually measure task relevance?"). Relevance ranking asks "is this the right thing for *this* goal?" while pretraining only ever taught "what tends to appear near what." Those are different questions, and the representation was optimized for the second.
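To make the association-versus-relevance gap concrete, here is a minimal sketch of similarity-based ranking. The three-dimensional vectors, the item names, and the "laptop charger" query are all invented for illustration and do not come from any real embedding model. They are chosen so that the candidate sharing the query's topic, but playing the wrong role, ends up closest.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: hand-picked, not model output).
# The query is meant to stand for something like "laptop charger".
query = [0.9, 0.4, 0.1]
candidates = {
    "laptop (associated, wrong role)": [0.95, 0.3, 0.0],
    "usb-c charger (right role)": [0.5, 0.6, 0.6],
}

# Rank purely by similarity, as an embedding-only retriever would.
ranked = sorted(
    ((name, cosine(query, vec)) for name, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

With these toy numbers the laptop outranks the charger, even though the charger is the item the query is actually after. Nothing in the similarity score encodes the role the item should play.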

A second thread shows the same mismatch from the angle of *priors overriding the present*. When a model has strong learned associations, parametric knowledge from training dominates over the actual query or context in front of it, and textual prompting alone can't override those priors; you have to intervene in the representations themselves (see "Why do language models ignore information in their context?"). That's why ranking quality degrades exactly where you'd least expect it: the model leans on what was statistically common in its corpus rather than what the current task needs. You can see the corpus-bias fingerprint elsewhere too. On a Supreme Court overruling benchmark, models perform worse on historical cases than on modern ones because recent cases were over-represented in training, leaving shallower representations of the older material (see "Why do language models struggle with historical legal cases?").

There's also a structural-capacity story underneath. Pretrained representations capture surface statistics but not deep structure: LLMs systematically misidentify embedded clauses and complex grammatical relations, with errors worsening predictably as structural depth increases (see "Why do large language models fail at complex linguistic tasks?"). The same shape appears in retrieval: long-context models can match RAG on *semantic* relevance without any special training, but collapse on structured, relational queries that require joins across tables (see "Can long-context LLMs replace retrieval-augmented generation systems?"). Semantic similarity is the thing pretraining gives you for free; relational and role-specific judgments are not.

The interesting turn is what the corpus says *fixes* this, and it converges on a single move: train against the actual ranking metric instead of hoping general representations transfer. ReLSum uses downstream relevance scores as RL rewards to produce dense, attribute-focused summaries that beat generic fluent prose on recall and NDCG (see "Can reinforcement learning align summarization with ranking goals?"). Rec-R1 goes further, training LLMs directly on rule-based recommendation metrics like NDCG and Recall as RL rewards, with no SFT distillation from proprietary models (see "Can recommendation metrics train language models directly?"). And Walmart's distilled student cross-encoders actually *outperform* their LLM teachers once trained on a sufficiently large augmented set of teacher-labeled queries (see "Can smaller models outperform their LLM teachers with enough data?"), a striking sign that raw model scale and rich pretrained representations matter less than alignment to the specific ranking objective.
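For readers who have not worked with ranking metrics as training signals, the sketch below shows how NDCG@k turns a ranked list into a single scalar that could serve as a reward. It uses the common linear-gain form of DCG; the function names, document ids, and relevance labels are hypothetical, and this is not the reward code from ReLSum or Rec-R1, only an illustration of the kind of rule-based metric those notes describe.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance_by_id, k=10):
    """NDCG@k for a ranking, given graded relevance labels per item id.

    Items missing from relevance_by_id are treated as irrelevant (grade 0).
    """
    gains = [relevance_by_id.get(item_id, 0.0) for item_id in ranked_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded labels for one query (assumption: made up for the example).
labels = {"a": 3.0, "b": 2.0, "c": 0.0}

reward_good = ndcg_at_k(["a", "b", "c"], labels, k=3)  # ideal ordering
reward_bad = ndcg_at_k(["c", "b", "a"], labels, k=3)   # reversed ordering

print(f"good ranking reward: {reward_good:.3f}")
print(f"bad ranking reward:  {reward_bad:.3f}")
```

The point of the design is that the reward depends only on the final ordering and the task's own relevance labels, so the policy is scored on the task's definition of "relevant" rather than on similarity in the pretrained space.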

The thing you didn't know you wanted to know: the failure isn't really that pretrained representations are *weak* — it's that they're optimized for the wrong target. Association is a stand-in for relevance that quietly breaks the moment a query is underspecified or role-sensitive. Every fix in the corpus works the same way — it stops borrowing the pretrained notion of "similar" and teaches the model the task's own definition of "relevant," whether through RL on the ranking metric or distillation into a smaller model that learns the boundary directly.

Sources 8 notes

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Can reinforcement learning align summarization with ranking goals?

ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
