Do recency-focused prompts and in-context examples work equally well for order recovery?
This explores whether the two fixes for LLMs ignoring the order of a user's actions — telling the model to weight recent items, versus showing it worked examples — are interchangeable, or whether one does more work than the other.
The single corpus note that names both treats them as siblings: LLMs can read preferences out of an interaction history but discard the temporal order by default, and *both* recency-focused prompts and in-context examples "activate latent order-sensitivity," improving ranking without retraining (Why do language models ignore temporal order in ranking?). So the honest answer is that the corpus presents them as two doors into the same room rather than measuring them head-to-head. The more interesting finding is *why* they can both work at all, and why "equally" is probably the wrong question.
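To make the two fixes concrete, here is a minimal sketch of how each might be assembled from the same interaction history. The prompt wording, the function names, and the choice to build the in-context demonstration from the user's own history (earlier items as input, the latest item as the "answer") are all illustrative assumptions, not something the corpus note specifies.

```python
from typing import List


def format_history(items: List[str]) -> str:
    """Number interactions from oldest (1) to most recent (n)."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))


def recency_prompt(history: List[str], candidates: List[str]) -> str:
    """Fix 1 (hypothetical wording): tell the model directly to weight recent items."""
    return (
        "A user interacted with these items, oldest to newest:\n"
        f"{format_history(history)}\n\n"
        f"Note that the most recent item is: {history[-1]}\n"
        "Weight recent interactions more heavily than older ones.\n"
        f"Rank these candidates by what they will choose next: {', '.join(candidates)}"
    )


def in_context_prompt(history: List[str], candidates: List[str]) -> str:
    """Fix 2 (hypothetical wording): show a worked example of sequence -> next item.

    Assumption: the demonstration is built from the user's own history,
    with the last item held out as the example's answer.
    """
    if len(history) < 2:
        raise ValueError("need at least two interactions to build a demonstration")
    prefix, held_out = history[:-1], history[-1]
    return (
        "Example. A user interacted with these items, oldest to newest:\n"
        f"{format_history(prefix)}\n"
        f"The next item they chose was: {held_out}\n\n"
        "Now the same user has this full history, oldest to newest:\n"
        f"{format_history(history)}\n"
        f"Rank these candidates by what they will choose next: {', '.join(candidates)}"
    )
```

The first builder issues a direct command about recency; the second never mentions recency and only demonstrates the pattern of reading a sequence forward. Both feed the model the same history, which is what makes them comparable as treatments.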
The deeper frame comes from a note on what prompting can and can't do: prompt strategies never inject new knowledge, they only reorganize what the model already learned (Can prompt optimization teach models knowledge they lack?). Order-sensitivity isn't being taught here; it's already latent in the model, and both techniques are just different keys for the same lock. That reframes the question: you're not asking which method is *better*, you're asking which one more reliably surfaces a capability the model already has but suppresses by default.
And the corpus is emphatic that prompt techniques are almost never equal across settings: their effectiveness is conditional. One 23-prompt benchmark across 12 LLMs found that rephrasing and background-knowledge prompts help cheap models while step-by-step reasoning actively *hurts* high-performance ones; task structure, not a universal best practice, decides which prompt wins (Do prompt techniques work the same across all LLM tiers?). Another shows the optimal prompt flips with the *type* of question, not just the task category (Why do some questions perform better without step-by-step reasoning?). By that logic, recency instructions and in-context examples almost certainly don't trade evenly across model tiers, history lengths, or domains. The answer is contingent, and a fixed ranking of the two would be the wrong takeaway.
Here's the thing the reader might not expect: there's reason to suspect in-context examples work partly through *form* rather than content. Logically invalid chain-of-thought exemplars matched the performance of valid ones on BIG-Bench Hard, and models trained on systematically irrelevant reasoning traces keep their solution accuracy. The model is picking up the shape of the demonstration as computational scaffolding, not absorbing its literal logic (Does logical validity actually drive chain-of-thought gains?; Do reasoning traces need to be semantically correct?). If that holds for ordering, an in-context example might recover order-sensitivity by showing the model the *pattern* of attending to sequence, while a recency instruction does it by direct command. Those would be two genuinely different mechanisms, and they could diverge sharply when the history gets long or noisy.
That last case matters because order is exactly where LLMs are most fragile: in gradually revealed, multi-turn settings they lock onto premature assumptions and lose 39% of performance on average, with agent mitigations recovering only 15–20% of that loss (Why do language models fail in gradually revealed conversations?). If you want to go deeper, the productive experiment the corpus points toward isn't "which fix is better" but "under what conditions does each one hold up": short vs. long histories, weak vs. strong models, clean vs. interrupted sequences. Equally well? Almost certainly not. Usefully different? That's the line worth following.
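The conditional experiment described above could be laid out as a simple grid. This is only a sketch under stated assumptions: the condition labels mirror the three contrasts named in the text, the `"none"` baseline arm and the function names are hypothetical additions, and the scores are whatever ranking metric the reader measures. No results are implied.

```python
from itertools import product
from typing import Dict, List, Tuple

# Treatment arms: the two fixes, plus an assumed no-fix baseline.
FIXES = ["none", "recency_instruction", "in_context_example"]
# The three contrasts named in the text.
HISTORY_LENGTHS = ["short", "long"]
MODEL_TIERS = ["weak", "strong"]
SEQUENCE_QUALITY = ["clean", "interrupted"]

Setting = Tuple[str, str, str]
Condition = Tuple[str, str, str, str]


def build_conditions() -> List[Condition]:
    """Enumerate every (fix, history, model, sequence) cell of the grid."""
    return list(product(FIXES, HISTORY_LENGTHS, MODEL_TIERS, SEQUENCE_QUALITY))


def compare_fixes(scores: Dict[Condition, float]) -> Dict[Setting, float]:
    """Per setting, return score(recency_instruction) - score(in_context_example).

    `scores` maps each measured condition to a ranking metric supplied by
    the reader. Settings missing either arm are skipped.
    """
    gaps: Dict[Setting, float] = {}
    for setting in product(HISTORY_LENGTHS, MODEL_TIERS, SEQUENCE_QUALITY):
        recency = scores.get(("recency_instruction", *setting))
        in_context = scores.get(("in_context_example", *setting))
        if recency is not None and in_context is not None:
            gaps[setting] = recency - in_context
    return gaps
```

Reporting the gap per setting, rather than one pooled average, keeps the question in the conditional form the text argues for: a sign flip between, say, short and long histories would show the two fixes are not interchangeable.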
Sources (7 notes)
LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Across 200,000+ conversations, all major LLMs show 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.