What makes a problem instance unfamiliar to a language model?

This note explores what actually makes a specific problem hard for a language model. The corpus points to a surprising answer: not how complex the problem is, but how closely it resembles things the model has already seen.

The most direct answer in the collection is also the most counterintuitive: reasoning models don't break down at some complexity threshold; they break at the boundary of *novelty*. A model can carry a long chain of reasoning successfully if it has seen similar instances, and stumble on a logically trivial one it hasn't. The work on instance-level unfamiliarity (see "Do language models fail at reasoning due to complexity or novelty?") frames this sharply: models fit patterns tied to specific instances rather than learning a general algorithm, so 'familiarity', not difficulty, is the real axis along which they succeed or fail.

Why would that be? Another line of work reframes the model as an autoregressive probability machine and predicts failure from the statistics of the target answer (see "Can we predict where language models will fail?"). Tasks whose correct response is *low-probability* under the training distribution, such as counting letters or reciting the alphabet backwards, are systematically hard even though a child could do them. So 'unfamiliar' can mean two overlapping things: an instance unlike the training examples, and an answer that's improbable given everything the model absorbed. Both are about the shape of the training distribution, not the logical hardness of the task.
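To make the "low-probability target" idea concrete, here is a minimal sketch that scores the forward and backward alphabet under a toy character-bigram model. Everything in it is an assumption for illustration: the corpus, the add-one smoothing, and the helper names `train_bigram` and `avg_log_prob` are hypothetical, and a bigram counter is only a stand-in for a real language model, not the method used in the cited work.

```python
import math
from collections import Counter

def train_bigram(corpus: str):
    """Count character bigrams and their left-hand contexts in a toy corpus."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    vocab_size = len(set(corpus))
    return pair_counts, context_counts, vocab_size

def avg_log_prob(text: str, model) -> float:
    """Average per-transition log-probability of `text` under the bigram model."""
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        # Add-one smoothing so unseen transitions get a small nonzero probability.
        p = (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(p)
    return total / (len(text) - 1)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy "training distribution": the alphabet only ever appears in forward order.
corpus = (ALPHABET + " ") * 50
model = train_bigram(corpus)

forward_score = avg_log_prob(ALPHABET, model)
backward_score = avg_log_prob(ALPHABET[::-1], model)

print(f"forward  avg log-prob: {forward_score:.3f}")
print(f"backward avg log-prob: {backward_score:.3f}")
```

Both strings are equally simple as tasks, yet the backward one scores far lower because none of its transitions occur in the toy corpus. That is the sense in which an answer can be improbable without being logically hard.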

The interesting twist is that models seem to *register* this unfamiliarity internally before they act on it. Under out-of-distribution shift, hidden states sparsify in a localized, systematic way that tracks task novelty, a kind of adaptive filtering rather than a breakdown (see "Do language models sparsify their activations under difficult tasks?"). Difficulty itself turns out to be linearly decodable from internal representations before reasoning even begins (see "Can models recognize question difficulty before they reason?"). The model 'knows' it's in unfamiliar territory; it just doesn't always change its behavior accordingly, for example by still overthinking simple questions. That gap, perception without commitment, is its own failure mode.
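"Linearly decodable" has a concrete meaning: a single linear layer (a probe) trained on hidden states can predict the label. The sketch below shows the mechanics on synthetic data only. The "hidden states" are Gaussian noise with a difficulty signal planted along one direction, and the names `make_states`, `train_probe`, and `probe_accuracy` are hypothetical. It demonstrates how a probe is fit and evaluated, and says nothing about any real model's representations.

```python
import math
import random

def make_states(n, direction, rng):
    """Synthetic stand-ins for hidden states: unit Gaussian noise plus a shift
    along `direction` whose sign encodes an assumed easy (0) / hard (1) label."""
    states, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        states.append([rng.gauss(0.0, 1.0) + sign * 3.0 * d for d in direction])
        labels.append(label)
    return states, labels

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(states, labels, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, states, labels):
    """Fraction of states whose predicted label matches the true label."""
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(states, labels)
    )
    return correct / len(states)

rng = random.Random(0)
dim = 8
direction = [1.0 / math.sqrt(dim)] * dim  # unit vector carrying the planted signal

train_x, train_y = make_states(200, direction, rng)
test_x, test_y = make_states(200, direction, rng)

w, b = train_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models the states would be activations captured before the first reasoning token, and high held-out accuracy is what licenses the claim that difficulty is already represented. The probe shows that the information is present; it cannot show that the model uses it, which is exactly the perception-versus-commitment gap described above.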

Unfamiliarity also isn't only about raw novelty. An instance can become unfamiliar when its correct handling depends on something *unstated*: a background precondition the model never brings forward. The frame-problem work shows accuracy jumping from 30% to 85% simply by forcing the model to enumerate the implicit constraints it would otherwise skip (see "Do language models fail at identifying unstated preconditions?"). And an instance can be made unfamiliar by *conflict*: when the prompt supplies information that contradicts strong training-time associations, the parametric prior wins and the model effectively treats the in-context fact as noise (see "Why do language models ignore information in their context?"). Familiarity, in other words, can override the evidence right in front of the model.
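The precondition-enumeration idea can be sketched as a prompt wrapper. The wording below is a hypothetical template written for illustration, not the prompt from the cited work, and `build_precondition_prompt` is an assumed helper name. The accuracy figures above belong to that work and should not be expected from this particular phrasing.

```python
def build_precondition_prompt(question: str) -> str:
    """Wrap a question so the model must list implicit preconditions first.

    The instruction text is an illustrative assumption, not a published prompt.
    """
    return (
        "Before answering, list every unstated precondition or background "
        "constraint that must hold for an answer to be valid.\n"
        "Then check your answer against each item on that list.\n\n"
        f"Question: {question}\n\n"
        "Unstated preconditions:\n1."
    )

# Hypothetical example question with an implicit constraint (the room is locked).
prompt = build_precondition_prompt(
    "The meeting moved to a locked room. How does Ana join it?"
)
print(prompt)
```

Ending the prompt on an open numbered list is a design choice: it nudges the model to produce the enumeration before any answer text, so the constraints are on the page when the answer is generated.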

The thread connecting all of this is that 'unfamiliar' is a statement about the relationship between an instance and the training distribution, not a property of the problem in isolation. That's also why a model can explain a concept perfectly and then fail to apply it to a novel case: the explanation pathway is familiar territory, the execution pathway isn't (see "Can language models understand without actually executing correctly?" and "Can LLMs understand concepts they cannot apply?"). If you want the map of how these distribution-edge failures cluster together, the survey of epistemic failure modes is the doorway (see "How do LLMs fail to know what they seem to understand?"). The takeaway you might not have gone looking for: making a problem 'easier' for a model often means making it *more familiar*, not simpler, and those are not the same lever.

Sources (9 notes)

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.
