What causes autoregressive generation to fail on out-of-corpus item identifiers?
This explores why a left-to-right language model, asked to emit an item ID it never saw in training, produces plausible-looking but nonexistent identifiers — and the corpus points to an architectural cause, not a model-quality one.
This reads the question as being about a specific failure: when an autoregressive model has to name something outside its known vocabulary of items — a product code, a document ID, an entity it wasn't trained on — it doesn't say "I don't have that." It confidently stitches together a valid-shaped but fictional identifier. The corpus suggests the root cause is structural, baked into how token-by-token generation works, rather than a matter of scale or tuning.
The sharpest framing comes from work on constraint satisfaction: autoregressive transformers lack a *retraction primitive* (see "Why does autoregressive generation fail at constraint satisfaction?"). Once a token is emitted, it can't be discarded. Generating a valid out-of-corpus ID is essentially a constraint-satisfaction problem, because the string has to exist in some external set, but the architecture can only ever move forward, committing to each character before it knows whether the whole identifier resolves to anything real. There's no mechanism to backtrack when the partial assignment turns out to be invalid, so the model finishes the token sequence regardless. That same note's lesson, that symbolic solver integration works precisely because it supplies what the architecture lacks, is the tell that this is an architectural gap, not a knowledge gap.
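The contrast between forward-only emission and search with retraction can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any real decoder works: `CATALOG`, `ALPHABET`, and `toy_ranking` are hypothetical stand-ins, and the "model" here is just a fixed character preference order.

```python
from typing import Optional

# Hypothetical external set of real identifiers (assumption for illustration).
CATALOG = {"AB12", "AC34", "BD56"}
ALPHABET = "ABCD123456"
ID_LENGTH = 4


def toy_ranking(prefix: str) -> list[str]:
    """Stand-in for a model's next-character preferences, most likely first.

    Hypothetical: the preference order is fixed and ignores the catalog.
    """
    return list(ALPHABET)


def exists(identifier: str) -> bool:
    """External validity check, only meaningful on a complete identifier."""
    return identifier in CATALOG


def greedy_emit() -> str:
    """Forward-only generation: every character is committed, none retracted."""
    out = ""
    while len(out) < ID_LENGTH:
        out += toy_ranking(out)[0]
    return out


def backtracking_emit(prefix: str = "") -> Optional[str]:
    """Depth-first search that discards a completed string when it fails the check."""
    if len(prefix) == ID_LENGTH:
        return prefix if exists(prefix) else None
    for ch in toy_ranking(prefix):
        result = backtracking_emit(prefix + ch)
        if result is not None:
            return result
        # Falling through to the next character is the retraction step.
    return None


fabricated = greedy_emit()
resolved = backtracking_emit()
```

Here `fabricated` has the right length and alphabet but is not in the catalog, while `resolved` is a catalog member. The only difference between the two functions is that the second may abandon a string it has already completed.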
A second thread explains *why the fabricated ID looks so plausible*: when a model has no grounded in-context answer, parametric knowledge from training takes over. Models generate outputs inconsistent with their actual context because strong prior associations override the information in front of them (see "Why do language models ignore information in their context?"), and textual prompting alone can't suppress those priors. For identifiers, this means the model reconstructs the *statistical shape* of a valid ID (the right prefix, length, character class) from training patterns rather than retrieving a real one: surface form without referent. This connects to the finding that LLMs capture surface patterns but not the deeper rules underneath (see "Why do large language models fail at complex linguistic tasks?"): an ID that matches the format but points nowhere is exactly the failure of surface-over-structure.
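The gap between "matches the format" and "refers to something" can be made concrete with two separate checks. The ID format and the `KNOWN_IDS` set below are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical ID format: the literal prefix "SKU-" followed by five digits.
ID_PATTERN = re.compile(r"SKU-\d{5}")

# Hypothetical set of identifiers that actually exist.
KNOWN_IDS = {"SKU-00417", "SKU-19322"}


def looks_valid(candidate: str) -> bool:
    """Surface check: right prefix, length, and character class."""
    return ID_PATTERN.fullmatch(candidate) is not None


def resolves(candidate: str) -> bool:
    """Referent check: the identifier is a member of the external set."""
    return candidate in KNOWN_IDS


fabricated = "SKU-48213"  # well-formed, but not in KNOWN_IDS
real = "SKU-00417"
```

A fabricated identifier of the kind described above passes `looks_valid` and fails `resolves`. Only the second check needs information that lives outside the string itself.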
The corpus also tells you what *doesn't* fix it, which is often more useful. Throwing more context at the problem doesn't help on its own: long-context models can match retrieval on semantic tasks but fail on structured, relational queries that require exact joins and lookups (see "Can long-context LLMs replace retrieval-augmented generation systems?"). And the model can't simply verify its own way out: self-improvement is formally bounded by a generation-verification gap, where every reliable fix requires something external to validate it (see "What stops large language models from improving themselves?"). An autoregressive decoder cannot check, mid-generation, that the ID it's emitting exists, because a plain forward pass has no access to the external set that defines which IDs are real.
The constructive answers in the collection all route around generation rather than improving it. Grounded refusal, which constrains the model to answer only when it has real evidence and to decline otherwise, is the cleanest defense (see "Can RAG systems refuse to answer without reliable evidence?"), trading coverage for integrity. Confidence-aware decoding may help too: calibrated token-probability uncertainty turns out to be a more reliable signal than external heuristics for deciding when to retrieve (see "Can simple uncertainty estimates beat complex adaptive retrieval?"), which suggests it could also serve as a "should I commit to this" check. The thing you didn't know you wanted to know: the fix for hallucinated identifiers may be less about teaching the model more IDs and more about giving it the one thing autoregression structurally denies it, namely the ability to take a token back.
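A minimal sketch of how the two defenses might combine, under stated assumptions: the candidate ID and its per-token log-probabilities are taken as given, `KNOWN_IDS` is a hypothetical external set, and the geometric-mean confidence score and the `0.8` threshold are illustrative choices rather than anything the cited notes prescribe.

```python
import math

# Hypothetical external set of identifiers that actually exist.
KNOWN_IDS = {"SKU-00417", "SKU-19322"}

REFUSAL = "I don't have an identifier for that."


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability (an assumed, illustrative score)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_refuse(
    candidate_id: str,
    token_logprobs: list[float],
    threshold: float = 0.8,  # illustrative value, not a recommended setting
) -> str:
    """Return the candidate only if it is grounded and confidently generated."""
    # Grounded refusal: decline anything the external set cannot confirm.
    if candidate_id not in KNOWN_IDS:
        return REFUSAL
    # Confidence gate: decline when the model itself was unsure.
    if sequence_confidence(token_logprobs) < threshold:
        return REFUSAL
    return candidate_id
```

The membership test does the real work here, because it is the external validator. The confidence gate is a cheaper, weaker signal that can refuse but can never confirm an identifier.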
Sources (7 notes)
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.