Can explicit rejection responses solve the over-specialization failure mode?

This explores whether giving a model the ability to say 'no' — to reject, discard, or retract an answer rather than always committing to one — can fix the way training collapses models into narrow, over-specialized behavior (shortcut amplification, lost diversity, premature convergence).

The corpus suggests the answer is a qualified yes, with a sharp catch: *where* the rejection lives matters more than whether it exists. The most striking finding is that a plain autoregressive model cannot reject anything it has already produced. Once a token is emitted it is load-bearing; the architecture has no retraction primitive. That is why performance on constraint-satisfaction problems stays capped regardless of model quality, and why bolting on a symbolic solver (which *can* throw away invalid partial assignments) works (see: Why does autoregressive generation fail at constraint satisfaction?). So 'explicit rejection' isn't a knob you turn inside the same forward pass; it has to come from somewhere the architecture doesn't natively provide.
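To make the retraction point concrete, here is a minimal sketch on a toy constraint problem (4-queens). It is an analogy, not a model of a transformer: `commit_forward` stands in for any generator that can only append, and `backtrack` stands in for a solver that can discard invalid partial assignments. The function names and the choice of puzzle are assumptions made for illustration.

```python
from typing import List, Optional


def conflicts(cols: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks any queen already placed."""
    row = len(cols)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(cols))


def commit_forward(n: int) -> Optional[List[int]]:
    """Append-only placement: take the first legal column and never revisit it."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if not conflicts(cols, c)), None)
        if choice is None:
            return None  # stuck, and nothing earlier can be retracted
        cols.append(choice)
    return cols


def backtrack(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Same search order, but a dead end pops the last choice and tries the next one."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if not conflicts(cols, c):
            cols.append(c)
            result = backtrack(n, cols)
            if result is not None:
                return result
            cols.pop()  # the retraction step the append-only version lacks
    return None


print(commit_forward(4))  # None: the first choice was wrong and cannot be undone
print(backtrack(4))       # [1, 3, 0, 2]
```

Both functions use the same constraint check and the same column order. The only difference is the `cols.pop()` line, which is the whole gap between the two.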

When rejection is staged *outside* the generation stream, it does real work against narrowing. A downstream verifier that operates on full token-interaction patterns can reject structural near-misses that the base retrieval pass waves through: rejection as a distinct, dedicated task rather than a property of the generator (see: Can verification separate structural near-misses from topical matches?). The same logic shows up inside the training loop: step-level critique that rejects bad intermediate steps doesn't just raise test accuracy, it actively counteracts 'tail narrowing' and keeps solution diversity alive across self-training iterations (see: Do critique models improve diversity during training itself?). That's the cleanest evidence in the corpus that a rejection signal can fight over-specialization at its root, premature convergence, rather than just cleaning up outputs.

But over-specialization isn't one disease, and that's where a single rejection mechanism strains. In RLVR, the narrowing comes from rare accidental successes on nearly impossible problems being reinforced as high-advantage trajectories, so the model over-specializes into answer repetition and computation-skipping (see: Do overly hard RLVR samples actually harm model capabilities?). Rejecting bad final answers wouldn't catch this, because the failure is in the *process*, not the answer. That is the whole argument for verifying intermediate states: task success on long traces rose from 32% to 87% once intermediate verification was added, checking the reasoning rather than only the result (see: Where do reasoning agents actually fail during long traces?). A 'reject the wrong answer' response is too coarse for a process-level pathology.
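A small worked example shows why a rare accidental success ends up with an outsized training signal. This sketch assumes the group-relative normalization is a plain z-score over the rewards in one rollout group, and that rewards are binary; both are assumptions, since the notes only say "group-relative normalization". The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Z-score each reward against its own rollout group (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Nearly impossible problem: 1 lucky success in 8 rollouts.
rare = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Moderately hard problem: 4 successes in 8 rollouts.
balanced = group_relative_advantages([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

print(round(rare[0], 3))      # 2.646, i.e. sqrt(7)
print(round(balanced[0], 3))  # 1.0
```

Under these assumptions the lone success on the hard problem gets roughly 2.6 times the advantage of a success on the balanced one, regardless of whether the trajectory behind it reasoned soundly or just got lucky. Nothing in the reward inspects the process, which is why rejecting wrong final answers does not touch this failure.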

The most useful reframing is that the productive form of rejection is often retraction-in-flight, not refusal-at-the-end. Reasoning models fail by wandering down invalid paths and by abandoning good ones too early. The fix that works is a decoding-level thought-switching *penalty*, essentially the model rejecting its own impulse to bail, and it improves accuracy with no fine-tuning at all (see: Why do reasoning models abandon promising solution paths?). And whether suppressing variation even counts as a problem is domain-dependent: preference tuning reduces diversity in code (where convergence is correct) but *increases* it in creative writing (see: Does preference tuning always reduce diversity the same way?). So a rejection response that's healthy in one domain is a liability in another.
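As a rough sketch of what a decoding-level thought-switching penalty could look like, the snippet below lowers the logits of tokens that open a new line of thought while the current thought is still short. Everything specific here is an assumption for illustration: the function names, the idea of a fixed set of `switch_token_ids`, and the `penalty` and `min_thought_len` values are hypothetical and not taken from the cited note.

```python
from typing import List, Set


def penalize_switch(
    logits: List[float],
    switch_token_ids: Set[int],
    tokens_in_thought: int,
    penalty: float = 3.0,
    min_thought_len: int = 50,
) -> List[float]:
    """Subtract `penalty` from switch-token logits until the thought is long enough."""
    if tokens_in_thought >= min_thought_len:
        return list(logits)  # the thought has had its chance; switching is free
    return [
        score - penalty if token_id in switch_token_ids else score
        for token_id, score in enumerate(logits)
    ]


def greedy_pick(logits: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of three tokens; token 1 plays the role of "Alternatively,".
logits = [1.0, 2.5, 2.0]
switch_ids = {1}

early = greedy_pick(penalize_switch(logits, switch_ids, tokens_in_thought=10))
late = greedy_pick(penalize_switch(logits, switch_ids, tokens_in_thought=60))
print(early, late)  # 2 1
```

The model's weights are untouched. The penalty only reshapes the next-token scores at decode time, which is consistent with the claim that the gain comes without fine-tuning.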

The thing you didn't know you wanted to know: 'explicit rejection' is really a stand-in for a primitive transformers structurally lack, namely the ability to discard work already done. The corpus doesn't show one rejection trick solving over-specialization. It shows that rejection only helps when it is placed where the architecture cannot supply it: a separate verifier, a critique step inside training, or a decoding penalty against premature path-switching. Refusal as an *output* is almost beside the point; what fixes narrowing is rejection as a *process control*.

Sources (7 notes)

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
