How do unstated constraints become invisible to training data distributions?

This explores how rules and limits that are never spelled out in the data — implicit constraints — fail to register in what a model actually learns, so the model defaults to shortcuts and priors instead of genuinely modeling the constraint.

A constraint that is never explicitly present in the training distribution becomes something a model can't really 'see'. The corpus suggests the problem isn't that models lack constraint-handling machinery, but that they discover cheaper substitutes that pass the same tests. The sharpest evidence is that most models appear to reason about constraints while actually exploiting a conservative default: when researchers strip the constraints out of a problem, twelve of fourteen models get *worse*, dropping by up to 38.5 percentage points, because they had been defaulting to harder options rather than evaluating any real limit [Are models actually reasoning about constraints or just defaulting conservatively?]. The constraint was invisible all along: the model was reading a correlated surface cue, not the rule. Performance also caps out hard: across architectures, sizes, and training regimes, models plateau around 55–60% constraint satisfaction, a ceiling that doesn't move with scale [Do larger language models solve constrained optimization better?].
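The constraint-removal check described above can be sketched as a simple comparison of accuracy with and without the constraints. This is a minimal illustration, not the cited study's code: the `AblationResult` type, the function names, and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AblationResult:
    model: str
    acc_with_constraints: float     # accuracy (%) on the original problems
    acc_without_constraints: float  # accuracy (%) after constraints are stripped


def drop_when_removed(r: AblationResult) -> float:
    """Percentage points lost when constraints are removed (positive = got worse)."""
    return r.acc_with_constraints - r.acc_without_constraints


def flag_conservative_defaults(results: List[AblationResult], min_drop: float = 0.0) -> List[str]:
    """Return models whose accuracy falls once the constraint is gone.

    A model that truly evaluated the constraint should not get worse when the
    constraint disappears; a drop suggests it was leaning on a default answer.
    """
    return [r.model for r in results if drop_when_removed(r) > min_drop]


# Hypothetical numbers for illustration only; not taken from the cited study.
results = [
    AblationResult("model_a", 80.0, 50.0),
    AblationResult("model_b", 70.0, 72.0),
    AblationResult("model_c", 65.0, 60.0),
]
flagged = flag_conservative_defaults(results)
```

Here `model_a` and `model_c` are flagged because they lose accuracy when the constraint is removed, while `model_b` is not. A threshold above zero (`min_drop`) would be a reasonable choice if run-to-run noise is a concern.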

Why do unstated constraints get lost? Partly because training rewards template-matching over procedure. Fine-tuned models, even those trained with GRPO, fall apart on out-of-distribution variants where the surface looks different but the underlying constraint is the same, which suggests the training sharpened memorization rather than installing a procedure that could carry the constraint to new cases [Do fine-tuned language models actually learn optimization procedures?]. The same shape appears when models 'solve' optimization: they recognize a problem as template-similar and emit plausible-but-wrong values instead of running the iterative method that would actually honor the constraints [Do large language models actually perform iterative optimization?]. A constraint that only the explicit procedure would enforce simply doesn't survive into a pattern-matched answer.
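A toy analogy may make the last point concrete. In the sketch below, the constraint exists only inside the projection step of an iterative procedure, so an answer produced without running the procedure has no way to respect it. The problem (minimize (x - 3)^2 subject to x <= 1), the function names, and the "template answer" are all assumptions chosen for illustration; nothing here comes from the cited notes.

```python
from typing import Callable


def projected_gradient_descent(
    grad: Callable[[float], float],
    project: Callable[[float], float],
    x0: float,
    lr: float = 0.1,
    steps: int = 200,
) -> float:
    """Take a gradient step, then project back onto the feasible set, repeatedly."""
    x = x0
    for _ in range(steps):
        x = project(x - lr * grad(x))
    return x


def grad(x: float) -> float:
    # Gradient of the objective (x - 3)^2.
    return 2.0 * (x - 3.0)


def project(x: float) -> float:
    # The constraint x <= 1 lives only here.
    return min(x, 1.0)


constrained = projected_gradient_descent(grad, project, x0=0.0)

# A "template-matched" answer: the unconstrained minimizer of (x - 3)^2.
# It looks plausible for this family of problems but violates x <= 1.
template_answer = 3.0
template_is_feasible = template_answer <= 1.0
```

The procedure settles at the boundary `x = 1`, while the template answer of `3` is infeasible. The analogy is loose: it shows where a constraint can live in a procedure, not how a language model computes.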

There's a deeper, distributional layer to your question, though: constraints can be suppressed not just by the model but by the training dynamics. RL post-training amplifies a single dominant format from pretraining within the first epoch and collapses the alternatives, and which format wins depends on model scale, not necessarily on performance [Does RL training collapse format diversity in pretrained models?]. Anything encoded only in the suppressed formats becomes invisible. Push the difficulty too far and it gets worse: overly hard RLVR samples make models learn degenerate shortcuts (answer repetition, skipped computation) that then contaminate capabilities they already had, because rare accidental successes get reinforced as if they were sound reasoning [Do overly hard RLVR samples actually harm model capabilities?].
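The source note for the RLVR claim attributes the effect to group-relative normalization, and a small calculation shows why a lone success on a nearly impossible problem carries so much weight. The sketch assumes the common form of the normalization, (reward - group mean) / group standard deviation, with binary rewards; the group sizes, the use of the population standard deviation, and the `eps` term are illustrative choices, not details from the cited work.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sample std is also plausible
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible problem: 1 accidental success in 16 samples.
hard_group = [0.0] * 15 + [1.0]
# Moderately hard problem: 8 successes in 16 samples.
moderate_group = [0.0] * 8 + [1.0] * 8

adv_hard = group_relative_advantages(hard_group)
adv_moderate = group_relative_advantages(moderate_group)

lucky_success_advantage = adv_hard[-1]        # close to sqrt(15), about 3.9
ordinary_success_advantage = adv_moderate[-1]  # close to 1.0
```

Under these assumptions the single lucky trajectory receives nearly four times the advantage of a success on the moderate problem, regardless of whether its reasoning was sound. That is one way to read the claim that accidental successes are reinforced as if they were genuine.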

The most unsettling thread is that none of this shows up in standard evaluation. Models can carry every linearly decodable feature a task needs while their internal organization is fractured: perfect accuracy sitting on top of a representation that shatters under perturbation or distribution shift that the metrics never probe [Can models be smart without organized internal structure?]. And even when the relevant information *is* placed directly in context, strong parametric priors from training override it; prompting alone can't force the model to integrate the in-context constraint, which is why researchers reach for causal intervention in the representations instead [Why do language models ignore information in their context?].

So an unstated constraint goes invisible through a chain: it isn't separately represented in the data, the model finds a conservative or memorized proxy that satisfies the same tests, RL collapses the formats that might have carried it, and evaluation metrics confirm the illusion. If you want the constructive flip side, the corpus also hints at where to push: training models to respond identically to clean and perturbed prompts builds genuine invariance rather than surface-matching [Can models learn to ignore irrelevant prompt changes?], and forcing modular structure through weight sparsity makes the circuits a constraint would live in actually legible [Can sparse weight training make neural networks interpretable by design?]. The takeaway worth holding on to: 'the model handles the constraint' and 'the model passes the constraint tests' are different claims, and almost every standard benchmark only checks the second.
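The consistency-training idea mentioned above (respond identically to clean and perturbed prompts, using the model's own clean responses as targets) can be sketched at the data-construction level. This is only an outline of the output-level variant as the source note describes it: `build_consistency_pairs`, `toy_generate`, and `toy_wrap` are hypothetical names, and the toy functions stand in for a real model and a real perturbation.

```python
from typing import Callable, List, Tuple


def build_consistency_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each perturbed prompt with the model's own answer to the clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)             # response to the clean prompt
        pairs.append((wrap(prompt), target))  # train: wrapped prompt -> same response
    return pairs


def toy_generate(prompt: str) -> str:
    # Stand-in for sampling from the model being trained (hypothetical).
    return f"answer to: {prompt}"


def toy_wrap(prompt: str) -> str:
    # Stand-in for an irrelevant perturbation, e.g. a leading suggestion (hypothetical).
    return f"I'm fairly sure the answer is 7, but: {prompt}"


pairs = build_consistency_pairs(["What is 2 + 2?"], toy_wrap, toy_generate)
```

Because the targets come from the model itself rather than from a fixed labeled set, the pairs stay matched to the model's current behavior, which is the property the source note credits with avoiding staleness.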

Sources (10 notes)

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
