INQUIRING LINE

What structural constraints matter more than model depth for CF?

This reads 'CF' as constraint-following / constraint-satisfaction tasks, and asks: once you stop scaling the model, what architectural and environmental factors actually decide whether an LLM can hold a constraint?


Read as constraint following, the kind of task where a model must respect hard rules rather than just produce plausible text, CF is a topic on which the corpus is unusually blunt: depth and scale are mostly not the lever. The anchor finding is that LLMs plateau around 55–60% constraint satisfaction *independent of parameter count, architecture, or training regime*, and reasoning models show no systematic edge (Do larger language models solve constrained optimization better?). That's a ceiling, not a gap you climb by adding layers. So the interesting question becomes: what *structural* thing is missing?

The sharpest answer is a single missing primitive: retraction. Constraint solvers work by building partial assignments and *discarding* the invalid ones; autoregressive generation cannot take back a token once it is emitted, so it has no native way to do the backtracking that constraint solving depends on (Why does autoregressive generation fail at constraint satisfaction?). This is why bolting on a symbolic solver helps: it supplies the one operation the architecture lacks. A softer version of the same idea is that the win comes not from full formalization but from *partial* symbolic augmentation. Enriching natural language with selective structure preserves meaning while adding the scaffolding the model can't generate on its own, and the cited methods report 4–8% accuracy gains from doing so (Why does partial formalization outperform full symbolic logic?).
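To make the retraction point concrete, here is a minimal sketch rather than anything taken from the cited notes. It uses 4-queens as a stand-in constraint problem (an illustrative choice), and the function names `commit_only` and `with_retraction` are hypothetical. The first search treats every choice as final, loosely analogous to greedy append-only decoding; the second is allowed to undo a choice.

```python
from typing import List, Optional


def conflicts(placed: List[int], col: int) -> bool:
    """True if a queen at (len(placed), col) attacks any queen already placed."""
    row = len(placed)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(placed))


def commit_only(n: int) -> Optional[List[int]]:
    """Append-only search: each choice is final, like an emitted token."""
    placed: List[int] = []
    for _ in range(n):
        options = [c for c in range(n) if not conflicts(placed, c)]
        if not options:
            return None  # dead end, and nothing can be taken back
        placed.append(options[0])
    return placed


def with_retraction(n: int, placed: Optional[List[int]] = None) -> Optional[List[int]]:
    """Backtracking search: a failed partial assignment is discarded."""
    placed = [] if placed is None else placed
    if len(placed) == n:
        return list(placed)
    for c in range(n):
        if not conflicts(placed, c):
            placed.append(c)
            solution = with_retraction(n, placed)
            if solution is not None:
                return solution
            placed.pop()  # retraction: the primitive the append-only search lacks
    return None


if __name__ == "__main__":
    print("commit-only:", commit_only(4))
    print("with retraction:", with_retraction(4))
```

On this instance the commit-only search hits a dead end in the third row, while the backtracking search recovers by undoing earlier placements. The analogy is loose: a sampled model is not strictly greedy, and it can write out several attempts in text, but it still cannot remove what it has already emitted.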

A second structural constraint is execution bandwidth, not reasoning quality. Several apparent 'reasoning collapses' turn out to be execution failures: a model that *knows* the algorithm still can't run it step by step at scale when confined to text, and tool-enabled versions solve problems past the supposed cliff (Are reasoning model collapses really failures of reasoning?). The bottleneck is procedural throughput, which more depth doesn't buy you. Relatedly, whether a domain is even tractable depends on its *environment* (immediate scalar metrics, modular structure, fast iteration) far more than on raw model power (What makes a research domain suitable for autonomous optimization?).
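The gap between knowing a procedure and executing it can be sketched with a small example. Tower of Hanoi is used here purely as an illustrative assumption (the notes above do not name a task), and `hanoi_moves` and `trace_length` are hypothetical names. The rule fits in a few lines, yet the trace a text-only model would have to emit grows as 2^n - 1 moves.

```python
from typing import Iterator, Tuple


def hanoi_moves(n: int, src: str = "A", dst: str = "C", via: str = "B") -> Iterator[Tuple[str, str]]:
    """Yield the moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, via, dst)  # clear the way
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, via, dst, src)  # restack on top of it


def trace_length(n: int) -> int:
    """Number of moves that must be written out for n disks."""
    return sum(1 for _ in hanoi_moves(n))


if __name__ == "__main__":
    for disks in (3, 10, 15):
        print(disks, "disks:", trace_length(disks), "moves")
```

Handing the procedure to an interpreter, as a tool-enabled model can, moves the cost of that trace out of the token stream, which is one way to read the execution-bandwidth claim.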

There's also a measurement trap worth knowing about, because it makes the depth question look answered when it isn't. Most models appear to 'reason about constraints' but are really exploiting a conservative default: twelve of fourteen actually get *worse* when constraints are removed, by up to 38.5 percentage points, meaning they were defaulting to the hard option, not evaluating the rule (Are models actually reasoning about constraints or just defaulting conservatively?). And identical accuracy scores can sit on top of fractured internal representations, so a benchmark win tells you little about whether the structure that holds constraints is actually there (Can models be smart without organized internal structure?).
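One way to check for the conservative-default pattern is a constraint-removal ablation: score each model with and without the constraints and flag any model whose accuracy falls when they are removed. The sketch below assumes per-model accuracies are already available as percentages; the function name, the `tolerance` parameter, and the numbers in the usage example are made-up placeholders, not results from the cited note.

```python
from typing import Dict, List, Tuple


def flag_conservative_defaults(
    scores: Dict[str, Tuple[float, float]],
    tolerance: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (model, drop) pairs for models that do worse once constraints are removed.

    scores maps a model name to (accuracy_with_constraints,
    accuracy_without_constraints), both in percent. The drop is reported in
    percentage points, largest first.
    """
    flagged = []
    for model, (with_constraints, without_constraints) in scores.items():
        drop = with_constraints - without_constraints
        if drop > tolerance:
            flagged.append((model, round(drop, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    example = {
        "model_a": (80.0, 50.0),
        "model_b": (70.0, 75.0),
        "model_c": (60.0, 55.0),
    }
    print(flag_conservative_defaults(example))
```

A model that was genuinely evaluating the rule should do at least as well once the rule is lifted, so a positive drop is the warning sign.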

The one place depth genuinely matters is the counterpoint that proves the rule: at sub-billion scale, deep-and-thin beats wide, because constraint-style composition happens *through* layers (Does depth matter more than width for tiny language models?). But notice that this is depth over *width*, a shape choice, not depth over the structural levers above. The corpus's quiet message is that for constraint following, the things that matter more than how big or deep your model is are architectural primitives (retraction), an execution channel (tools), the right symbolic scaffolding, and a domain whose structure is legible in the first place.
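To see why depth versus width is a shape choice rather than a size choice, the following sketch compares two layouts under the common rough estimate of about 12·d² weights per transformer block (4·d² for attention projections plus 8·d² for an MLP with 4x expansion). That estimate ignores embeddings, biases, and normalization, and it is a generic approximation rather than the configuration used in the cited note; the layer counts and widths below are illustrative assumptions.

```python
def block_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding weights: 12 * d_model^2 per layer.

    Assumes standard attention (4 * d^2) and a 4x-expansion MLP (8 * d^2).
    """
    return 12 * n_layers * d_model * d_model


if __name__ == "__main__":
    deep_thin = block_params(n_layers=30, d_model=512)
    shallow_wide = block_params(n_layers=12, d_model=768)
    print(f"deep-and-thin  (30 x 512): {deep_thin:,}")
    print(f"shallow-wide   (12 x 768): {shallow_wide:,}")
```

Both layouts land in the same rough parameter range, so the comparison in the note is about how a fixed budget is arranged, not about adding capacity.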


Sources (8 notes)

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4–8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

What makes a research domain suitable for autonomous optimization?

Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
