What power-law scaling patterns emerge when consistency models are trained at scale?

This asks specifically about consistency models — the distillation technique that lets diffusion-style generators produce samples in one or a few steps — and how their training behaves under power-law scaling; the collection doesn't actually cover consistency models, so this answer reframes around the scaling-law material it does hold.

Straight answer first: there's no note in this collection on consistency models as such — no work on consistency distillation, few-step generation, or how that family of generative models behaves as you pour in more compute. If that exact topic is what you came for, the corpus can't supply it, and it's better to say so than to dress up adjacent papers as if they answered it. What the collection does hold, richly, is the broader question your phrasing sits inside: when you train at scale, what kind of curve do you actually get — and where do the clean power laws break?

The headline pattern is that the tidy 'bigger is better' power law is more fragile than the original scaling-law story suggested. "Does depth matter more than width for tiny language models?" is the cleanest example: at small scale the shape of the model (deep-and-thin vs. balanced) matters more than the Kaplan laws predict, so the same parameter budget buys different accuracy depending on how you spend it. Scaling isn't one curve; it's several, and they don't always move together. "Do pretraining and fine-tuning scale independently in language models?" makes that explicit: pretraining scale and fine-tuning scale ride independent curves, one improving factual knowledge in the lower layers, the other improving behavioral helpfulness in the upper ones.
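For reference, the parameter-count law that the "Kaplan laws" shorthand points to is commonly written in the form below. This is the standard textbook statement rather than anything taken from the notes themselves: $L$ is test loss, $N$ is the non-embedding parameter count, and $N_c$ and $\alpha_N$ are fitted constants.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
```

Read this way, the depth-versus-width finding says that at small scale $N$ alone does not pin down $L$: two models with the same $N$ but different shapes land at different points, which a single-variable law of this form cannot express.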

The more interesting material is about where the power law flattens into a wall. "Do larger language models solve constrained optimization better?" finds models converging to ~55–60% constraint satisfaction regardless of size or training regime, and "Why does autoregressive generation fail at constraint satisfaction?" explains why it's a ceiling rather than a gap: autoregressive generation can't retract a token it already emitted, so no amount of scale buys the capability the task needs. "Do large language models actually perform iterative optimization?" shows the same shape: pattern-matching that looks like computation but doesn't improve with scale. The lesson that travels to any 'train at scale' question: a power law tells you about the regime where the bottleneck is capacity, and goes silent the moment the bottleneck becomes architectural.
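The difference between a capacity-limited regime and an architectural ceiling can be sketched numerically. The snippet below is a toy illustration, not a fit to any data from the notes: the constants `a`, `alpha`, and `floor` are made up, with `floor=0.42` chosen only to echo a roughly 58% satisfaction ceiling. It compares the local log-log slope of a pure power law with that of the same law sitting on an irreducible floor.

```python
import numpy as np

def power_law_error(n, a=5.0, alpha=0.3):
    """Error that keeps falling as a pure power law in scale n (hypothetical constants)."""
    return a * n ** (-alpha)

def ceilinged_error(n, a=5.0, alpha=0.3, floor=0.42):
    """Same power law plus an irreducible floor that scale cannot remove (assumed value)."""
    return floor + a * n ** (-alpha)

def local_exponent(n, err):
    """Slope of log(err) against log(n) between neighboring scale points."""
    return np.diff(np.log(err)) / np.diff(np.log(n))

# Seven model scales from 1e6 to 1e12, one per decade.
n = np.logspace(6, 12, 7)

pure = local_exponent(n, power_law_error(n))
capped = local_exponent(n, ceilinged_error(n))

print("pure power law slopes:", np.round(pure, 4))
print("ceilinged slopes:     ", np.round(capped, 4))
```

In the pure case the slope stays at `-alpha` across every decade. With the floor added, the measured slope is already much shallower at the smallest scales and drifts toward zero as scale grows, which is the signature of a ceiling rather than of a small exponent.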

There's also a strand on inventing new axes to scale along when the obvious one saturates. "Do search steps follow the same scaling rules as reasoning tokens?" finds that search steps obey the same diminishing-returns curve as reasoning tokens, a fresh inference-time axis beyond model size. "Can reasoning systems scale wider instead of only deeper?" scales width instead of depth, and "Can latent thought vectors scale language models beyond parameters?" adds a latent dimension that scales independently of parameter count. If you're chasing scaling behavior in any model family, this is the move to watch for: the cleanest power laws often show up not on the parameter axis everyone measures, but on a new axis someone bolted on once the old one flattened.

So the thing worth taking away that you might not have gone looking for: in this collection, the recurring story about training at scale isn't 'find the exponent'. It's 'find which axis the exponent lives on, and notice when you've hit a ceiling that no exponent can climb.' If you specifically need consistency-model scaling, work on that topic would have to be added to the library.

Sources (8 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about scaling laws in generative models. The question remains open: what power-law scaling patterns emerge when consistency models (or any fast-inference generative family) are trained at scale?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints:
• Clean power laws hold only when capacity is the bottleneck; architectural ceilings (e.g., ~55–60% constraint satisfaction in autoregressive models, arXiv:2603.23004, ~2026) erase further scaling gains regardless of model size.
• Scaling is not one curve: pretraining and fine-tuning ride independent axes with different exponents; model shape (depth vs. width) matters more at sub-billion scale than Kaplan laws predicted (arXiv:2310.12962, ~2023; arXiv:2402.14905, ~2024).
• New scaling axes emerge once parameter count saturates: search/reasoning steps follow the same diminishing-returns law as tokens (arXiv:2506.18959, ~2026); latent-space width and recurrent depth scale independently of parameters (arXiv:2502.01567, 2502.05171, ~2025).
• Autoregressive generation cannot retract emitted tokens, creating hard constraints that no amount of scale overcomes (arXiv:2603.23004, ~2026).

Anchor papers (verify; mind their dates):
- arXiv:2603.23004 (2026): Large Language Model Reasoning Failures — hard ceilings under autoregressive regime.
- arXiv:2506.18959 (2026): From Web Search towards Agentic Deep Research — search steps as independent scaling axis.
- arXiv:2502.01567 (2025): Scalable Language Models with Posterior Inference of Latent Thought Vectors — latent-space scaling orthogonal to parameters.
- arXiv:2402.14905 (2024): MobileLLM — sub-billion parameter scaling regimes differ from large-scale.

Your task:
(1) RE-TEST the architectural ceiling claim: have post-2026 consistency distillation methods or retrieval-augmented generation systems relaxed the 55–60% constraint-satisfaction wall? Does inference-time search or reranking (not just scale) overcome the autoregressive retraction problem? If the ceiling has been broken, cite the work that broke it; state plainly where it still holds.
(2) Surface contradicting work from the last 6 months: any papers arguing that scale alone *does* overcome architectural constraints, or that consistency models follow fundamentally different scaling laws than autoregressive models?
(3) Propose two research questions assuming the regime has shifted: (a) Do consistency models' few-step generation and non-autoregressive decoding exhibit power laws on *different* axes (latency, sampling noise, distillation data efficiency) than parameter count? (b) Can latent-space scaling and consistency distillation be composed to achieve faster scaling per inference-time compute than either alone?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
