What power-law scaling patterns emerge when consistency models are trained at scale?
This asks specifically about consistency models (generative models, trained either by distilling a diffusion model or from scratch, that produce samples in one or a few steps) and how their training behaves under power-law scaling. The collection doesn't actually cover consistency models, so this answer reframes around the scaling-law material it does hold.
Straight answer first: there's no note in this collection on consistency models as such — no work on consistency distillation, few-step generation, or how that family of generative models behaves as you pour in more compute. If that exact topic is what you came for, the corpus can't supply it, and it's better to say so than to dress up adjacent papers as if they answered it. What the collection does hold, richly, is the broader question your phrasing sits inside: when you train at scale, what kind of curve do you actually get — and where do the clean power laws break?
The headline pattern is that the tidy 'bigger is better' power law is more fragile than the original scaling-law story suggested. The note "Does depth matter more than width for tiny language models?" is the cleanest example: at the 125M–350M parameter scale, the shape of the model (deep-and-thin vs. balanced) matters more than the Kaplan laws predict, with deep-and-thin designs gaining 2.7–4.3% accuracy, so the same parameter budget buys different accuracy depending on how you spend it. Scaling isn't one curve; it's several, and they don't always move together. "Do pretraining and fine-tuning scale independently in language models?" makes that explicit: pretraining scale and fine-tuning scale ride independent curves, one improving factual knowledge in the lower layers, the other improving behavioral helpfulness in the upper ones.
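To make "same parameter budget, different shape" concrete, here is a minimal sketch. It assumes the common rule of thumb that a standard transformer stack holds roughly 12 · n_layers · d_model² non-embedding parameters. That approximation, the function name, and both example shapes are illustrative assumptions, not configurations taken from the MobileLLM note.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a standard transformer stack.

    Assumes 4*d^2 attention weights plus 8*d^2 MLP weights (4x expansion)
    per layer; embeddings, biases, and norms are ignored.
    """
    return 12 * n_layers * d_model ** 2


# Two hypothetical shapes chosen to land on the same budget.
balanced = approx_block_params(n_layers=12, d_model=768)
deep_thin = approx_block_params(n_layers=48, d_model=384)

print(f"balanced : {balanced:,} parameters")
print(f"deep-thin: {deep_thin:,} parameters")
```

Both calls return the same count, so a scaling law that sees only the parameter total treats the two models as interchangeable. The note's claim is that at small scale they are not.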
The more interesting material is about where the power law flattens into a wall. "Do larger language models solve constrained optimization better?" finds models converging to ~55–60% constraint satisfaction regardless of size, architecture, or training regime, and "Why does autoregressive generation fail at constraint satisfaction?" explains why it's a ceiling rather than a gap: autoregressive generation can't retract a token it already emitted, whereas constraint solvers depend on discarding invalid partial assignments, so no amount of scale buys the capability the task needs. "Do large language models actually perform iterative optimization?" shows the same shape: pattern-matching that looks like computation but doesn't improve with scale. The lesson that travels to any 'train at scale' question: a power law describes the regime where the bottleneck is capacity, and goes silent the moment the bottleneck becomes architectural.
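One way to see the difference between a capacity-limited curve and a ceiling is to look at the exponent implied by each pair of neighbouring measurements. The sketch below is an assumption-laden illustration: both curves are synthetic, the function names are hypothetical, and the 0.42 error floor is a made-up value that only loosely mirrors the ~55–60% satisfaction plateau described above.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit error ~ a * size**(-alpha) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


def local_exponents(sizes, errors):
    """Exponent implied by each pair of neighbouring points."""
    return np.log(errors[:-1] / errors[1:]) / np.log(sizes[1:] / sizes[:-1])


sizes = np.array([1e7, 1e8, 1e9, 1e10, 1e11])

# Synthetic curves for illustration only; not measurements from any note.
capacity_limited = 3.0 * sizes ** -0.1                 # clean power law
ceiling_limited = np.maximum(capacity_limited, 0.42)   # same law with a hard floor

a, alpha = fit_power_law(sizes, capacity_limited)
print(f"clean fit: a={a:.2f}, alpha={alpha:.3f}")
print("local exponents, clean  :", np.round(local_exponents(sizes, capacity_limited), 3))
print("local exponents, ceiling:", np.round(local_exponents(sizes, ceiling_limited), 3))
```

On the clean curve every local exponent matches the fitted one. On the floored curve the local exponent shrinks and then drops to zero, which is the signature of a bottleneck that more parameters do not address.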
There's also a strand on inventing new axes to scale along when the obvious one saturates. "Do search steps follow the same scaling rules as reasoning tokens?" finds that search steps show the same diminishing-returns pattern as reasoning tokens, which makes them a fresh inference-time axis beyond model size. "Can reasoning systems scale wider instead of only deeper?" scales width (parallel sampled trajectories) instead of depth, and "Can latent thought vectors scale language models beyond parameters?" adds a latent dimension that scales independently of parameter count. If you're chasing scaling behavior in any model family, this is the move to watch for: the cleanest power laws often show up not on the parameter axis everyone measures, but on a new axis someone bolted on once the old one flattened.
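Checking for diminishing returns along any new axis comes down to asking how much each doubling of budget buys. The following is a minimal sketch of that check; the function names and the step/accuracy numbers are made up for illustration and do not come from the notes.

```python
import math


def gain_per_doubling(budgets, scores):
    """Score improvement per doubling of budget between neighbouring measurements."""
    gains = []
    for i in range(len(budgets) - 1):
        doublings = math.log2(budgets[i + 1] / budgets[i])
        gains.append((scores[i + 1] - scores[i]) / doublings)
    return gains


def has_diminishing_returns(gains):
    """True if each doubling buys strictly less than the previous one."""
    return all(later < earlier for earlier, later in zip(gains, gains[1:]))


# Hypothetical measurements: accuracy at increasing numbers of search steps.
search_steps = [1, 2, 4, 8, 16]
accuracy = [0.30, 0.42, 0.50, 0.55, 0.58]

gains = gain_per_doubling(search_steps, accuracy)
print("gain per doubling:", [round(g, 3) for g in gains])
print("diminishing returns:", has_diminishing_returns(gains))
```

The same two functions apply unchanged to reasoning tokens, sampled trajectories, or latent size, which makes it easy to compare how quickly each axis saturates.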
So the thing worth taking away that you might not have gone looking for: in this collection, the recurring story about training at scale isn't 'find the exponent'; it's 'find which axis the exponent lives on, and notice when you've hit a ceiling that no exponent can climb.' If you specifically need consistency-model scaling, a paper on that topic would have to be added to the library.
Sources (8 notes)
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.