Why do cascade pipelines fail to capture global motion structure?

This explores why video generation systems that build a clip in stages — make keyframes, then interpolate between them — produce motion that looks locally fine but globally incoherent.

The corpus has a direct answer in Lumiere's design: a cascade stitches together fragments that were each generated without knowledge of the whole trajectory, so at no point in the process does the model commit to a single coherent motion path. The note "Can generating entire videos at once beat keyframe interpolation?" makes the contrast explicit: when the entire temporal duration is processed in one space-time pass rather than assembled from independently produced segments, global coherence emerges as a property of the whole rather than something you hope survives the seams.

The deeper issue is that the cascade treats time as a series of local interpolation problems. Between any two keyframes, the in-between frames are plausible; across the full clip, the trajectory wanders, because nothing in the architecture is responsible for the long-range relationship between distant moments. This is the same blind spot that shows up when video models are asked to actually reason about time. The note "Can video language models actually understand time?" finds that these systems excel at recognizing what's in a frame but lack mechanisms for modeling how frames relate over longer spans: causality, progression, the shape of an event. Motion structure is exactly that long-range relationship, and a pipeline built from local fills has no organ for it.
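The local-versus-global gap can be made concrete with a toy 1-D sketch. It is not Lumiere or any real video model: an object is meant to move at constant speed (position equals frame index), the keyframe positions are hypothetical values that are each slightly off on their own, and the gaps are filled by plain linear interpolation. The function names and numbers are assumptions chosen for illustration.

```python
def interpolate_keyframes(keyframes, gap):
    """Linearly fill `gap` frames between each pair of consecutive keyframes."""
    frames = []
    for start, end in zip(keyframes, keyframes[1:]):
        for i in range(gap):
            frames.append(start + (end - start) * i / gap)
    frames.append(keyframes[-1])
    return frames


def second_differences(frames):
    """Discrete acceleration at every interior frame (index j maps to frame j + 1)."""
    return [frames[i + 1] - 2 * frames[i] + frames[i - 1]
            for i in range(1, len(frames) - 1)]


gap = 4
# Hypothetical keyframes: the intended path is x = t (0, 4, 8, 12, 16),
# but each keyframe was produced independently and lands a little off.
keyframes = [0.0, 5.0, 7.0, 13.0, 16.0]

frames = interpolate_keyframes(keyframes, gap)
acc = second_differences(frames)

# Local check: inside each interpolated segment the motion is perfectly smooth.
within = max(abs(a) for j, a in enumerate(acc) if (j + 1) % gap != 0)
# Seam check: velocity jumps wherever two independently filled segments meet.
at_seams = max(abs(a) for j, a in enumerate(acc) if (j + 1) % gap == 0)
# Global check: how far the stitched path drifts from the intended x = t.
drift = max(abs(x - t) for t, x in enumerate(frames))

print(within, at_seams, drift)  # 0.0 1.0 1.0
```

Every per-segment smoothness check passes, yet the stitched path changes speed at each seam and drifts from the intended line. Nothing inside the interpolation step can see or correct that, which is the sense in which the cascade has no mechanism for the long-range relationship.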

There's a useful cross-domain echo in how reasoning systems handle the same local-vs-global tension. The note "Does step-level confidence outperform global averaging for trace filtering?" shows the inverse failure: there, global averaging masks local breakdowns, so the finer-grained local signal wins. Putting the two side by side sharpens the lesson. Coherence isn't always about going more local or more global; it's about whether your supervision operates at the level where the property you care about lives. Video cascades supervise locally (does this interpolation look smooth?) while the property that matters, the motion's overall shape, lives globally, so it goes unmeasured and undefended.
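A small sketch of that inverse failure, under stated assumptions: the per-step confidence values are made up, and scoring a trace by its weakest sliding-window mean is one plausible way to read "step-level confidence", not necessarily the scoring used in the cited note.

```python
def global_confidence(step_conf):
    """Average confidence over the whole trace."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window):
    """Mean confidence of the weakest contiguous window of steps."""
    means = [sum(step_conf[i:i + window]) / window
             for i in range(len(step_conf) - window + 1)]
    return min(means)


# Hypothetical trace: mostly confident, with a two-step breakdown in the middle.
trace = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
threshold = 0.5  # assumed filtering cutoff

keep_by_global = global_confidence(trace) >= threshold
keep_by_local = lowest_window_confidence(trace, window=2) >= threshold

print(keep_by_global, keep_by_local)  # True False
```

The global average keeps the trace because the breakdown is diluted by the confident steps around it; the windowed score rejects it. The video case is the mirror image: the check is local and the failure is global.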

The quiet takeaway is that "divide and stitch" is a bet that the whole equals the sum of well-made parts. For motion it doesn't, because the connective tissue between parts is itself the thing you care about. The fix that works is structural, not incremental: process the full trajectory at once so the model can't avoid committing to one continuous motion. This mirrors a pattern elsewhere in the corpus, where architectural inductive bias, not more scale, is what helps on structure-sensitive tasks (see "Can explicit stack tracking improve how transformers learn recursive syntax?").

Sources (4 notes)

Can generating entire videos at once beat keyframe interpolation?

Lumiere's Space-Time U-Net generates entire video clips in a single pass via spatial-temporal down/up-sampling, achieving coherent motion where keyframe-plus-interpolation cascades fail. The key insight: global coherence emerges from processing the whole temporal trajectory at once, not from stitching independently-generated fragments.

Can video language models actually understand time?

Video LLMs struggle with long-term dependencies and abstract temporal concepts like causality and event progression. The architecture excels at spatial-frame recognition but lacks mechanisms to model relationships between frames over time.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.

Can explicit stack tracking improve how transformers learn recursive syntax?

Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.
