Lumiere: A Space-Time Diffusion Model for Video Generation
We introduce Lumiere – a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by tem-
Introduction. Generative models for images have seen tremendous progress in recent years. State-of-the-art text-to-image (T2I) diffusion models are now capable of synthesizing highresolution photo-realistic images that adhere to complex text prompts (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022), and allow a wide range of image editing capabilities (Po et al., 2023) and other downstream uses. However, training large-scale text-to-video (T2V) foundation models remains an open challenge due to the added complexities that motion introduces. Not only are we sensitive to errors in modeling natural motion, but the added temporal data dimension introduces significant challenges in terms of memory and compute requirements, as well as the scale of the required training data to learn this more complex distribution. As a result, while T2V models are rapidly improving, existing models are still restricted in terms of video duration, overall visual quality, and the degree of realistic motion that they can generate.
Discussion / Conclusion. We presented a new text-to-video generation framework, utilizing a pre-trained text-to-image diffusion model. We identified an inherent limitation in learning globally-coherent motion in the prevalent approach of first generating distant keyframes and subsequently interpolating them using a cascade of temporal super-resolution models. To tackle this challenge, we introduced a space-time U-Net architecture design that directly generates full-frame-rate video clips, by incorporating both spatial, and temporal down- and upsampling modules. We demonstrated state-of-the-art generation results, and showed the applicability of our approach for a wide range of applications, including image-to-video, video inapainting, and stylized generation. As for limitations, our method is not designed to generate videos that consist of multiple shots, or that involve transitions between scenes. Generating such content remains an open challenge for future research. Furthermore, we established our model on top of a T2I model that operates in the pixel space, and thus involves a spatial super resolution module to produce high resolution images.