Google’s New AI Video Generator Looks Incredible

Jan 25, 2024

Matt Growcoot

Google has announced Lumiere: an AI video generator that looks to be one of the most advanced text-to-video models yet.

The name Lumiere is seemingly a nod to the Lumière brothers, who are credited with putting on the first-ever cinema showing in 1895. Just as motion pictures were cutting-edge technology at the end of the 19th century, the Lumiere name is once more being associated with something new and original.

The demo of Lumiere that Google put out focuses firmly on animals. The model can generate a scene using just text; much the same way AI image generators work, the user can dream up any scenario they would like to see a short video clip of.

However, the user can also use an image as a prompt. Google provided multiple examples, including some that are real photos, such as Joe Rosenthal’s iconic Raising the Flag photo. The prompt “Soldiers raising the united states flag on a windy day” saw one of the 20th century’s most recognizable photos suddenly come to life as the soldiers struggle with a flag that is being whipped by gusts.

Joe Rosenthal’s *Raising the Flag*.

Also in Lumiere is a “Video Stylization” setting, which allows users to upload a source video and then ask the generative AI model to change various elements. For example, a person running may be suddenly turned into a toy made of colorful bricks.

Another feature Google showed off is “Cinemagraphs”, where just a section of an image is animated while the rest stays still. “Video Inpainting” is included too; it involves masking part of the image so that section can be changed as the user wishes.

Space-Time Diffusion Model

Lumiere is powered by a “Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model.”

This difficult-to-understand concept is apparently in contrast to existing video models which “synthesize distant keyframes followed by temporal super-resolution — an approach that inherently makes global temporal consistency difficult to achieve.”

As Ars Technica notes, essentially it means Lumiere can process the elements within the video and how they move simultaneously. Other text-to-video models put things together in small parts or frames.
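To make the contrast concrete, here is a toy sketch of which frames each approach produces in its first stage. It is only an illustration, not Google's code: the function names are made up, the keyframe spacing of 16 is a hypothetical value chosen for the example, and the 80-frame clip length is borrowed from the training figures in Google's paper.

```python
def keyframe_cascade(num_frames: int, stride: int) -> tuple[list[int], list[int]]:
    """Split frame indices the way a keyframe-based pipeline would.

    Returns the sparse keyframes generated first, and the in-between
    frames that a later temporal super-resolution stage has to fill in.
    """
    keyframes = list(range(0, num_frames, stride))
    keyframe_set = set(keyframes)
    filled_later = [i for i in range(num_frames) if i not in keyframe_set]
    return keyframes, filled_later


def single_pass(num_frames: int) -> list[int]:
    """Frame indices produced when the whole clip is generated at once."""
    return list(range(num_frames))


NUM_FRAMES = 80  # clip length quoted in the paper
STRIDE = 16      # hypothetical keyframe spacing, for illustration only

keyframes, filled_later = keyframe_cascade(NUM_FRAMES, STRIDE)
print("keyframes generated first:", keyframes)
print("frames filled in afterwards:", len(filled_later))
print("frames generated in a single pass:", len(single_pass(NUM_FRAMES)))
```

In the keyframe version, most of the clip is filled in after the fact, which is where the consistency problems Google describes can creep in.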

Lumiere certainly looks to be an upgrade from the Imagen model Google touted in 2022, but there is no word on if or when the AI video tool will be deployed.

Google is not clear on what training data was used for Lumiere, only saying in its paper that, “We train our T2V [text to video] model on a dataset containing 30M videos along with their text caption. [sic] The videos are 80 frames long at 16 fps (5 seconds). The base model is trained at 128×128.”
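The numbers in that quote can be sanity-checked with a quick calculation. The sketch below is a back-of-the-envelope estimate only: the variable and function names are made up for this example, and the total-footage figure assumes every one of the 30M videos is exactly 80 frames long, as the quote suggests.

```python
# Figures quoted from Google's paper.
NUM_VIDEOS = 30_000_000
FRAMES_PER_CLIP = 80
FPS = 16


def clip_seconds(frames: int, fps: int) -> float:
    """Duration of one clip in seconds."""
    return frames / fps


def total_hours(num_videos: int, frames: int, fps: int) -> float:
    """Total footage in hours, assuming every clip has the same length."""
    return num_videos * clip_seconds(frames, fps) / 3600


print(clip_seconds(FRAMES_PER_CLIP, FPS))                    # 5.0
print(round(total_hours(NUM_VIDEOS, FRAMES_PER_CLIP, FPS)))  # 41667
```

Under that assumption, the training set works out to roughly 41,667 hours of video, or a little under five years of continuous footage.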

Image credits: Google