Picsart’s artificial intelligence research team (PAIR) has built a new generative model that can create entirely new video content from only text descriptions.
The technology, a form of text-to-video generative artificial intelligence (AI), was demonstrated on Twitter and has been released as open source on GitHub and Hugging Face. The team behind it has also published a research paper describing the methodology.
“Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain,” the researchers explain.
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
— AK (@_akhaliq) March 24, 2023
The main problem with text-to-video generative AI right now is that while the general idea of what is being created is consistent, its presentation is not. The main subjects often look slightly different from frame to frame and the background is also inconsistent, which makes everything in the finished video appear to be constantly in motion and robs it of realism. The PAIR team set out to address this.
The researchers explain that their key modification compared to other attempts at text-to-video generation involves “enriching the latent codes of the generated frames with motion dynamics,” which allows them to keep the global scene and background consistent over time. They have also managed to better preserve the context, appearance, and identity of the foreground subject compared to many other generative video systems.
“Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation,” the researchers say.
“As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.”
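To give a rough feel for what “enriching the latent codes of the generated frames with motion dynamics” could mean in practice, here is a toy Python sketch. It is not the researchers' code: the function name `motion_latents`, the `step` parameter, and the use of a wrap-around shift are all assumptions made for illustration. The idea it shows is that if every frame starts from a shifted copy of the same latent code, rather than from fresh random noise, the frames share a common scene while still differing in a controlled, motion-like way.

```python
import numpy as np

def motion_latents(base_latent, num_frames, step=(1, 1)):
    """Return one latent per frame, each a shifted copy of the first.

    base_latent: array of shape (channels, height, width).
    step: hypothetical per-frame (dy, dx) shift, in latent pixels.
    """
    dy, dx = step
    frames = []
    for k in range(num_frames):
        # Frame k is the base latent translated by k * step. np.roll wraps
        # around the border, a simplification chosen for this sketch.
        frames.append(np.roll(base_latent, shift=(k * dy, k * dx), axis=(1, 2)))
    return np.stack(frames)

rng = np.random.default_rng(0)
# 4 channels, matching the latent space used by Stable Diffusion v1 models.
base = rng.standard_normal((4, 8, 8))
latents = motion_latents(base, num_frames=4)
print(latents.shape)
```

In a real pipeline each of these latents would then be denoised into an image by the text-to-image model. The sketch stops before that step and only illustrates why frames built from a shared starting point tend to stay consistent with one another.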
The new generative AI doesn’t just make videos from text descriptions: it can also be used to change the appearance of an existing video. In one example, a video of a swan was restyled by asking the AI to “make it Van Gogh Starry Night style.”
Unlike most research projects, which can take months or years to be deployed publicly, the PAIR text-to-video generative AI system will become customer-facing soon. Picsart says that it plans to launch new software products built on this generative AI’s framework in the coming weeks.
Picsart isn’t the only company making progress with text-to-video AI. Google has been developing a system of its own, Meta started working on one last fall, and last week Runway published its second-generation text-to-video generator, which was the first to become publicly available.
Image credits: PAIR