Google’s Video-to-Audio Tool Generates Music From Pixels

If you have seen clips from AI video generators such as OpenAI’s Sora, you will have noticed that they have no sound attached. Google’s DeepMind research laboratory may have come up with a solution.

Unveiling its video-to-audio (V2A) technology, Google says it has developed a tool that can generate synchronized audio from a video’s pixels. Editors can also add natural language text prompts if they wish.

Google released a series of example videos that were created with its AI video generator Veo and then given soundtracks “matching the characters and tone” by the V2A tool.

However, it’s not just AI videos that the V2A technology can be used on: Google DeepMind researchers say that it can be used on traditional footage, “including archival material, silent films and more.”

V2A can apparently generate an “unlimited number of soundtracks for any video input.” Text prompts can be used to guide the audio output: positive prompts steer it toward a desired sound, while negative prompts steer it away from a certain tone or style.

“This flexibility gives users more control over V2A’s audio output, making it possible to rapidly experiment with different audio outputs and choose the best match,” DeepMind writes in a blog post.

To build its model, the Google researchers chose a “diffusion-based approach” over an autoregressive architecture. The V2A system encodes the video input into a compressed representation, then the diffusion model iteratively refines the audio from random noise, a process that is guided by the visuals from the video. Finally, the audio output is decoded, turned into an audio waveform, and combined with the video data.

A diagram depicting a video-to-audio conversion process using AI. Video pixels and positive/negative prompts are encoded separately, processed through a diffusion model, compressed, decoded, and finally outputted as an audio waveform.
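The steps described above can be illustrated with a toy sketch. This is not DeepMind’s code or model: every function name here is hypothetical, the “encoder” simply averages pixel brightness, the “denoiser” is a hand-written stand-in for a trained neural network, and text prompts are left out entirely. It only shows the order of operations: encode the video, start from random noise, refine the noise under visual guidance, decode to a waveform, and pair the result with the video.

```python
import math
import random


def encode_video(frames):
    """Toy encoder (assumption): compress each frame, a list of pixel
    values in 0..1, down to a single number, its mean brightness."""
    return [sum(frame) / len(frame) for frame in frames]


def denoise_step(latent, video_code, blend=0.5):
    """Stand-in for a learned denoiser (assumption): move each noisy
    value part of the way toward the visual conditioning signal."""
    return [x + blend * (c - x) for x, c in zip(latent, video_code)]


def decode_audio(latent, samples_per_frame=4, tone_hz=440.0, sample_rate=8000):
    """Toy decoder (assumption): each latent value sets the amplitude
    of a sine tone for one frame's worth of audio samples."""
    waveform = []
    for i, amplitude in enumerate(latent):
        for j in range(samples_per_frame):
            t = (i * samples_per_frame + j) / sample_rate
            waveform.append(amplitude * math.sin(2 * math.pi * tone_hz * t))
    return waveform


def video_to_audio(frames, steps=8, seed=0):
    rng = random.Random(seed)
    video_code = encode_video(frames)                    # 1. compress the video
    latent = [rng.gauss(0.0, 1.0) for _ in video_code]   # 2. start from random noise
    for _ in range(steps):                               # 3. refine, guided by the visuals
        latent = denoise_step(latent, video_code)
    waveform = decode_audio(latent)                      # 4. decode to a waveform
    return {"frames": frames, "audio": waveform}         # 5. pair audio with the video


if __name__ == "__main__":
    clip = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]  # three tiny two-pixel "frames"
    result = video_to_audio(clip)
    print(len(result["audio"]), "audio samples for", len(clip), "frames")
```

Because the starting point is random noise, changing `seed` in this sketch changes the starting latent, which loosely mirrors how a diffusion-style generator can offer many candidate soundtracks for the same clip.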

Google says it trained the model on video, audio, and additional annotations, which help the model understand the link between a visual event and its sound.

The researchers believe their model to be novel because it can understand raw pixels, which makes text prompts optional.

“Also, the system doesn’t need manual alignment of the generated sound with the video, which involves tediously adjusting different elements of sounds, visuals, and timings,” DeepMind adds.

However, there are limitations since the audio output depends on the quality of the video input. Video distortion or other unwanted artifacts will have an effect on the quality of the audio.

Lip synchronization, which could be extremely beneficial to AI video creators, does not appear to have been mastered by V2A because the model can’t be “conditioned on transcripts.”

It sounds like a fascinating tool for video editors. But for now, it will remain exclusive to DeepMind’s researchers, who want it to undergo “rigorous safety assessment and testing” before it is made available to the wider public.

What DeepMind didn’t mention is where the V2A tool’s training data came from, but as noted by TechRadar, Google has a potential advantage thanks to owning the world’s biggest video-sharing platform, YouTube.