Microsoft’s VASA-1 AI Can Make Any Person’s Image Move and Speak

Split image comparing a modern woman speaking animatedly on the left with the Mona Lisa edited to have an open-mouth expression on the right.
Still images begin to talk and sing thanks to Microsoft’s VASA program.

Microsoft unveiled a new lip-syncing AI tool that transforms a still image of a person’s face into an animated clip of them talking or singing.

VASA-1 not only produces lip movements that are “exquisitely synchronized” with audio, but it also captures a “large spectrum” of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.

Microsoft developed a “holistic facial dynamics” and head movement generation model that works in a face latent space. The company says the approach “significantly outperforms previous methods comprehensively.”

VASA-1 is currently just a research demonstration; Microsoft has no plans to release it as a product or open an API to others. Essentially, the company just wants to show off its lip-syncing model.

The company says that VASA-1 accepts optional controls such as where the character should be looking, the crop of the subject’s head, and the emotion conveyed while talking: neutral, happy, angry, or surprised.

Microsoft demonstrated VASA-1 using AI images of people generated with DALL-E 3 or StyleGAN2, but real photographs could be used; the President of the United States, for example, could be made to appear to say something they never said, raising ethical questions around deepfakes and misinformation.

“Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications,” Microsoft says on the VASA-1 research page.

“It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans.

“We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection.

“Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.”

That is true: the examples Microsoft has posted still carry a touch of the uncanny valley. But not everyone is so media literate, and there are people out there who would believe a VASA-1 video is real.