This AI Can Make an Eerily Accurate Portrait Using Only Your Voice

Apr 04, 2022

Michael Zhang

Photographs are made with the help of light, but what if portraits of people could be made with the sound of their voices? AI researchers have been working on reconstructing the face of a person using only a short audio recording of that person speaking, and the results are eerily impressive.

Artificial intelligence scientists at MIT’S Computer Science and Artificial Intelligence Laboratory (CSAIL) first published about an AI algorithm called Speech2Face in a paper back in 2019.

“How much can we infer about a person’s looks from the way they speak?” the abstract reads. “[W]e study the task of reconstructing a facial image of a person from a short audio recording of that person speaking.”

An AI with Uncanny Results

The researchers first designed and trained a deep neural network using millions of videos from YouTube and the Internet showing people talking. During this training, the AI learned correlations between the sound of voices and how the speaker looked. These correlations allowed it to make best guesses as to the speaker’s age, gender, and ethnicity.

There was no human involvement in the training process, as researchers did not need to manually label any subsets of data — the AI was simply given a huge trove of videos and tasked with figuring out correlations between voice features and facial features.

Once trained, the AI was remarkably good at creating portraits based only on voice recordings that resembled what the speaker actually looked like.

To further analyze the accuracy of the facial reconstructions, the researchers built a “face decoder” that creates a standardized reconstruction of a person’s face from a still frame while ignoring “irrelevant variations” such as pose and lighting. This allowed the scientists to more easily compare the voice reconstructions with the actual features of the speaker.

Again, the AI’s results were strikingly close to the real faces in a large percentage of cases.

Weaknesses and Ethical Issues

There were some cases in which the AI had difficulty figuring out what the speaker may look like. Factors such as accent, spoken language, and voice pitch were things that caused “speech-face mismatches” in which gender, age, or ethnicity were incorrect.

People with high voices (including younger boys) were often identified as female while people with low voices were labeled as male. An Asian man speaking English resulted in a less Asian appearance than when he spoke Chinese.

The reconstructed face of an Asian man speaking English (left) versus the same man speaking Chinese (right).

“In some ways, then, the system is a bit like your racist uncle,” writes photographer Thomas Smith. “It feels it can always tell a person’s race or ethnic background based on how they sound — but it’s often wrong.”

The researchers do note that there are ethical considerations surrounding this project.

“Our model is designed to reveal statistical correlations that exist between facial features and voices of speakers in the training data,” they write on the project page. “The training data we use is a collection of educational videos from YouTube, and does not represent equally the entire world population. Therefore, the model—as is the case with any machine learning model—is affected by this uneven distribution of data.

“[…] [W]e recommend that any further investigation or practical use of this technology will be carefully tested to ensure that the training data is representative of the intended user population. If that is not the case, more representative data should be broadly collected.”

Real-World Applications

One possible real-world application of this AI could be to create a cartoon representation of a person on a phone or videoconferencing call when the person’s identity is unknown and they do not wish to share their actual face.

“Our reconstructed faces may also be used directly, to assign faces to machine-generated voices used in home devices and virtual assistants,” the researchers write.

Law enforcement could presumably also use the AI to create a portrait showing what a suspect likely looks like if the only evidence is a voice recording. However, government applications would undoubtedly be the subject of a great deal of controversy and debate regarding privacy and ethics.

While generating realistic and accurate portraits of people from only their voices is a fascinating concept and previously the stuff of science fiction, the researchers are not aiming for that type of technology as the ultimate goal of this AI algorithm.

“Note that our goal is not to reconstruct an accurate image of the person, but rather to recover characteristic physical features that are correlated with the input speech,” the paper states. “We have demonstrated that our method can predict plausible faces with the facial attributes consistent with those of real images.

“We believe that generating faces, as opposed to predicting specific attributes, may provide a more comprehensive view of voice face correlations and can open up new research opportunities and applications.”