AI Generates Accurate Images of Streets From Sound Recordings

Dec 03, 2024

Pesala Bandara

A street lined with red brick buildings and leafless trees, with cars parked on both sides. The road descends slightly, leading to a crosswalk and distant intersection. Street lamps and traffic signs are visible along the way.

A real street in Boston, Massachusetts.

Researchers have developed an AI system that turns an audio recording into an accurate image of the street that the sound came from.

A team of researchers at the University of Texas at Austin wanted to determine if audio clips alone are sufficient for AI to understand the visual characteristics of its environment, a skill once thought to be exclusive to humans.

The team used generative AI to successfully convert sounds from audio recordings into street-view images. According to a news release by the University of Texas, the visual accuracy of these generated images demonstrates that machines can replicate the human connection between audio and visual perception of environments.

In a paper published in Computers, Environment and Urban Systems, the team explains how it sampled 100 YouTube video and audio clips from cities in North America, Asia, and Europe. The researchers first used these clips to train an AI model on what various environments look and sound like, so that it could produce high-resolution images from audio input.

From there, the technology was fed 10-second, audio-only clips and asked to produce high-resolution images of what each setting looks like.

A chart comparing AI-generated and original images in urban and rural settings. It highlights variations in the percentage of sky and greenery present in different geographical environments.

The researchers then compared AI sound-to-image creations made from 100 audio clips to their respective real-world photos, using both human and computer evaluations. The computer evaluations compared the relative proportions of greenery, buildings, and sky between source and generated images.
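To make the computer evaluation concrete, here is a minimal Python sketch of how such a proportion comparison might work. It is not the researchers' code. It assumes each image has already been segmented into a grid of per-pixel labels, and the label names, the toy 4x4 grids, and the function names are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical class names; the study's actual label set is not given here.
CLASSES = ("greenery", "building", "sky")


def class_proportions(label_map):
    """Return each class's share of all pixels in a 2D grid of labels."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label_map contains no pixels")
    return {c: counts.get(c, 0) / total for c in CLASSES}


def proportion_gaps(source_map, generated_map):
    """Absolute difference in class share between two label grids."""
    src = class_proportions(source_map)
    gen = class_proportions(generated_map)
    return {c: abs(src[c] - gen[c]) for c in CLASSES}


# Toy 4x4 label grids standing in for a real photo and a generated image.
source = [
    ["sky", "sky", "sky", "sky"],
    ["building", "building", "greenery", "sky"],
    ["building", "building", "greenery", "greenery"],
    ["road", "road", "road", "road"],
]
generated = [
    ["sky", "sky", "sky", "building"],
    ["building", "building", "greenery", "sky"],
    ["building", "greenery", "greenery", "greenery"],
    ["road", "road", "road", "road"],
]

gaps = proportion_gaps(source, generated)
for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")
```

Smaller gaps mean the generated image devotes a similar share of the frame to that element as the real photo does. A real pipeline would get the label grids from an image segmentation model rather than writing them by hand.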

“Our study found that acoustic environments contain enough visual cues to generate highly recognizable streetscape images that accurately depict different places,” says Yuhao Kang, assistant professor of geography and the environment at UT and co-author of the study.

“This means we can convert the acoustic environments into vivid visual representations, effectively translating sounds into sights.”

The study also worked to determine how well humans could match audio to images. When given an audio clip and three images to choose from, humans were able to accurately predict the setting 80% of the time. The researchers say the success rate was similar to the AI’s rate of accurately generating an image of the environment.

Kang says that there could be numerous potential applications for the AI system.

“For instance, we may understand our soundscape, such as, how can we reduce noise,” Kang says. “We can also enrich our multi-sensory experiences. For instance, when you visit a specific place in a museum or in (virtual reality), we may only see the world, but now we can also generate its soundscape.”

Image credits: Header photo licensed via Depositphotos.