Microsoft’s New Image Captioning AI is More Accurate than Humans

AI researchers at Microsoft reached a major milestone this week: they managed to create a new “artificial intelligence system” that is, in many cases, actually better than a human at describing the contents of a photo. This could be a huge boon for blind and sight-impaired individuals who rely on screen readers and “alt text” when viewing images online.

While this might seem like one part of the prequel to Skynet, the development of a better image captioning AI has a lot of potential benefits, and warrants a bit of (cautious) celebration. As Microsoft explains on its blog: “[this] breakthrough in a benchmark challenge is a milestone in Microsoft’s push to make its products and services inclusive and accessible to all users.”

That’s because accurate automatic image captioning is widely used to generate so-called “alt text” for images on the Internet: the text that screen readers read aloud to describe an image for sight-impaired individuals, who rely on these accessibility options online and in many smartphone apps.

Of course, Microsoft is careful to point out that the system “won’t return perfect results every time.” But as you can see from the examples in the video below, it’s far more accurate than the previous iteration. There’s a wide gulf between describing an image as “a close up of a cat” and describing that same image as “a gray cat with its eyes closed.”

“Ideally, everyone would include alt text for all images in documents, on the web, in social media – as this enables people who are blind to access the content and participate in the conversation. But, alas, people don’t,” explains Saqib Shaikh, a software engineering manager for Microsoft’s AI group. “So, there are several apps that use image captioning as a way to fill in alt text when it’s missing.”

These apps can take advantage of the new system to generate accurate captions that “surpass human performance,” a claim based on the nocaps image captioning benchmark, which compares AI-generated captions against human-written captions for the same set of images.


Given the potential accessibility benefits of the improved captioning system, Microsoft has rushed this model into production and has already integrated it into Azure’s Cognitive Services, enabling interested developers to begin using the tech right away.
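For developers curious what that looks like in practice, here is a minimal sketch of requesting a caption from the Computer Vision service in Azure Cognitive Services via its REST “analyze” call with the Description feature. The endpoint, subscription key, API version, and sample image URL below are placeholders for illustration, and the exact request and response details may differ from Microsoft’s current documentation:

```python
import requests

# Placeholder values for illustration; substitute your own Azure resource's
# endpoint and key from the Azure portal.
ENDPOINT = "https://<your-resource-name>.cognitiveservices.azure.com"
SUBSCRIPTION_KEY = "<your-subscription-key>"


def caption_image(image_url: str) -> list[str]:
    """Ask the Computer Vision service to describe an image at a public URL."""
    response = requests.post(
        f"{ENDPOINT}/vision/v3.1/analyze",
        params={"visualFeatures": "Description", "language": "en"},
        headers={
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_url},
    )
    response.raise_for_status()

    # The Description feature returns one or more candidate captions,
    # each with a confidence score between 0 and 1.
    captions = response.json()["description"]["captions"]
    return [f'{c["text"]} ({c["confidence"]:.2f})' for c in captions]


if __name__ == "__main__":
    # Hypothetical image URL, used only to show the call shape.
    for caption in caption_image("https://example.com/photos/gray-cat.jpg"):
        print(caption)
```

Each candidate caption comes back with a confidence score, so an app generating alt text can decide whether a machine-written description is trustworthy enough to show or should be flagged for a human to review.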

To learn more about this system and how it works, head over to the Microsoft blog or read up on the nitty-gritty details here. Suffice it to say this isn’t exactly Skynet, but we can be pretty sure that future Terminators will be able to describe your photo library better than you can…

(via Engadget)