New Machine Vision Algorithm Vastly Improves Robotic Object Recognition

A team of scientists has created an algorithm that can label objects in a photograph with single-pixel accuracy without human supervision.

Called STEGO, it is a joint project from MIT’s CSAIL, Microsoft, and Cornell University. The team hopes it has solved one of the hardest tasks in computer vision: assigning a label to every pixel in the world, without human supervision.

Computer vision is a field of artificial intelligence (AI) that enables computers to derive meaningful information from digital images.

STEGO learns something called “semantic segmentation,” which is the process of assigning a label to every pixel in an image. It’s an important skill for today’s computer-vision systems because, as photographers know, images can be cluttered with objects.

Normally, creating training data that teaches computers to read an image involves humans drawing boxes around specific objects within it. For example, a person might draw a box around a cat in a field of grass and label everything inside the box “cat.”

The semantic segmentation technique will label every pixel that makes up the cat, and won’t get any grass mixed up. In Photoshop terms, it’s like using the Object Selection tool rather than the Rectangular Marquee tool.
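To make the distinction concrete, here is a minimal sketch in Python (a toy 8×8 “image” with made-up class IDs, not real training data) contrasting a bounding-box annotation with a per-pixel segmentation mask:

```python
import numpy as np

# Toy 8x8 image annotation, two hypothetical classes: 0 = grass, 1 = cat.
H, W = 8, 8

# Bounding-box annotation: one rectangle (top, left, bottom, right) that
# inevitably includes grass pixels around the cat.
cat_box = (2, 2, 6, 6)

# Semantic segmentation annotation: a label for every individual pixel.
mask = np.zeros((H, W), dtype=np.uint8)   # everything starts as grass
mask[3, 3:5] = 1                          # pixels that are actually cat
mask[4, 2:6] = 1
mask[5, 3:5] = 1

top, left, bottom, right = cat_box
box_area = (bottom - top) * (right - left)   # 16 pixels inside the box
cat_pixels = int(mask.sum())                 # only 8 of them are truly cat

print(f"box covers {box_area} pixels, but only {cat_pixels} are cat")
```

The box necessarily sweeps in background pixels; the mask identifies exactly which pixels belong to the cat.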

The problem with this human-driven approach is that it demands thousands, if not hundreds of thousands, of labeled images to train an algorithm. A single 256×256-pixel image is made up of 65,536 individual pixels, and trying to label every pixel in 100,000 images borders on the absurd.
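The arithmetic is easy to check; the sketch below runs the numbers, with the one-label-per-second labeling rate an illustrative assumption rather than a measured figure:

```python
# Back-of-the-envelope cost of exhaustive per-pixel labeling.
pixels_per_image = 256 * 256            # 65,536 pixels in one small image
num_images = 100_000
total_labels = pixels_per_image * num_images

print(f"{total_labels:,} pixel labels")  # 6,553,600,000 labels

# Even at an optimistic one label per second, around the clock:
years = total_labels / (60 * 60 * 24 * 365)
print(f"roughly {years:,.0f} person-years of labeling")  # ~208 years
```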


Seeing The World

However, emerging technologies such as self-driving cars and medical diagnostics require machines to be able to read the world around them. Humans also want cameras to better understand the pictures they take.

Mark Hamilton, lead author of the new paper about STEGO, suggests that the technology could be used to scan “emerging domains” where humans don’t even know what the right objects should be.

“In these types of situations where you want to design a method to operate at the boundaries of science, you can’t rely on humans to figure it out before machines do,” he says, speaking to MIT News.

STEGO was trained on a variety of visual domains, from home interiors to high-altitude aerial shots. The new system doubled the performance of previous semantic segmentation schemes, closely aligning with what humans judged the objects to be.

“When applied to driverless car datasets, STEGO successfully segmented out roads, people, and street signs with much higher resolution and granularity than previous systems. On images from space, the system broke down every single square foot of the surface of the Earth into roads, vegetation, and buildings,” writes the MIT CSAIL team.
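How closely a system “aligns with what humans judged” is commonly scored with per-pixel agreement metrics such as mean intersection-over-union (mIoU). The sketch below computes mIoU between a predicted label map and a human-labeled one; the 4×4 maps and class IDs are toy stand-ins, not STEGO’s actual outputs or the paper’s evaluation code:

```python
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union between two per-pixel label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 4x4 maps, hypothetical classes: 0 = road, 1 = person, 2 = sign.
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 2, 2],
                  [0, 0, 2, 2]])
pred = truth.copy()
pred[0, 2] = 0                           # one mislabeled pixel

print(f"mIoU: {mean_iou(pred, truth, num_classes=3):.3f}")  # 0.880
```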

The Algorithm Can Still Be Tripped Up

STEGO still struggled to distinguish between foodstuffs like grits and pasta. It was also confused by odd images: in one, a banana sitting on a phone receiver led to the receiver being labeled “foodstuff” instead of “raw material.”

Despite the machine still grappling with what’s a banana and what isn’t, the algorithm represents the “benchmark for progress in image understanding,” according to Andrea Vedaldi of Oxford University.

“This research provides perhaps the most direct and effective demonstration of this progress on unsupervised segmentation.”


Image credits: Header photo licensed via Depositphotos.
