When there’s something in the news regarding photography, like Stanford’s open source camera, I’m usually not the first to post about it. However, since I have a background in both photography and computer science, hopefully I can provide some unique insight into certain news stories.
The big story this past week has been PhotoSketch, a research project out of China’s prestigious Tsinghua University. The claim is that this program can take your rough, labeled sketches of various scenes and automatically turn them into photo montages by combining appropriate photographs obtained from the web. The following video, posted to Vimeo to demonstrate the technology, has drawn more than half a million views in the past week.
There are two main features that allow PhotoSketch to work. The first is filtering out undesirable images to obtain suitable ones, and the second is a novel blending algorithm that creates a seamless composition.
The key idea is that the user of the program actually does a lot of the hard work, making the job of the program a lot simpler. What’s great is that the user doesn’t even realize they’re doing a lot of work. A similar example might be CAPTCHAs, those security keys you type in to verify you’re a human. Reading one is pretty trivial for a human, but (currently) very difficult for a computer.
Likewise, labeling the semantics of a photo is something very difficult for computers to do. If you gave the program unlabeled photographs, how would the program distinguish between a man reaching for something and a man throwing a ball, if both have similar shape and form? A computer can determine shapes and colors, but has an extremely hard time figuring out the meaning of photographs without human participation.
Since the user provides both a shape and a label, the problem becomes a shape matching problem, which isn’t nearly as difficult. The program only has to search through images that humans have previously labeled as being suitable.
To make it easier to extract the desired subjects from photographs, the filtering process actually throws away images that don’t have clear, uncluttered backdrops. For example, a tiger that blends into grass would be discarded, as would a Lego piece among many Lego pieces. This makes sense, since we all know an object is much easier to isolate from a photo when it’s very distinct from the background. In Photoshop you can simply use the magic wand or quick selection tools to eliminate the background.
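To make the "uncluttered backdrop" idea concrete, here is a minimal Python sketch. It is not the method from the paper. It assumes a grayscale image stored as a list of rows and a known bounding box around the subject, and it uses the spread of pixel values outside that box as a crude stand-in for clutter. The names background_spread and is_uncluttered, and the max_spread threshold, are all made up for illustration.

```python
from statistics import pstdev


def background_spread(gray, box):
    """Standard deviation of the pixel values outside the subject box.

    gray is a list of rows of grayscale values (0-255).
    box is (top, left, bottom, right), with bottom and right exclusive.
    Both the representation and the measure are illustrative assumptions.
    """
    top, left, bottom, right = box
    outside = [
        value
        for r, row in enumerate(gray)
        for c, value in enumerate(row)
        if not (top <= r < bottom and left <= c < right)
    ]
    return pstdev(outside) if outside else 0.0


def is_uncluttered(gray, box, max_spread=10.0):
    """Keep an image only if its backdrop is close to uniform."""
    return background_spread(gray, box) <= max_spread


# A subject on a flat backdrop versus one on a busy backdrop.
plain = [[100, 100, 100], [100, 0, 100], [100, 100, 100]]
busy = [[0, 255, 0], [255, 50, 255], [0, 255, 0]]
subject_box = (1, 1, 2, 2)

print(is_uncluttered(plain, subject_box))
print(is_uncluttered(busy, subject_box))
```

A real system would have to find the subject first and would use a far better clutter measure, but the principle is the same: images whose backgrounds vary too much get thrown out before any matching happens.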
Now I’ll briefly describe the various steps that go into making the program work.
Obtaining the Background
The main observation for selecting a background is that if you find all the images with a certain label (e.g. beach, mountain, or meadow), you can group them by similarity. The researchers assume that the largest “cluster” of similar images is probably what the user is looking for, so they choose the 100 background images that are most similar to the characteristics of this cluster.
Next, they take these 100 images, and throw out the ones that don’t have the horizon line in the correct place. With the remaining images, they filter out images that have non-uniform backgrounds in order to have clean, open spaces on top of which the item images can be placed. At the end of this stage, they keep about 20 background images as possible candidates.
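The background steps above can be sketched in a few lines of Python. This is only my rough illustration, not the researchers' code. It assumes each image has already been reduced to a small feature vector and an estimated horizon height (a fraction of the image height), and it swaps in a naive greedy grouping for whatever clustering the paper actually uses. The field names, radius, and horizon_tol are hypothetical.

```python
from math import dist


def largest_cluster(features, radius):
    """Greedy grouping: a vector joins the first cluster whose seed is within radius."""
    clusters = []  # each cluster is a list of indices; the first index is its seed
    for i, feature in enumerate(features):
        for cluster in clusters:
            if dist(feature, features[cluster[0]]) <= radius:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return max(clusters, key=len)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(v[d] for v in vectors) / count for d in range(len(vectors[0])))


def select_backgrounds(images, sketch_horizon, radius=0.5, keep=100, horizon_tol=0.1):
    """Rank images by closeness to the biggest cluster, then filter by horizon."""
    features = [img["feature"] for img in images]
    cluster = largest_cluster(features, radius)
    center = centroid([features[i] for i in cluster])
    ranked = sorted(images, key=lambda img: dist(img["feature"], center))[:keep]
    return [img for img in ranked if abs(img["horizon"] - sketch_horizon) <= horizon_tol]


images = [
    {"name": "a", "feature": (0.0, 0.0), "horizon": 0.50},
    {"name": "b", "feature": (0.1, 0.0), "horizon": 0.52},
    {"name": "c", "feature": (0.2, 0.1), "horizon": 0.90},
    {"name": "d", "feature": (5.0, 5.0), "horizon": 0.50},
]
picked = select_backgrounds(images, sketch_horizon=0.5, keep=3)
print([img["name"] for img in picked])
```

In this toy data, image d is an outlier that never makes the top three, and image c sits in the popular cluster but has its horizon in the wrong place, so only the images that pass both tests survive.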
Selecting Scene Items
Once candidate background images have been obtained, the program searches for images that match the labels of the items in the scene. As with background selection, images that are too complicated or too cluttered are filtered out. The items need to be very distinct from the background in order for the program to isolate them.
The program then compares the extracted items with the shape the user drew, if a shape was provided. Images that don’t match are discarded, and the ones that do match are clustered together, just like in background selection. Images that both match the shape well and are part of a popular cluster are selected as candidate images.
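Here is a small Python sketch of the shape comparison step. It assumes the user's drawing and each extracted item have already been turned into binary masks of the same size, and it scores them with intersection over union. That measure is my own simple stand-in, not necessarily the shape matching used in the paper, and the function names and threshold are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as lists of rows."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_items(sketch_mask, candidates, threshold=0.5):
    """Return candidate names that overlap the sketch well enough, best first."""
    scored = [(iou(sketch_mask, mask), name) for name, mask in candidates]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


sketch = [[1, 1], [0, 0]]
candidates = [
    ("x", [[1, 1], [0, 0]]),  # same shape as the sketch
    ("y", [[1, 0], [0, 0]]),  # covers half of the sketch
    ("z", [[0, 0], [1, 1]]),  # no overlap with the sketch
]
print(match_items(sketch, candidates))
```

The survivors of a filter like this would then be clustered the same way the backgrounds were, so that an item has to both fit the drawn shape and belong to a popular group of similar images.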
Blending the Images
The novel methods used to blend the candidate images together are actually one of the main areas of research for this project. Everything I’ve explained prior to this section isn’t very groundbreaking, while everything related to this section is too complicated and technical to be easily explained. I’ll just say a lot of work goes into making the images not look completely absurd against the selected backgrounds.
Real or Fake?
What I find funny is how many of the comments found around the web regarding PhotoSketch claim that it’s fake. If it were fake, it would be one of the greatest hoaxes of all time, since the research was done at a prestigious university and will also be presented at the ACM SIGGRAPH Asia conference in December.
However, this doesn’t mean the program is as perfect as the video demonstrations and examples published make it seem. Here are some examples from the paper of when the program generates a semantically ridiculous photo montage:
Anything automatically generated will have semantic flaws that create absurd and nonsensical images every so often. The examples provided by the PhotoSketch group are simply examples of when the program successfully does what it’s supposed to do (which is hopefully quite often). Does it always create images that look as nice or make as much sense as the examples? No, but the examples provide a good demonstration of the technology.
PhotoSketch is a pretty amazing idea that deserves all the attention it’s getting. It’s also a taste of what’s to come with regard to computer graphics technologies. I’m sure we’re going to see more and more mind-boggling research projects and commercial products in the coming years.
Though the group is still working on an online demonstration, the research group’s website contains the user studies and the research paper.
Image credits: The images used in this article were obtained from the research website and their paper.