Well, you can add “captioning photos” to the list of jobs robots will soon be able to do just as well as humans. After some training, the latest version of Google’s “Show and Tell” algorithm can describe the contents of a photo with roughly 94% accuracy.
Show and Tell is in the news today because Google open-sourced the model yesterday. You’ll have to train it yourself, but the source code is available for anybody who would like to try.
It’s amazing how far machine learning, especially as it relates to photography, has come in the past several years. “This release contains significant improvements to the computer vision component of the captioning system, is much faster to train, and produces more detailed and accurate descriptions compared to the original system,” explains Google. How accurate? 93.9%, to be exact, which is pretty incredible.
Consider this old XKCD comic:
It’s easy to tell where a photo has been taken, but training a computer to “see” a photo and describe the contents seemed all but impossible until relatively recently. For Google to be able to look at a photo and tell that it shows “A person on a beach flying a kite” was unthinkable a decade ago:
But that’s what Google has achieved using this new framework and some good old human training. By showing the AI pre-captioned images of a specific scene, Google was able to train the algorithm to caption similar (but not identical) scenes on its own, without help:
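The real Show and Tell system pairs a deep convolutional image encoder with an LSTM that generates the caption word by word. As a loose, toy illustration of the underlying idea (learn from captioned examples, then caption a similar-but-not-identical image), here is a minimal nearest-neighbour sketch in plain Python. The feature vectors, captions, and function names below are all invented for illustration; they are not part of Google’s code:

```python
import math

# Toy "training set": image feature vectors (stand-ins for the embeddings
# a CNN would produce) paired with human-written captions. All values
# and captions here are invented examples.
TRAINING_EXAMPLES = [
    ([0.9, 0.1, 0.8], "A person on a beach flying a kite"),
    ([0.1, 0.9, 0.2], "A dog catching a frisbee in a park"),
    ([0.5, 0.5, 0.1], "Two people riding bikes down a street"),
]

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def caption(image_features):
    """Caption a new image by reusing the caption of the most
    similar training image."""
    best = max(TRAINING_EXAMPLES,
               key=lambda ex: cosine_similarity(ex[0], image_features))
    return best[1]

# A new photo whose features are close to the beach example,
# but not identical to it:
print(caption([0.8, 0.2, 0.7]))
# → A person on a beach flying a kite
```

The real system goes much further, of course: instead of copying a stored caption, its language model composes new sentences word by word, which is why it can describe scenes it has never seen captioned verbatim.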
Google hopes open sourcing the advanced model will “push forward” research in this field. For us photographers, it’s one step closer to auto-tagging and auto-captioning systems that mean you’ll never struggle to dig up an old photo from your archives again.