Microsoft recently published an article about the advances it is making in developing technology that can automatically caption pictures. (See here.)
However, from the standpoint of finding images on the Internet, there is one big flaw in where they are headed. Captions are not what’s needed. What’s needed is a way to find an image that fulfills a specific need among a huge pile of images that might all be appropriately described using the same words.
In most cases there will be a huge number of choices that can reasonably have the same caption. Nobody wants to take the time to look through all of them. Image users (customers) want (need) someone to narrow the search for them.
One of Microsoft’s examples is “A woman holding camera in a crowd.” To get there they first identify elements in the picture with words. Then they use the words to create sentences. And finally they rank the sentences and pick the one that seems most logical for the caption.
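To make those three steps concrete, here is a minimal Python sketch. It is not Microsoft’s system. The word scores, the candidate sentences and the scoring rule (add up the confidence of every detected word that appears in a sentence) are all made up for illustration.

```python
# Step 1 (assumed output): words a detector found, with made-up confidences.
detected = {"woman": 0.95, "camera": 0.90, "crowd": 0.85, "cat": 0.40}

# Step 2 (assumed output): candidate sentences built from those words.
candidates = [
    "a woman holding a camera in a crowd",
    "a woman holding a cat",
    "a crowd of people",
]

def score(sentence, detected):
    """Toy ranking rule: sum the confidence of each detected word in the sentence."""
    words = set(sentence.split())
    return sum(conf for word, conf in detected.items() if word in words)

def best_caption(candidates, detected):
    """Step 3: pick the highest-scoring sentence as the caption."""
    return max(candidates, key=lambda s: score(s, detected))

print(best_caption(candidates, detected))
```

With these invented numbers the camera-and-crowd sentence outranks the cat sentence, simply because more high-confidence words support it.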
In the example shown, the computer vision software thought the woman’s hair was a cat.
Someone searching for just the word “cat” would not be happy with this picture. Fortunately, the computer program decided that “A woman holding a cat” was not the proper caption for this image, but since it thinks there is a cat in the picture, “cat” becomes a legitimate keyword to include in a list of tags.
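The tag problem can be shown in a few lines of Python. This is only a sketch under assumptions: the tag list and confidence numbers are invented, and `matches` is a hypothetical helper, not any stock site’s search code.

```python
# Invented tags for the example picture; "cat" is the detector's mistake.
image_tags = {"woman": 0.95, "camera": 0.90, "crowd": 0.85, "cat": 0.40}

def matches(query, tags, min_confidence=0.0):
    """True if every query term is a tag with at least min_confidence."""
    terms = query.lower().split()
    # A missing tag gets -1.0 so it can never pass the threshold.
    return all(tags.get(term, -1.0) >= min_confidence for term in terms)

# Once "cat" is in the tag list, a plain keyword search returns this picture.
print(matches("cat", image_tags))
# Only a search that also looks at confidence would leave it out.
print(matches("cat", image_tags, min_confidence=0.5))
```

A keyword list with no confidence attached treats the mistaken “cat” exactly like the correct “woman,” which is why the wrong tag ends up in front of someone searching for cats.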
Searching For Images
While I didn’t expect to find this specific picture, I would hope that if I used basically the same words I would find something similar. I went to Shutterstock.com, searched for “woman holding camera crowd” and got 151 images. Most are either “woman holding camera” or “woman in crowd,” not all four elements together. Most aren’t at all appropriate to the image I have in my mind’s eye.
At iStock I got only 56 returns, but many of them are not appropriate. Some are of men with a camera. At gettyimages.com I got 138 images that seem to have little or no relationship to what I am looking for. Then I went to Flickr and started my search with “woman camera crowd” because I don’t think many image searchers would use the word “holding.” I got 5,983 returns. This is way more images than anyone would want to review, and most of the early returns were not similar at all to the image I was looking for. When I added “holding” it narrowed the search to 672 images, but nothing like what I wanted.
Also, it seems to me that “brown” is just as dominant a color as “purple” in the original picture. Flickr allows me to easily narrow the search by those dominant colors, but using either color did not narrow the search in any useful way. I also got a lot of pictures that had either no camera or no crowd in them.
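I don’t know how any of these sites actually match search words to keywords, but results with no camera or no crowd are what you would expect if a picture only has to match some of the words. A small Python sketch, using an invented four-picture catalog and two hypothetical search functions, shows the difference:

```python
# Invented catalog: picture name -> keywords attached to it.
catalog = {
    "img1": {"woman", "holding", "camera"},
    "img2": {"woman", "crowd"},
    "img3": {"woman", "holding", "camera", "crowd"},
    "img4": {"man", "camera", "crowd"},
}

def search_any(query, catalog):
    """Return pictures whose keywords share at least one word with the query."""
    terms = set(query.lower().split())
    return sorted(name for name, keywords in catalog.items() if terms & keywords)

def search_all(query, catalog):
    """Return pictures whose keywords contain every word in the query."""
    terms = set(query.lower().split())
    return sorted(name for name, keywords in catalog.items() if terms <= keywords)

query = "woman holding camera crowd"
print(search_any(query, catalog))  # every picture, including the man with a camera
print(search_all(query, catalog))  # only the picture with all four elements
```

Even the stricter search only guarantees that the words are present. It says nothing about whether the picture is any good or fits the need, which is the part an editor still has to do.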
Finally, I went to images.google.com and images.bing.com. The results in both these searches were even more disappointing.
The key is not just writing generic captions for pictures you’ve found or have in hand. The key is finding the right pictures among a host of other images of wide-ranging quality and relevance that legitimately use the same words to describe their elements.
If this is the best technology has to offer, it will be a long time before there is anything that will help those searching the Internet for images narrow their searches, and a long time before customers stop needing editors to help them find useful images.