Some of the most innovative ideas in artificial intelligence are methods for obtaining training data that don’t involve going out and collecting, say, 30,000 pictures of can openers. The most successful Big Data players focus on ways to get other people to provide the best data.
When Google got started, every search engine was a mere “text” search engine. You typed in a word and it found that word on any searchable page on the internet. The goal was “results,” period. Your search engine got a million results? Ha! Ours got ten million results!
But that wasn’t Google’s model. Instead, Google engineers realized that textual information actually comprises only a small portion of the information within web pages. What was really valuable were the links on the pages. Why? Because links represent what other people think about a website.
People tend to think a lot about themselves. They believe that their content is the best and therefore should be the first result from a search engine. However, Google recognized that while all site managers believed they had the best sites for their own information, they were actually quite picky about the information on other sites. Linking is a very intentional act by the author of a website. Web authors don’t tend to link to someone else’s site unless they think that the information is important or meaningful. Thus, by focusing on the way that pages were linked and not just on their content, Google could get the authors of pages on the web to tell them which web pages they thought were most important.
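The idea of letting links act as votes can be sketched in a few lines of Python. This is only an illustration, not Google's actual system: it is a simplified PageRank-style iteration, and the function name `rank_pages`, the damping value, and the toy site names are all assumptions made for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by who links to them (simplified PageRank-style sketch).

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its importance among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical toy web: three different authors chose to link to c.example.
toy_web = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

scores = rank_pages(toy_web)
best = max(scores, key=scores.get)
print(best)  # c.example comes out on top
```

Nobody in this toy web was asked to rate anyone. The ranking falls out of decisions the authors had already made about whom to link to.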
The most successful AI projects tend to follow a similar pattern. One of AI’s biggest needs is lots of data, and one of the most important tasks is finding ways to get people to provide the best data… for free.
Currently, Facebook is using the hashtags applied to Instagram photos to train AI algorithms that detect specific types of objects in images:
Having so many images for training helped Facebook’s team set a new record on a test that challenges software to assign photos to 1,000 categories including cat, car wheel, and Christmas stocking. Facebook says that algorithms trained on 1 billion Instagram images correctly identified 85.4 percent of photos on the test, known as ImageNet; the previous best was 83.1 percent, set by Google earlier this year. Tom Simonite, “Your Instagram #Dogs and #Cats Are Training Facebook’s AI” at Wired
Basically, Facebook is taking all of the photos that are tagged with “dog” and “cat” and using them to train its software to identify dogs and cats in other photos. The Facebook engineers know that their training data contains pictures of dogs and cats because the users told them so. For free!
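As a rough sketch of how user-supplied hashtags can become training labels, consider the following. The post structure, the field names `image_id` and `hashtags`, and the function `build_training_set` are hypothetical, invented for illustration; nothing here reflects Facebook's real pipeline.

```python
def build_training_set(posts, wanted_labels):
    """Turn hashtagged posts into (image_id, label) training pairs.

    `posts` is a list of dicts with hypothetical keys "image_id" and
    "hashtags"; `wanted_labels` is the set of labels we want to learn.
    """
    examples = []
    for post in posts:
        # Normalize "#Dog" and "#dog" to the same tag.
        tags = {tag.lstrip("#").lower() for tag in post["hashtags"]}
        matches = tags & wanted_labels
        # Keep only posts that point to exactly one wanted label;
        # a photo tagged both #dog and #cat is ambiguous, so skip it.
        if len(matches) == 1:
            examples.append((post["image_id"], next(iter(matches))))
    return examples


# Made-up posts standing in for user uploads.
posts = [
    {"image_id": "img1", "hashtags": ["#Dog", "#park"]},
    {"image_id": "img2", "hashtags": ["#cat"]},
    {"image_id": "img3", "hashtags": ["#sunset"]},
    {"image_id": "img4", "hashtags": ["#dog", "#cat"]},
]

training_set = build_training_set(posts, {"dog", "cat"})
print(training_set)  # [('img1', 'dog'), ('img2', 'cat')]
```

The resulting pairs are what a supervised image classifier would be trained on. The labeling work itself was done by the users when they typed their hashtags.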
The best data mining comes from finding ways to get other people to give you data without even realizing that is what they are doing. Data scientist Luis von Ahn is a master of this. He has invented numerous games whose object coincides with providing von Ahn’s companies with data that they can use to train their AI:
Like Tom Sawyer, von Ahn has found a simple and mischievous solution: turn the task into a game. Computer solitaire eats up billions of man-hours a year, he points out, and does nobody any good. But he says his “games with a purpose” will accomplish all sorts of useful tasks. Players will translate documents from one language to another or make it easier for blind people to navigate the Web—all while having fun. And unless they pay attention to the fine print, they may not even know they’re doing good. Polly Shulman, “The Player” at Smithsonian Magazine
One example is von Ahn’s “ESP” game. In this online game, two people are pulled together at random but cannot communicate with each other. They are each shown the same picture. The goal is to write words that they think the other user will also write, winning points for the words that both parties write.
So, to the users, this is a game. However, for an AI data specialist, it is training gold. The data specialists have gotten users to tag photos for them without being paid to do so. Given a photo, the words that the two users agree upon are the things that the photo is most likely to represent. So, if it is a picture of the sun, both users will probably say “sun”. Thus, the photo gets marked as a photo of the sun. This can be used directly (e.g., pull up this picture if a user searches for the word “sun”), or it can be used as training data for an AI (e.g., look at all of the photos that are marked as “sun”, and try to build a detector for “sun”).
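The agreement rule described above is simple enough to sketch in code. This is a minimal illustration under stated assumptions, not von Ahn's implementation: the function names `agreed_labels` and `build_index`, the round format, and the sample guesses are all made up for the example.

```python
def agreed_labels(words_a, words_b):
    """Return the words both players typed, in player A's order."""
    normalized_b = {word.strip().lower() for word in words_b}
    agreed = []
    for word in words_a:
        key = word.strip().lower()
        # A word counts only if both players wrote it independently.
        if key in normalized_b and key not in agreed:
            agreed.append(key)
    return agreed


def build_index(rounds):
    """Map each agreed label to the images it was applied to.

    `rounds` is a list of (image_id, words_a, words_b) tuples,
    a hypothetical record of one game round per image.
    """
    index = {}
    for image_id, words_a, words_b in rounds:
        for label in agreed_labels(words_a, words_b):
            index.setdefault(label, []).append(image_id)
    return index


# Made-up rounds: the players agree on photo1 but not on photo2.
rounds = [
    ("photo1", ["sun", "sky", "yellow"], ["Sun", "bright", "sky"]),
    ("photo2", ["dog", "grass"], ["puppy", "ball"]),
]

index = build_index(rounds)
print(index)  # {'sun': ['photo1'], 'sky': ['photo1']}
```

Looking up `index["sun"]` covers the direct search use, and the same label-to-photo pairs could serve as training data for a “sun” detector. Requiring two strangers to agree acts as a cheap quality filter, since a word only one player typed is dropped.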
For AI to work you need data. The most innovative players in the AI space have learned how to get users to willingly and gladly provide them with data for free.
Jonathan Bartlett is the Research and Education Director of the Blyth Institute.