From James Zou and Londa Schiebinger at Nature:
When Google Translate converts news articles written in Spanish into English, phrases referring to women often become ‘he said’ or ‘he wrote’. Software designed to warn people using Nikon cameras when the person they are photographing seems to be blinking tends to interpret Asians as always blinking. Word embedding, a popular algorithm used to process and analyse large amounts of natural-language data, characterizes European American names as pleasant and African American ones as unpleasant.
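The word-embedding finding quoted above is typically measured by comparing association scores between word vectors. Here is a minimal sketch of that measurement; the 3-dimensional vectors and the names are invented purely for illustration (real embeddings such as word2vec or GloVe are learned from large corpora):

```python
import math

# Hand-made 3-d vectors standing in for learned word embeddings.
# Every value here is invented to illustrate the measurement, not real data.
emb = {
    "pleasant":   [0.8, 0.2, 0.1],
    "unpleasant": [-0.7, 0.1, 0.6],
    "name_a":     [0.7, 0.3, 0.1],   # hypothetical name vector
    "name_b":     [-0.5, 0.2, 0.5],  # hypothetical name vector
}

def cosine(u, v):
    """Cosine similarity: +1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word):
    """How much closer `word` sits to 'pleasant' than to 'unpleasant'."""
    return cosine(emb[word], emb["pleasant"]) - cosine(emb[word], emb["unpleasant"])

# A positive score means the embedding places the word nearer 'pleasant';
# bias audits of real embeddings compare such scores across groups of names.
print(association("name_a") > 0)  # True for this toy data
print(association("name_b") < 0)  # True for this toy data
```

The embedding never decides anything; the "bias" is just geometry inherited from whatever text the vectors were trained on.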
Now where, we wonder, would a mathematical formula have learned that?
Maybe it was listening to the wrong instructions back when it was just a tiny bit?
Seriously, machine learning, we are told, depends on absorbing datasets of billions of words, annotated by graduate students or by crowdsourced workers: “Such methods can unintentionally produce data that encode gender, ethnic and cultural biases.”
Mmmm, yes. If we crowdsourced to Twitter, we’d probably get lots of responses but…
Much of the bias is unintentional, of course; it is an artifact of gathering data wherever it is easiest to find:
More than 45% of ImageNet data, which fuels research in computer vision, comes from the United States, home to only 4% of the world’s population. By contrast, China and India together contribute just 3% of ImageNet data, even though these countries represent 36% of the world’s population. More.
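The quoted percentages can be turned into a simple representation ratio (a group’s share of the data divided by its share of the world’s population), which makes the skew explicit:

```python
# Representation ratio: share of ImageNet data divided by share of world
# population, using the percentages quoted above.
us_ratio = 45 / 4        # United States: 45% of data, 4% of population
cn_in_ratio = 3 / 36     # China + India combined: 3% of data, 36% of population

print(us_ratio)                # 11.25 -> over-represented by more than 11x
print(round(cn_in_ratio, 2))   # 0.08  -> under-represented by roughly 12x
```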
For a similar reason, much social psychology research is problematic because the population studied has been psychology students—hardly a random sample of the residents of a community.
The algorithm is not doing any independent thinking or having independent experiences. Thus, if the same strings of information keep coming up, it “learns” that information as generally true, even if the frequency is mainly the outcome of the amount and type of attention paid to the subject. Or the algorithm can “learn” that men considerably outnumber women because of a convention in English grammar where pronouns or examples default to “he” in cases of uncertainty or indifference to outcome. Or because men may be more likely to speak up than women in situations that are recorded.
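The frequency effect described above can be sketched in a few lines: a purely count-based “learner,” trained on a skewed corpus, defaults to the majority phrasing whenever it must guess. The corpus and its skew below are invented for illustration; real systems are far more complex, but the underlying dependence on frequency is the same:

```python
from collections import Counter

# A tiny invented corpus standing in for billions of words of training text;
# the skew toward "he said" mimics the reporting conventions described above.
corpus = [
    "he said", "he said", "he wrote", "he said",
    "she said", "she wrote",
]
counts = Counter(corpus)

def most_likely(candidates):
    """A frequency-only 'learner': pick whichever phrasing it saw most often."""
    return max(candidates, key=lambda phrase: counts[phrase])

# Asked to choose a pronoun with no other information, the model simply
# reproduces the imbalance in its data.
print(most_likely(["he said", "she said"]))  # he said
```

Nothing in the code “prefers” men; the output is an echo of whatever imbalance the corpus happened to contain.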
Can the bias problem be addressed? Yes, but usually only after someone gets upset about a specific instance. Watson, after all, did not “know” that Toronto is Canada’s largest city, not an American one. Most Americans well-informed enough to risk Jeopardy would not have made that mistake. No one, it seems, had added to the mix the information that a large city in North America need not be in the United States. Presumably, someone has done so by now, but new sources of possible misinformation arise daily in the meantime.
The good news is that we human beings, unlike algorithms, can generate new ideas on our own, so we, at least, can evaluate our biases.
See also: Bill Dembski: How humans can thrive in a world of increasing automation. “We aim to show society a positive way forward in adapting to machines, putting them in the service of humanity rather than thwarting our higher aspirations.”