The term “machine learning” can mean different things, but the one we hear most about in the media is supervised learning. Supervised learning is a bit like performing a science experiment.
We provide the machine learning algorithm with a lot of data with known outcomes and run the algorithm on the data to derive general principles that predict those outcomes. The neat thing about machine learning is that the algorithm can extract general principles from the dataset that can then be applied to new problems. It is like the story that Newton observed an apple fall and then derived from it the general law of gravity that applies to the entire universe. Machine learning is very helpful when we have a lot of data to sort and it would take too long to do the job by hand.
Supervised learning works by giving the algorithm data in the form of a list of features and a classification. As an example, let’s say we are trying to classify whether an athlete is a basketball player or a jockey. The algorithm is trained on two features: height and weight. Each data item is given a label, either “basketball player” or “jockey.” The algorithm then infers a model from the data: Basketball players are tall and heavy and jockeys are short and light.
The algorithm then applies this model to new data to make predictions. For instance, given data that an athlete is short and light, the algorithm uses its model to predict that the athlete is a jockey. Likewise, when presented with data indicating that an athlete is tall and heavy, the algorithm infers that the athlete is a basketball player.
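The idea can be sketched in a few lines of code. The heights, weights, and the nearest-neighbor rule below are illustrative assumptions, not any particular algorithm from the article:

```python
# Made-up training data: ((height cm, weight kg), label).
TRAINING_DATA = [
    ((201, 102), "basketball player"),
    ((208, 110), "basketball player"),
    ((198, 98),  "basketball player"),
    ((157, 51),  "jockey"),
    ((160, 53),  "jockey"),
    ((152, 49),  "jockey"),
]

def predict(features):
    """Label a new athlete with the label of the closest training example."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min((distance(features, f), lbl) for f, lbl in TRAINING_DATA)
    return label

print(predict((155, 50)))   # short and light -> "jockey"
print(predict((205, 105)))  # tall and heavy -> "basketball player"
```

The “model” here is simply proximity to known examples; a real algorithm would compress the data into something more general, as the article goes on to discuss.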
So, how is the algorithm able to derive the model that discriminates between the two kinds of athletes based on the data? One simple way is to memorize the training data. Then, if a new item of data happens to match a line in the training data, the algorithm returns the label that has already been provided. This method is quick and dirty, though: because it can only label data it has seen before, we would need an enormous dataset to correctly identify many new athletes.
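A minimal sketch of this pure-memorization approach, again with made-up figures, shows the weakness directly. Any input that does not exactly match a memorized example gets no useful answer:

```python
# The "model" is just a lookup table of exact training examples.
memory = {
    (201, 102): "basketball player",
    (157, 51):  "jockey",
}

def predict(features):
    # Return the stored label on an exact match; otherwise the
    # algorithm simply does not know.
    return memory.get(features, "unknown")

print(predict((157, 51)))  # exact match -> "jockey"
print(predict((158, 51)))  # off by one centimeter -> "unknown"
```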
Another way is to divide the dataset into a training portion and a testing portion. The algorithm uses the training portion to derive a simpler prediction model and then tests it on the testing portion. The cleverness of this scheme is that the algorithm can no longer get the correct answer by cheating, i.e., by simply memorizing the training set. The training and testing sets are different, so the algorithm needs a generally applicable model to make correct predictions on both. In this way, if the algorithm finds a model that performs well on the test data, we have high confidence it can be applied to data outside our overall sample. In other words, returning to our athlete example, if the algorithm derives the model that jockeys are short and light and basketball players are tall and heavy, then the model is likely to do well on whatever portion of the data it is tested on, because the height and weight thresholds are a general characteristic of the two athlete types.
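The train/test scheme can be sketched as follows. The data, the half-and-half split, and the one-number “model” (a height threshold halfway between the classes’ average heights) are all illustrative assumptions:

```python
# Made-up (height cm, weight kg) figures for each class.
players = [(201, 102), (208, 110), (198, 98), (203, 104)]
jockeys = [(157, 51), (160, 53), (152, 49), (158, 50)]

# Split each class in half: first two examples train, last two test.
train = [(f, "basketball player") for f in players[:2]] + [(f, "jockey") for f in jockeys[:2]]
test  = [(f, "basketball player") for f in players[2:]] + [(f, "jockey") for f in jockeys[2:]]

def mean_height(examples, label):
    heights = [f[0] for f, lbl in examples if lbl == label]
    return sum(heights) / len(heights)

# The learned "model": a threshold halfway between the average heights.
threshold = (mean_height(train, "basketball player") + mean_height(train, "jockey")) / 2

def predict(features):
    return "basketball player" if features[0] > threshold else "jockey"

# Evaluate only on the held-out testing portion.
accuracy = sum(predict(f) == lbl for f, lbl in test) / len(test)
print(f"held-out accuracy: {accuracy:.0%}")  # -> 100% on this toy data
```

Because the threshold captures a genuine difference between the two athlete types, the model scores well on examples it never saw during training, which is exactly the confidence the split is designed to give us.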
But, when creating the general model, we must beware of ‘data snooping’. We will find out about this seemingly innocent but harmful practice next time!
Part 1 in this series by Eric Holloway is The challenge of teaching machines to generalize. Teaching students simply to pass tests provides a good illustration of the problems. We want the machine learning algorithms to learn general principles from the data we provide and not merely little tricks and nonessential features that score high but ignore problems.
For more general background on machine learning:
Part 1: Navigating the machine learning landscape. To choose the right type of machine learning model for your project, you need to answer a few specific questions (Jonathan Bartlett)
Part 2: Navigating the machine learning landscape — supervised classifiers. Supervised classifiers can sort items like posts to a discussion group or medical images, using one of many algorithms developed for the purpose. (Jonathan Bartlett)