One technique to avoid data snooping is based on the intersection of information theory and probability: an object’s probability is related to its information content.
The greater an object’s information content, the lower its probability. We measure a model’s information content as a difference between two lengths: the number of bits needed to store the data as if it had occurred by chance (the negative base-2 logarithm of its probability under chance), minus the number of bits required to store the model, that is, the compressed description of the data. Two raised to the negative of that difference bounds the probability that the model’s fit occurred by chance. If the data cannot be compressed, then these two lengths are equal. Then the model has zero information and we cannot know if the data was generated by chance or not.
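Written out as a formula, the measure described above might look like the following. The symbols are labels introduced here for illustration, not notation from the original: D is the data, P_chance(D) is its probability under chance, L(M) is the number of bits needed to store the model, and I(M) is the model’s information content.

```latex
I(M) = -\log_2 P_{\text{chance}}(D) - L(M),
\qquad
\Pr[\text{fit arose by chance}] \le 2^{-I(M)}
```

When the data cannot be compressed, L(M) equals -log2 P_chance(D), so I(M) = 0 and the bound becomes 2^0 = 1, which rules nothing out.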
For an example of a dataset that is incompressible and uninformative, swirl some tea leaves in hot water until they are randomly distributed. Describing the positions of the tea leaves precisely requires a very long description, so their arrangement cannot tell us anything. On the other hand, the distribution of ink on a book page can be concisely described with a language, like English, so it is informative.
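The contrast between tea leaves and a book page can be tried out with an off-the-shelf compressor. The sketch below is only an illustration under stated assumptions: zlib stands in for an ideal compressor, random bytes from os.urandom stand in for the tea leaves, and a repeated English sentence stands in for the book page. The function name bits_saved is made up for this example.

```python
import os
import zlib


def bits_saved(data: bytes) -> int:
    """Return raw size minus zlib-compressed size, both in bits.

    A value near or below zero means zlib found no pattern to exploit.
    """
    raw_bits = 8 * len(data)
    compressed_bits = 8 * len(zlib.compress(data, 9))
    return raw_bits - compressed_bits


# Assumption: random bytes play the role of the swirled tea leaves.
tea_leaves = os.urandom(4096)

# Assumption: a repeated sentence plays the role of the printed page.
book_page = (b"The quick brown fox jumps over the lazy dog. " * 100)[:4096]

print("tea leaves, bits saved:", bits_saved(tea_leaves))
print("book page, bits saved: ", bits_saved(book_page))
```

The random bytes should save essentially nothing (zlib usually adds a few bytes of overhead), while the text should shrink to a small fraction of its raw size. A real page of English compresses less dramatically than a repeated sentence, but the direction of the difference is the same.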
If the data can be compressed, resulting in a small model, then the difference is positive. If the data can be highly compressed, say by N bits, then it is very unlikely that the data was generated by chance. We can quantify this precisely: the probability that chance alone produced data compressible by N bits is less than 2 to the power of negative N (2^-N). The probability drops rapidly as N increases, so a large N means a very small probability that we’ve made a mistake in modeling the data.
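To get a feel for how quickly the 2^-N bound shrinks, here is a small sketch that tabulates it for a few values of N. The function names are hypothetical, chosen for this example only. The base-10 logarithm is included because 2^-N becomes too small for ordinary floating point once N reaches the low thousands.

```python
import math


def chance_bound(n_bits: int) -> float:
    """Upper bound 2**-N on the probability that chance yields N bits of compression."""
    if n_bits <= 0:
        return 1.0  # no compression, so no evidence against chance
    return 2.0 ** -n_bits


def chance_bound_log10(n_bits: int) -> float:
    """The same bound as a base-10 logarithm, which stays finite for large N."""
    return -max(n_bits, 0) * math.log10(2)


for n in (0, 1, 10, 20, 100):
    print(f"N = {n:>3} bits -> bound = {chance_bound(n):.3e}")
```

Saving just 10 bits already puts the bound below one in a thousand, and 20 bits puts it below one in a million.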
However, a large N means more than that we’ve correctly modeled the data. The critical point of a large N is that our model can predict new data that we have not yet seen. In other words, because there is such a small probability that our model is mistaken about the data we have, there is a very small probability that it will be mistaken about data we do not have.
This approach to finding concise models for the data is a formalization of Occam’s famed “Razor.” Medieval philosopher William of Occam (c. 1287–1347) observed that a good explanation will not “multiply entities beyond necessity,” that is, “of two competing theories, the simpler explanation of an entity is to be preferred.” On the other hand, the razor must not cut too close. As a saying often attributed to Einstein puts it, “simplify as much as possible, but no further.”
Next we will learn how we can use Occam’s razor to generalize.
Machine learning isn’t difficult; just different. A few simple principles open many doors:
Part 1 in this series by Eric Holloway is The challenge of teaching machines to generalize. Teaching students simply to pass tests provides a good illustration of the problems. We want the machine learning algorithms to learn general principles from the data we provide and not merely little tricks and nonessential features that score high but ignore problems.
Part 2: Supervised Learning. Let’s start with the most common type of machine learning, distinguishing between something simple, like big and small.
Part 3: Don’t snoop on your data. You risk using a feature for prediction that is common to the dataset, but not to the problem you are studying.
For more general background on machine learning:
Part 1: Navigating the machine learning landscape. To choose the right type of machine learning model for your project, you need to answer a few specific questions (Jonathan Bartlett)
Part 2: Navigating the machine learning landscape: supervised classifiers. Supervised classifiers can sort items like posts to a discussion group or medical images, using one of many algorithms developed for the purpose. (Jonathan Bartlett)