When my younger children go to the movies, I always ask them what the movie was about when they get back.
Invariably, rather than summarizing key facts about the movie, they start recounting every last detail. Sometimes, they include information about the people sitting near them, the food, or even some other random story that they want to tell me.
However, when I ask what the movie is about, I’m not asking for a complete reiteration of everything they saw. If I wanted that, I would watch it myself.
Instead, I want the overall gist, a generalization. What was the theme? Who were the good guys and the bad guys? What was the biggest twist? I don’t need to know every joke or gag in the movie, just the basics.
Similarly, when I am teaching from a textbook, I often provide a summary sheet for my students of what they should really be taking away from the book. While it may take an author 20 pages to lay out the context of a problem, describe how to work out the problem, and then show examples, the main points can often be summarized in less than half a page.
Now, let’s say that my students asked me for a summary of a 20-page chapter, and I handed them a 30-page summary. Have I successfully summarized the chapter? Clearly not. It would have to mean either that (a) I included additional material that wasn’t in the chapter or that (b) I didn’t actually understand the chapter well.
In machine learning, it has long been recognized that, of two models with the same accuracy on a given set of data, the shorter one is more likely to be correct. This principle was originally formalized by Ray Solomonoff (1926–2009) and is typically called Solomonoff induction. Solomonoff induction quantifies Occam’s razor, which says that we shouldn’t add unnecessary complexity to theories. Under Solomonoff induction, if two models predict the same data and one of them is smaller, then the larger one contains unnecessary complexity. It is unnecessary because another model with the same predictive power does without it. Solomonoff induction has spawned a number of machine learning frameworks, including Occam learning, PAC learning, and others.
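As a minimal sketch (the function name here is mine, not part of the Solomonoff formalism), the razor reduces to a tie-break on description length when two models are equally accurate:

```python
def occam_preferred(model_a_bits: int, model_b_bits: int) -> str:
    """Given two models that predict the same data equally well,
    prefer the one with the shorter description (fewer bits)."""
    return "A" if model_a_bits <= model_b_bits else "B"

# A 190-bit model beats a 260-bit model with identical predictions.
print(occam_preferred(260, 190))  # prints "B"
```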
One issue with Solomonoff induction, however, is that it does not provide any minimum requirement for a model. Although we can compare two models, A and B, and determine which one is better, no procedure tells us whether either model is good enough on its own. That is, do we have sufficient justification to think that model A or B will be a good predictor of future behavior? How would we know?
The answer to that question, as published in the latest issue of Communications of the Blyth Institute, turns out to be surprisingly simple. The data set itself can be thought of as a model. That is, if you are trying to develop a model that describes 1,000 points, those 1,000 points themselves constitute a model of the data with perfect accuracy. Therefore, for a model to count as a generalization of the data, it must be shorter than the data itself. So, while we can still compare models with each other, we also have an absolute criterion to check against: the size of the data itself.
In the paper, we define the difference between the size of the data and the size of the model as the amount of “generalized information” present in the model. In other words, if I have 500 bits of data and my model is 400 bits, then I have 100 bits of generalization.
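The bookkeeping here is plain subtraction; a short sketch (function name is mine, not from the paper) makes the criterion concrete:

```python
def generalized_information(data_bits: int, model_bits: int) -> int:
    """Bits of generalization: how much shorter the model is than the data.

    A positive value means the model actually compresses (generalizes) the
    data; zero or negative means it fails the absolute criterion of being
    shorter than the data it describes.
    """
    return data_bits - model_bits

# The article's example: 500 bits of data, a 400-bit model.
print(generalized_information(500, 400))  # prints 100
```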
The metric provided in the paper also accounts for imperfect models. Many machine learning models have the problem of “overfitting” the data. That is, the model winds up modeling the noise in the data rather than the signal. Imagine watching a noisy TV channel. If, when describing the picture, you also described the location of all of the noise as it comes across the screen, you would actually be giving your listener less important information than if you ignored the noise, even though your description would sacrifice some accuracy.
In our measure of generalization, imperfect models can still score well, but they are discounted according to the amount of error they contain. For instance, say we have a dataset that is 400 bits long. Program 1 is 100% accurate and is 260 bits long. Program 2 is 99% accurate but is only 190 bits long. Which model is better? We can use a concept called Active Information to compute a discounted value for how many bits of accuracy we should attribute to each program. While Program 1 predicts the full 400 bits, Program 2 turns out to accurately predict only 394 bits of information. From this we can figure out how much generalization each achieves. Program 1 predicts 400 bits using 260 bits, so it contains 140 bits of generalization. Program 2 predicts 394 bits using 190 bits, so it contains 204 bits of generalization. Therefore, despite being slightly less accurate, Program 2 more successfully generalizes the data.
This technique is implementation-agnostic; that is, the concept does not depend on the specific type of model you are using, whether neural networks, Bayesian inference, or anything else. Because it is based on simple concepts such as model sizes and data sizes, we believe that it can be straightforwardly applied to any machine learning system.
Paper: Bartlett, Jonathan and Eric Holloway. 2019. “Generalized Information: A Straightforward Method for Judging Machine Learning Models.” (open access)