How can we make decisions using models created after we have viewed the data? Last time we discussed using Occam’s Razor to shave after-the-fact models. But what’s the next step?
When we make a complex decision, we must use a number of observations. Each time we make a correct observation, the range of possibilities for our final decision is reduced. Think of the game of Clue:
> The classic detective game! In Clue, players move from room to room in a mansion to solve the mystery of: who done it, with what, and where? Players are dealt character, weapon, and location cards after the top card from each card type is secretly placed in the confidential file in the middle of the board. Players must move to a room and then make an accusation against a character, saying they did it in that room with a specific weapon. The player to the left must show one of the accused cards to the accuser if it is in that player’s hand. Through deductive reasoning, each player must figure out which character, weapon, and location are in the secret file. To do this, each player must uncover what cards are in other players’ hands by making more and more accusations. Once a player knows what cards the other players are holding, they will know what cards are in the secret file. A great game for those who enjoy reasoning and thinking things out. (Boardgamegeek.com)
So once we find out that the murder was not done with a noose in the kitchen by Ms. Peacock, we have reduced the range of hypotheses about the case. But not by much.
When we make our observations, we must be careful to ask the right questions. If we are consistently told that it was not Ms. Peacock in a variety of proposed scenarios, it makes sense to try a new suspect for our next guess. In other words, some guesses provide more information than other guesses. Which guess we make is determined by what we have learned so far. The overall goal is to maximize the amount of information we learn while minimizing the number of guesses.
This series of branching questions can be compared to a branching tree. We can call it a decision tree, where each question we ask is a branch on the tree. The goal is to make the most accurate decisions using the smallest tree possible.
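The idea of a tree of yes/no questions can be sketched in a few lines of code. This is a minimal illustration, and the questions and answers below are invented purely for the sketch; the point is that the tree's "size" can be measured as the longest chain of questions from root to answer.

```python
# A decision tree as nested tuples: (question, yes_subtree, no_subtree).
# Leaves (plain strings) are final answers. The questions are hypothetical,
# chosen only to illustrate the structure.
tree = ("Is it an animal?",
        ("Does it bark?", "dog", "cat"),
        ("Is it edible?", "apple", "rock"))

def worst_case_questions(node):
    """Tree size measured as the longest path of questions to an answer."""
    if isinstance(node, str):   # a leaf: an answer, no further questions
        return 0
    _, yes, no = node
    return 1 + max(worst_case_questions(yes), worst_case_questions(no))

print(worst_case_questions(tree))  # 2
```

A smaller tree means fewer questions in the worst case, which is exactly the sense in which we want our decision procedure to be concise.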
About now you might notice a similarity between the earlier discussion of predictive models and this discussion of the decision tree. In both cases, we want to represent the data in the most concise way possible. So, is there a correlation between the smallest decision tree we can construct and the decision procedure that results in the most accurate guess? Yes, there is.
Here’s a simplified example. We’ve narrowed down our Clue possibilities to the following:
- It is the Purple suspect with the candlestick weapon in the kitchen.
- It is the Green suspect with the candlestick weapon in the kitchen.
- It is the Blue suspect with the pistol weapon in the kitchen.
- It is the Red suspect with the pistol weapon in the kitchen.
Since the room is the kitchen in every scenario, we have two categories left to guess in, suspect and weapon, and we want to minimize the number of guesses we expect to make.
If we pick the suspect category (four suspects) to make our guesses, then in the worst case we will need three guesses to find the murderer. If we choose the weapon category (two weapons), then we will need only two guesses in total: one to identify the weapon and one to pick between the two suspects who match it. Using the weapon category for our first guess and the suspect category for the second creates a more concise tree than guessing suspects alone. This procedure fits our goal of minimizing the size of our tree while maximizing accuracy.
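The counting above can be checked with a short sketch. It assumes that among n equally possible options, each wrong guess eliminates exactly one option, so at most n − 1 guesses are needed (the last option is known by elimination):

```python
# Worst-case guesses to identify one of n equally possible options:
# after n - 1 wrong guesses, the last option is known by elimination.
def worst_case_guesses(n_options):
    return n_options - 1

# The four remaining Clue scenarios (suspect, weapon); the room is
# the kitchen in every one, so it is omitted.
scenarios = [
    ("Purple", "candlestick"),
    ("Green",  "candlestick"),
    ("Blue",   "pistol"),
    ("Red",    "pistol"),
]

suspects = {s for s, _ in scenarios}
weapons  = {w for _, w in scenarios}

# Strategy 1: guess suspects directly (4 options -> 3 guesses worst case).
by_suspect = worst_case_guesses(len(suspects))

# Strategy 2: guess the weapon first (2 options -> 1 guess), then the
# suspect among those matching that weapon (2 options -> 1 guess).
suspects_per_weapon = max(
    sum(1 for _, w in scenarios if w == weapon) for weapon in weapons)
by_weapon = (worst_case_guesses(len(weapons))
             + worst_case_guesses(suspects_per_weapon))

print(by_suspect, by_weapon)  # 3 2
```

Guessing by weapon first halves the remaining suspects at the first step, which is why it yields the shallower tree.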
At this point we should also look at entropy. Entropy is a concept from information theory that characterizes the effective number of choices available under a probability distribution. (Strictly speaking, Shannon entropy is measured in bits; the “number of choices” reading is its exponential, sometimes called the perplexity.) If there are four possibilities and the probability distribution over them is uniform, then the entropy, in this sense, is four. If the distribution is non-uniform, then the entropy is less than four. That means there are not really four equal possibilities, because some possibilities are less likely than others.
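A minimal sketch of this idea, computing Shannon entropy in bits and then its exponential, which gives the “effective number of choices” reading used here. The example distributions are assumptions chosen for illustration:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def effective_choices(probs):
    """Perplexity 2**H: the effective number of equally likely choices."""
    return 2 ** entropy_bits(probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely possibilities
skewed  = [0.70, 0.10, 0.10, 0.10]   # one possibility dominates

print(effective_choices(uniform))  # 4.0 -- all four choices genuinely open
print(effective_choices(skewed))   # less than 4 -- effectively fewer choices
```

A uniform distribution over four outcomes gives the full four effective choices; any skew toward some outcomes pulls that number down.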
In our Clue example, we can see that picking a good tree also matches up with picking the category that minimizes entropy. If we pick the suspect category, then we end up with entropy of 2.8, but if we pick the weapon category, we get entropy of 2.0. With fewer possibilities, there is a smaller probability of being wrong. Thus, lower entropy relates to a greater chance of being correct.
But, why not just use probability as a decision-making tool? The benefit of entropy over probability is that it has a nice connection with model size. We will deal with that next time.
Machine learning isn’t difficult; just different. A few simple principles open many doors:
Part 1 in this series by Eric Holloway is The challenge of teaching machines to generalize. Teaching students simply to pass tests provides a good illustration of the problems. We want the machine learning algorithms to learn general principles from the data we provide and not merely little tricks and nonessential features that score high but ignore problems.
Part 2: Supervised Learning. Let’s start with the most common type of machine learning, something simple like distinguishing big from small.
Part 3: Don’t snoop on your data. You risk using a feature for prediction that is common to the dataset, but not to the problem you are studying.
Part 4: Occam’s Razor Can Shave Away Data Snooping. The greater an object’s information content, the lower its probability. If there is only a small probability that our model is mistaken with the data we have, there is a very small probability that our model will be mistaken with data we do not have.
Part 5: Machine Learning: Using Occam’s Razor to generalize. A simple thought experiment shows us how. This approach is contrary to Fisher’s method where we formulate all theories before encountering the data. We came up with the model after we saw the cards we drew.
For more general background on machine learning:
Part 1: Navigating the machine learning landscape. To choose the right type of machine learning model for your project, you need to answer a few specific questions (Jonathan Bartlett)
Part 2: Navigating the machine learning landscape — supervised classifiers. Supervised classifiers can sort items like posts to a discussion group or medical images, using one of many algorithms developed for the purpose. (Jonathan Bartlett)