The scientific method calls for the rigorous testing of plausible theories, ideally through randomized controlled trials. For example, a study of a COVID-19 vaccine might give the vaccine to 10,000 randomly selected people and a placebo to another 10,000, and compare the infection rates for the two groups. If the difference in the infection rates is too improbable to be explained by chance, then the difference is deemed statistically significant. How improbable? In the 1920s, the great British statistician Sir Ronald Fisher said that he favored a 5 percent threshold. So 5 percent became the Holy Grail.
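The two-group comparison described above can be sketched with a pooled two-proportion z-test. The infection counts below are hypothetical, chosen only to illustrate the calculation, not taken from any actual trial.

```python
# A minimal sketch of testing whether two infection rates differ by more
# than chance would allow, using only the standard library.
import math

def two_proportion_z(infected_a, n_a, infected_b, n_b):
    """Two-sided pooled z-test for a difference in infection rates."""
    p_a = infected_a / n_a
    p_b = infected_b / n_b
    pooled = (infected_a + infected_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# Hypothetical counts: 200 of 10,000 placebo recipients infected
# versus 100 of 10,000 vaccinated.
z, p = two_proportion_z(200, 10_000, 100, 10_000)
print(f"z = {z:.2f}, p = {p:.2g}")  # p falls far below Fisher's 5 percent threshold
```

A p-value below 0.05 is exactly what "statistically significant at the 5 percent level" means: a difference this large would arise by chance less than 5 percent of the time if the vaccine did nothing.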
Unfortunately, the establishment of a 5 percent hurdle for statistical significance has had the perverse effect of encouraging researchers to do whatever it takes to hit that target. An easy way to do that is to turn the scientific method on its head. Instead of starting with a well-defined theory, begin with the data. Sift through a large dataset looking for statistically significant patterns. This backward process goes by a variety of names, including data mining, data dredging, and fishing expeditions.
Computer algorithms are terrible at identifying logical theories and selecting appropriate data to test those theories, but they are really, really good at rummaging through data for statistically significant relationships. The problem is that the discovered patterns are usually coincidental. They vanish when tested with fresh data—a disappearing act that contributes to the replication crisis undermining the credibility of scientific research. A 2015 survey by Nature, one of the very best scientific journals, found that more than 70 percent of the researchers surveyed had tried and failed to reproduce another scientist’s experiment, and more than half had tried and failed to reproduce some of their own studies!
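How easily does rummaging through noise produce "significant" results? A minimal simulation (assumed parameters: 100 observations, 1,000 pure-noise predictors, and the rough 5 percent cutoff for a correlation coefficient) makes the point:

```python
# A sketch of a "fishing expedition": correlate one random target series
# with many random predictors and count nominally significant correlations.
import math
import random

random.seed(0)
n_obs, n_predictors = 100, 1000

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [random.gauss(0, 1) for _ in range(n_obs)]
# Approximate 5 percent two-sided cutoff for |r| with n observations: 1.96/sqrt(n)
cutoff = 1.96 / math.sqrt(n_obs)
hits = sum(
    1
    for _ in range(n_predictors)
    if abs(pearson_r([random.gauss(0, 1) for _ in range(n_obs)], target)) > cutoff
)
print(f"{hits} of {n_predictors} pure-noise predictors look 'significant'")
```

Roughly 5 percent of the predictors clear the bar even though every one of them is random noise—which is precisely why a pattern found this way means nothing by itself.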
When I was a young assistant professor at Yale, one of my senior colleagues, Nobel Laureate James Tobin, wryly observed that the bad old days when researchers had to do calculations by hand were actually a blessing. The effort was so great that people thought hard before calculating. They put theory before data. Today, with terabytes of data and lightning-fast computers, it is too easy to calculate first, think later.
Ironically, a Yale professor named Aleh Tsyvinski recently went on a massive data mining expedition with Yukun Liu (then a graduate student, now a professor himself at the University of Rochester). Even more ironically, Tsyvinski is the Arthur M. Okun Professor of Economics at Yale, an endowed chair named after Tobin’s close friend Arthur Okun. Okun was a consummate believer in economic analysis that makes sense and is useful.
Tsyvinski and Liu set out to find statistical correlations between bitcoin returns and other variables. There is no logical reason for bitcoin returns to be affected by anything other than guesses about future bitcoin returns. Unlike bonds that yield interest, stocks that yield dividends, apartments that yield rent, businesses that yield profits, and other real investments, bitcoin doesn’t yield anything at all, so there is no compelling way to value bitcoin the way investors can value bonds, stocks, apartments, businesses, and other real investments.
Attempts to correlate bitcoin returns with real economic variables are destined to disappoint. Yet Liu and Tsyvinski did exactly that, and their disappointments are instructive. To their credit, they did not hide their failed results; they were transparent about how many variables they considered. Eight hundred and ten! They couldn’t and wouldn’t have done this data mining if they had been forced to do the calculations by hand. With powerful computers, data mining was easy… too easy.
When considering so many variables, most researchers use a stricter standard than the usual 5-percent hurdle for statistical significance. Liu and Tsyvinski adjusted the hurdle, but in the wrong direction. They considered any association with less than a 10 percent chance of occurring by luck to be statistically significant.
There is no logical reason why the vast majority of their variables should be related to bitcoin returns: the Canadian dollar–U.S. dollar exchange rate, the price of crude oil, stock returns in the healthcare industry, and stock returns in the beer industry. Overall, Liu and Tsyvinski estimated 810 correlations between bitcoin returns and various variables, and found 63 relationships that were statistically significant at the 10 percent level. This is somewhat fewer than the 81 statistically significant relationships that would be expected if they had just correlated bitcoin returns with random numbers.
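A back-of-the-envelope binomial calculation shows why 63 significant correlations out of 810 is unimpressive:

```python
# With 810 independent tests at a 10 percent significance level, how many
# false positives would pure chance produce?
import math

tests, alpha = 810, 0.10
expected = tests * alpha                     # mean of the binomial distribution
sd = math.sqrt(tests * alpha * (1 - alpha))  # its standard deviation
print(f"expected false positives: {expected:.0f}, sd: {sd:.1f}")
```

Chance alone predicts about 81 false positives, give or take 8 or 9. The 63 significant correlations Liu and Tsyvinski found are actually about two standard deviations *below* what correlating bitcoin returns with random numbers would be expected to produce.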
Owen Rosebeck and I redid their analysis to see how many of these 63 relationships held up during the 14 months after they completed their study. Seven correlations continued to have the same signs and be statistically significant out-of-sample. Five of these reproducible correlations were for equations using bitcoin returns to predict Google searches for the word bitcoin, which are among the few logically plausible relationships they considered. Ironically, this finding is an argument against data mining and for focusing on logical relationships because these are the ones that are likely to endure.
For the hundreds of other relationships they considered, fewer than 10 percent were significant in-sample, and fewer than 10 percent of this 10 percent continued to be significant out-of-sample. Bitcoin returns had a statistically significant negative correlation with stock returns in the paperboard-containers-and-boxes industry that was confirmed with out-of-sample data. Should we conclude that a useful, meaningful relationship has been discovered? Or should we conclude that these findings are what might have been expected if all of the estimated equations had used random numbers with random labels instead of real variables?
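That "10 percent of 10 percent" survival rate is just what noise delivers. A sketch (assumed parameters: 100 in-sample and 100 out-of-sample observations, 2,000 pure-noise predictors, a rough 10 percent cutoff for a correlation coefficient) simulates the two-stage filter:

```python
# A sketch of why data-mined patterns vanish out of sample: re-test the same
# pure-noise correlations on fresh data and count how many "hold up"
# (same sign, still significant at the 10 percent level).
import math
import random

random.seed(1)
n_obs, n_predictors = 100, 2000
# Approximate 10 percent two-sided cutoff for |r| with n observations
cutoff = 1.645 / math.sqrt(n_obs)

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target_in = [random.gauss(0, 1) for _ in range(n_obs)]
target_out = [random.gauss(0, 1) for _ in range(n_obs)]

found, held_up = 0, 0
for _ in range(n_predictors):
    r_in = pearson_r([random.gauss(0, 1) for _ in range(n_obs)], target_in)
    if abs(r_in) > cutoff:  # "discovered" in-sample
        found += 1
        r_out = pearson_r([random.gauss(0, 1) for _ in range(n_obs)], target_out)
        if abs(r_out) > cutoff and (r_in > 0) == (r_out > 0):
            held_up += 1  # same sign and still significant on fresh data
print(f"{found} of {n_predictors} significant in-sample; {held_up} held up out-of-sample")
```

About 10 percent of the noise predictors look significant in-sample, and only a few percent of those survive the out-of-sample test—the same winnowing pattern the paperboard-containers-and-boxes correlation exhibits.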
The authors didn’t attempt to explain the patterns that they found: “We don’t give explanations, we just document this behavior.” Patterns without explanations are treacherous. A search for patterns in large databases will almost certainly discover some but the discovered patterns are likely to disappear when they are used to make predictions. What is the point of documenting temporary, serendipitous patterns?
We submitted our paper to a well-regarded journal and experienced a very interesting outcome. One of the reviewers said that we were “incorrect” when we tested only the 63 relationships that Liu and Tsyvinski found to be statistically significant. We should also have used the out-of-sample data to test the 747 relationships that were not statistically significant in-sample!
The reviewer wrote: “What if you found a significant result where they found nothing? That is worth knowing. This mistake is repeated several times throughout the paper.”
What a wonderful example of how researchers can be blinded by the lure of statistical significance. Here, in a study of the relationship between bitcoin returns and hundreds of unrelated economic variables, this reviewer, respected enough to be chosen by a good journal, believes that it is worth knowing whether statistically significant relationships can be found in some part of the data even if we already know that they don’t replicate in the rest of the data.
That is the pure essence of the replication crisis—the ill-founded belief that a researcher’s goal is to find statistically significant relationships, even if the researcher knows in advance that the relationships do not make sense and do not replicate.
A crucial step for tamping down the replication crisis is for researchers to recognize that statistical significance is not the goal. Real science is about real relationships that endure and can be used with fresh data to make useful predictions.