In the 1990s, functional magnetic resonance imaging (fMRI), which images the brain in action via blood flow, seemed like a dream come true. The medical and social science researchers who flocked to use it will not be happy with a recent study of its limitations: there was little meaningful agreement among seventy research teams from around the world about what their results meant.
In an article aptly titled “Seventy Teams of Scientists Analysed the Same Brain Data, and It Went Badly,” a neuroscientist fills us in:
The group behind the Nature paper set a simple challenge: they asked teams of volunteers to each take the same set of fMRI scans from 108 people doing a decision-making task, and use them to test nine hypotheses of how brain activity would change during the task. Their goal was simply to test how many teams agreed on which hypotheses had significant evidence and which did not. The Neuroimaging Analysis Replication Study (NARPS) was born.
The task was simple too, cutting down on the complexity of the analysis. Lying in the scanner, you’d be shown the two potential outcomes of a coin-flip: if it comes up heads, you’d lose $X dollars; if tails, you’d win $Y dollars. Your decision is whether to accept or reject that gamble; accept it and the (virtual) coin is flipped, and your winnings adjusted accordingly. The clever bit is that the difference between the loss and win amount is varied on every trial, testing your tolerance for losing. And if you’re like most people, you have a strong aversion to losing, so will only regularly accept gambles where you could win at least twice as much as you lose.

From this simple task sprung those nine hypotheses, equally simple. Eight about how activity in a broad region of the brain should go up or down in response to wins or losses; one a comparison of changes within a brain region during wins and losses. And pretty broad regions of the brain too — a big chunk of the prefrontal cortex, the whole striatum, and the whole amygdala. Simple task, simple hypotheses, unmissably big chunks of brain — simple to get the same answer, right? Wrong.

Mark Humphries, “Seventy Teams of Scientists Analysed the Same Brain Data, and It Went Badly” at Medium (July 1, 2020)
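The decision rule the quoted passage describes can be sketched in a few lines of Python. This is purely illustrative: the function names and the 2× loss-aversion ratio are our shorthand for "most people only regularly accept gambles where they could win at least twice as much as they lose," not code from the study.

```python
import random

# Illustrative sketch of the mixed-gamble trial logic described above.
# The names and the 2.0 ratio are our assumptions, not from the paper.
LOSS_AVERSION = 2.0  # accept only if the win is at least twice the loss

def accept_gamble(win: float, loss: float, ratio: float = LOSS_AVERSION) -> bool:
    """Decision rule: accept when the potential win outweighs the potential loss."""
    return win >= ratio * loss

def run_trial(win: float, loss: float) -> float:
    """If the gamble is accepted, flip a fair virtual coin; otherwise payoff is 0."""
    if not accept_gamble(win, loss):
        return 0.0
    return win if random.random() < 0.5 else -loss

print(accept_gamble(win=20, loss=10))  # True: the win is twice the loss
print(accept_gamble(win=12, loss=10))  # False: ratio below the tolerance
```

Varying the win/loss amounts across trials, as the experimenters did, probes where each subject's acceptance threshold actually sits.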
No two teams followed the same path in analyzing the data. When the results were tallied, only one of the nine hypotheses (Hypothesis 5) was found significant by over 80% of the teams. Three hypotheses were considered significant by only about 5% of the teams. That may mean that the majority thought them incorrect. But it is unsettling that 5% is exactly the rate chance alone would produce. Then there were the remaining five hypotheses, for each of which between roughly 20% and 35% of teams reported a significant effect. As Humphries puts it, “Nine hypotheses: one agreed as correct; three rejected; five in limbo. Not a great scorecard for 70 teams looking at the same data.” The study’s results differed sharply from the “wildly over-optimistic” predictions of greater agreement.
That’s helpful to remember when social scientists engage in mind reading about, say, a tendency to violence, using fMRI data.
Some reactions from other science media:
● At Nature, a number of suggested fixes were offered:
The fact that each team’s results were so pipeline-dependent is highly problematic, particularly because the exact configuration of analytical pipelines is often poorly described in research articles. Moreover, sensitivity analyses — which assess how different pipeline choices might affect an experiment’s outcome — are rarely performed in neuroimaging. However, Botvinik-Nezer and colleagues offer several reasonable suggestions for addressing the concerns that their work will raise.

Martin Lindquist, “Neuroimaging results altered by varying analysis pipelines” at Nature (May 20, 2020)
● We learn from The Scientist that the problems are longstanding:
Neuroimaging, specifically functional magnetic resonance imaging (fMRI), which produces pictures of blood flow patterns in the brain that are thought to relate to neuronal activity, has been criticized in the past for problems such as poor study design and statistical methods, and specifying hypotheses after results are known (SHARKing), says neurologist Alain Dagher of McGill University who was not involved in the study.

Ruth Williams, “Research Teams Reach Different Results From Same Brain-Scan Data” at The Scientist (May 20, 2020)
One of the NARPS paper’s authors, neuroscientist Tom Schonberg of Tel Aviv University, is quoted in The Scientist as admitting, “it wasn’t easy seeing this variability, but it was what it was.”
● Another of the researchers, Doug Schultz, discussed the findings for his university community:
As the predominant method of brain mapping, fMRI has helped researchers investigate questions of cognition, emotion and function since its current form emerged in the early 1990s.
But interpreting the colorful readouts of gray matter is never as black-and-white as TV and film have made it seem, Schultz said…
Reason No. 1? The sheer number of decisions and steps involved in what is an inherently involved process, Schultz said. During the study — just as they would be in conducting their own studies — teams were free to decide what magnitude of blood flow actually constituted statistically meaningful brain activity…
That’s all before getting into logistical issues and the choices they spawn. Brains are rarely the same size or shape, so comparing them requires standardizing those variables — a process that, ironically, can be done to varying degrees and in any number of ways. There’s also the fact people don’t hold their heads perfectly still while undergoing an MRI, which introduces its own complications. If a person’s head shifted two millimeters during a certain frame, should the data be kept or discarded? Is that distance more or less important than the number of shifts? Just how many of those are acceptable?

Scott Schrage, “Gray matter? Study finds differing interpretations of brain maps” at University Communication, University of Nebraska–Lincoln (June 16, 2020)
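The keep-or-discard choice Schultz describes can be sketched in miniature. The displacement values and both thresholds below are invented for illustration; the point is only that two defensible motion cutoffs hand different data to every downstream statistic.

```python
# Hypothetical illustration of one pipeline choice from the passage above:
# discard frames where the head moved more than some threshold. All numbers
# here are made up for the sake of the example.
frame_displacement_mm = [0.1, 0.4, 2.3, 0.2, 1.1, 3.0, 0.3]

def kept_frames(displacements, threshold_mm):
    """Keep only frames whose head motion stays under the chosen threshold."""
    return [d for d in displacements if d <= threshold_mm]

# Two equally defensible thresholds retain different amounts of data,
# so the statistics computed afterward start from different inputs:
strict = kept_frames(frame_displacement_mm, threshold_mm=0.5)
lenient = kept_frames(frame_displacement_mm, threshold_mm=2.0)
print(len(strict), len(lenient))  # 4 vs 5 frames survive
```

Multiply this by dozens of such choices (spatial smoothing, normalization, software package, statistical threshold) and the 70-teams result becomes less surprising.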
Humphries closes his article at Medium by reminding us that the rest of neuroscience suffers from similar issues:
These crises should give any of us working on data from complex pipelines pause for serious thought. There is nothing unique to fMRI about the issues they raise. Other areas of neuroscience are just as bad. We can do issues of poor data collection: studies using too few subjects plague other areas of neuroscience just as much as neuroimaging. We can do absurdly high correlations too; for one thing if you use a tiny number of subjects then the correlations have to be absurdly high to pass as “significant”; for another most studies of neuron “function” are as double-dipped as fMRI studies, only analysing neurons that already passed some threshold for being tuned to the stimulus or movement studied.

Mark Humphries, “Seventy Teams of Scientists Analysed the Same Brain Data, and It Went Badly” at Medium (July 1, 2020)
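Humphries' point about tiny samples forcing "absurdly high" correlations follows directly from the standard significance test for a Pearson correlation: the smallest |r| that reaches p < 0.05 (two-tailed) with n subjects is r = t / √(t² + df), where df = n − 2 and t is the critical t value. A quick sketch (using standard table values for t):

```python
from math import sqrt

# Smallest Pearson |r| reaching p < 0.05 (two-tailed) for n subjects:
# r_crit = t / sqrt(t^2 + df), df = n - 2. The t values below are
# standard two-tailed critical values, t(0.975, df).
T_CRIT = {3: 3.182, 8: 2.306, 18: 2.101, 28: 2.048}

def r_crit(n: int) -> float:
    """Minimum 'significant' correlation for a sample of n subjects."""
    df = n - 2
    t = T_CRIT[df]
    return t / sqrt(t * t + df)

for n in (5, 10, 20, 30):
    print(f"n={n:2d}: |r| must exceed {r_crit(n):.3f}")
```

With five subjects, only correlations above about 0.88 clear the bar; with thirty, about 0.36. So the eye-popping correlations in small-n studies are partly a selection artifact of the threshold itself.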
We’ll know in a decade or so how many fixes were attempted and with what results. Meanwhile, the take-home point is that fMRI is not a magical mind-reading machine. So when we read stories about fMRI studies of compassion, jealousy, vengefulness, or voter intentions, we had best keep these findings in mind.
Note: The paper (Botvinik-Nezer, R., Holzmeister, F., Camerer, C.F. et al., “Variability in the analysis of a single neuroimaging dataset by many teams,” Nature 582, 84–88 (2020), https://doi.org/10.1038/s41586-020-2314-9) requires a subscription, but here is the Abstract:
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
See also:

● Why brain activity doesn’t reveal our minds. There is poor correlation between different scans of even the same person’s brain, experienced researchers say.

● Up! Up! Up! Keep those hippocampi up! The lighter side of neurobabble about our brains and ourselves. (Michael Egnor)