There is a replicability crisis in science – unidentified “false positives” are pervading even our top research journals.
A false positive is a claim that an effect exists when in actuality it doesn’t. No one knows what proportion of published papers contain such incorrect or overstated results, but there are signs that the proportion is not small.
The epidemiologist John Ioannidis gave the best explanation for this phenomenon in a famous paper in 2005, provocatively titled “Why most published research findings are false”. One of the reasons Ioannidis gave for so many false results has come to be called “p hacking”, which arises from the pressure researchers feel to achieve statistical significance.
What is statistical significance?
To draw conclusions from data, researchers usually rely on significance testing. In simple terms, this means calculating the “p value”, which is the probability of obtaining results at least as extreme as ours if there really is no effect. If the p value is sufficiently small, the result is declared to be statistically significant.
Traditionally, a p value of less than .05 is the criterion for significance. If you report a p<.05, readers are likely to believe you have found a real effect. Perhaps, however, there is actually no effect and you have reported a false positive.
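To make the .05 criterion concrete, here is a minimal Python sketch; it is an illustration, not a method taken from any particular study. It assumes two groups drawn from the same normal distribution (so there truly is no effect), a known standard deviation of 1 and a simple two-sided z-test. The function names `p_value` and `false_positive_rate` are hypothetical, chosen only for this example.

```python
import random
from statistics import NormalDist, mean


def p_value(group_a, group_b, sigma=1.0):
    """Two-sided z-test p value for a difference in means.

    Assumption for illustration: both groups share a known standard
    deviation `sigma`, so the test statistic is exactly normal.
    """
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies=10000, n_per_group=20, alpha=0.05, seed=1):
    """Fraction of simulated no-effect studies that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: there is no effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if p_value(a, b) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print(false_positive_rate())
```

Under these assumptions the printed rate should come out close to .05: even when every study is run honestly and there is no effect at all, roughly one study in 20 reaches p<.05 by chance.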
Many journals will only publish studies that can report one or more statistically significant effects. Graduate students quickly learn that achieving the mythical p<.05 is the key to progress: to obtaining a PhD and to the ultimate goal of publication in a good journal.
This pressure to achieve p<.05 leads to researchers cutting corners, knowingly or unknowingly, for example by p hacking.
The lure of p hacking
To illustrate p hacking, here is a hypothetical example.
Bruce has recently completed a PhD and has landed a prestigious grant to join one of the top research teams in his field. His first experiment doesn’t work out well, but Bruce quickly refines the procedures and runs a second study. This looks more promising, but still doesn’t give a p value of less than .05.
Convinced that he is onto something, Bruce gathers more data. He also decides to drop a few results that looked clearly way off.
He then notices that one of his measures gives a clearer picture, so he focuses on that. A few more tweaks and Bruce finally identifies a slightly surprising but really interesting effect that achieves p<.05. He carefully writes up his study and submits it to a good journal, which accepts his report for publication.
Bruce tried so hard to find the effect that he knew was lurking somewhere. He was also feeling the pressure to hit p<.05 so he could declare statistical significance, publish his finding and taste sweet success.
There is only one catch: there was actually no effect. Despite the statistically significant result, Bruce has published a false positive.
Bruce felt he was using his scientific insight to reveal the lurking effect as he took various steps after starting his study:
- He collected further data.
- He dropped some data that seemed aberrant.
- He dropped some of his measures and focused on the most promising.
- He analysed the data a little differently and made a few further tweaks.
The trouble is that all these choices were made after seeing the data. Bruce may, unconsciously, have been cherry-picking: selecting and tweaking until he obtained the elusive p<.05. Even when there is no effect, such selecting and tweaking might easily find something in the data for which p<.05.
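A rough simulation shows how quickly such choices add up. The Python sketch below is a simplified illustration under stated assumptions, not a model of any real study. It mimics only two of Bruce's steps: choosing the most promising of several measures, and collecting further data when nothing is significant. It assumes four independent measures, no true effect on any of them, a known standard deviation of 1 and a two-sided z-test. The names `honest_study`, `hacked_study` and `rate` are hypothetical.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05


def p_value(a, b, sigma=1.0):
    """Two-sided z-test p value, assuming a known standard deviation."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def draw(rng, n):
    """n observations from a population in which there is no effect."""
    return [rng.gauss(0, 1) for _ in range(n)]


def honest_study(rng, n=20):
    """One measure and one sample size, both fixed in advance; one test."""
    return p_value(draw(rng, n), draw(rng, n)) < ALPHA


def hacked_study(rng, n=20, extra=10, n_measures=4):
    """Try several measures; if none 'works', add data and try again."""
    groups = [(draw(rng, n), draw(rng, n)) for _ in range(n_measures)]
    if any(p_value(a, b) < ALPHA for a, b in groups):
        return True  # report whichever measure reached p < .05
    # Nothing significant yet, so collect further data and re-test.
    groups = [(a + draw(rng, extra), b + draw(rng, extra)) for a, b in groups]
    return any(p_value(a, b) < ALPHA for a, b in groups)


def rate(study, n_studies=5000, seed=1):
    """Fraction of simulated studies that declare a significant effect."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(n_studies)) / n_studies


if __name__ == "__main__":
    print("honest:", rate(honest_study))
    print("hacked:", rate(hacked_study))
```

With four independent measures, the first look alone gives a chance of about 1 − 0.95⁴ ≈ .19 that at least one measure reaches p<.05, and the second look after adding data pushes the rate higher still. The honest version, with its single pre-planned test, should stay close to .05.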