A recent Nature commentary put forward a provocative view on the use of p-values, a hotly debated subject in science at the moment. It asks the scientific community to consider retiring the concept of statistical significance, citing its misleading nature and a toxic culture of misuse. More than 800 signatories around the globe endorsed this manifesto.
What is a p-value?
For those who aren't familiar with hypothesis testing in statistics, study results with a p-value below 0.05 are declared “statistically significant.” Strictly speaking, the p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true; a small value is taken as evidence against the null hypothesis, not as the probability that the null hypothesis is true.
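To make the definition concrete, here is a toy sketch of my own (not an example from the commentary): the one-sided p-value for observing 60 or more heads in 100 tosses of a coin that the null hypothesis assumes is fair.

```python
from math import comb

def binomial_p_value(n, k, p_null=0.5):
    """One-sided p-value: the probability of seeing k or more
    successes in n trials, assuming the null (success rate = p_null)."""
    return sum(comb(n, i) * p_null**i * (1 - p_null)**(n - i)
               for i in range(k, n + 1))

p = binomial_p_value(100, 60)  # ~0.028, so "statistically significant"
```

Since 0.028 < 0.05, convention would label the result significant, even though a fair coin produces 60 or more heads in about one run out of 35.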
The first uses of the p-value date back more than three centuries, to a calculation of the statistical significance of the human sex ratio at birth. It soon became a popular measure for testing hypotheses across fields of scientific research, from physics to psychology.
Nowadays, statistical significance has become one of the most important criteria for evaluating the validity of study results. Peer reviewers often rely on p-values to determine whether a result is worth publishing in a scientific journal.
How can a p-value be misinterpreted?
To start, one need not look far: the p-value is routinely misremembered as the probability that the null hypothesis is true. It is not; it is the probability of observing data at least as extreme as yours if the null hypothesis were true. Say a certain statistical analysis reveals that your data show p < 0.05. Even strong evidence against the null hypothesis does not necessarily mean that your own hypothesis is correct.
According to the Nature commentary, a thorough analysis of 791 scientific publications revealed that the label of “statistically significant” is too often misinterpreted to mean “the study worked.”
The authors of the commentary also drew on two real-life clinical studies of the side effects of an anti-inflammatory medication. Both studies arrived at the same risk ratio, yet one concluded with a statistically significant outcome while the other did not: one study's 95% confidence interval excluded the no-effect value, while the other's, being wider, included it.
In this case, the two sets of results, one declared to be statistically significant and the other not statistically significant, aren't necessarily contradictory. What's more, the mislabeling of results can cause genuine effects to be dismissed, and make non-relevant results appear to be prominent.
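The commentary's point can be sketched numerically. The counts below are hypothetical (the real studies' data are not reproduced here): two studies estimate exactly the same risk ratio, yet only the larger, more precise study yields a 95% confidence interval that excludes 1 (no effect).

```python
import math

def risk_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Risk ratio of group A vs. group B with an approximate 95% CI
    (log-normal approximation for the standard error of log RR)."""
    rr = (events_a / total_a) / (events_b / total_b)
    se = math.sqrt(1 / events_a - 1 / total_a
                   + 1 / events_b - 1 / total_b)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts: both studies observe the SAME risk ratio (~1.43)
small = risk_ratio_ci(40, 1000, 28, 1000)    # CI straddles 1: "not significant"
large = risk_ratio_ci(200, 5000, 140, 5000)  # CI excludes 1: "significant"
```

The point estimates agree completely; only the widths of the intervals differ, which is exactly why labeling one study a success and the other a failure misreads the evidence.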
The bigger problem: p-hacking
It is not uncommon for scientists to skew the design of their experiments in order to obtain “statistical significance,” or as low a p-value as possible, even when doing so defies basic scientific principles.
Alternatively, after the experiments are conducted and all data are collected, statistical analysis methods can be selected not for their scientific merit but for their ability to generate small p-values. Data cherry-picking, including only the results that make the statistical analysis look good, is an equally notorious offense. Such manipulation of statistics has serious consequences, ranging from poor reproducibility of study data and false claims to the erosion of the public's trust in science.
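One flavor of p-hacking, testing many outcomes and reporting only the one that clears the threshold, can be quantified directly. Assuming independent tests in which every null hypothesis is actually true, the chance of at least one spurious “significant” result grows quickly:

```python
def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive when n_tests independent
    true-null hypotheses are each tested at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

family_wise_error(1)   # ~0.05
family_wise_error(20)  # ~0.64: testing 20 outcomes nearly guarantees a "hit"
```

With 20 measured outcomes and no correction, a researcher is more likely than not to find something to report, which is why selective testing and selective reporting are so corrosive.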
The rationale behind this type of behavior? The “publish or perish” doctrine governs the academic universe, so scientists feel pressure to get their studies published, which is more likely to happen if they can report “statistically significant” results.
The bottom line
This is not the first time a large number of scientists have been frustrated with the misinterpretation of statistical analyses and the misuse of p-values. Back in 2017, a group of 72 researchers and statisticians voiced their concerns over this matter.
In response to the commentary, some lauded the effort behind the manifesto, while others were highly critical of the provocative argument and the gesture behind it, worrying that abandoning p-values would give rise to unchecked bias and bad science.
Solutions to the p-value problem, as proposed in correspondence to the commentary, include reporting additional metrics, modifying the hypothesis tests themselves, and lowering the significance threshold to enhance reproducibility.
In the end, numbers and statistics did not fail us. The problem is where the scientific community chooses to place its values. Should we continue to trust a rigid binary system (statistically significant or not statistically significant) that has been applied, unintentionally or deliberately, to misinterpret research findings, or start searching for approaches that treat evidence as a continuum and discourage cheating? The answer couldn't be more obvious.
How Scientists Manipulate Research With P-Value (Seeker)