P-value — Glossary Aria Research

Extended definition

The p-value is the probability, computed under the null hypothesis $H_0$, of obtaining a test statistic at least as extreme as the one observed in the data. In notation, for a test in which larger values of the statistic count as more extreme:

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

where $T$ is the test statistic and $t_{\text{obs}}$ is its value in the observed sample. The concept was popularized by Fisher (1925) within the logic of significance testing and was later folded into the Neyman-Pearson decision framework; these are two logically distinct traditions that contemporary practice often blends. The p-value is not the probability that the null hypothesis is true, nor the probability that the result is due to chance, nor the size of the effect. It measures only how incompatible the data are with the null model. In 2016 the American Statistical Association (ASA) issued a formal statement articulating six principles on the use and misuse of p-values, the first time in its 177-year history that it had taken a position on a specific matter of statistical practice, against the backdrop of the reproducibility crisis in the empirical sciences.
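As a minimal sketch of the definition, assume a one-sided z-test in which $T$ is standard normal under $H_0$. The sample numbers and the helper names `z_statistic` and `one_sided_p_value` are illustrative assumptions, not taken from any study. Under those assumptions the p-value is simply the upper-tail probability beyond $t_{\text{obs}}$:

```python
from statistics import NormalDist


def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Standardized distance between the sample mean and the null mean mu0."""
    return (sample_mean - mu0) / (sigma / n ** 0.5)


def one_sided_p_value(t_obs: float) -> float:
    """P(T >= t_obs | H0), assuming T ~ N(0, 1) under the null hypothesis."""
    return 1.0 - NormalDist().cdf(t_obs)


# Hypothetical data: H0 says the mean is 10.0; sigma is assumed known.
t_obs = z_statistic(sample_mean=10.4, mu0=10.0, sigma=1.5, n=36)
p = one_sided_p_value(t_obs)
print(f"t_obs = {t_obs:.2f}, p = {p:.4f}")  # t_obs = 1.60, p = 0.0548
```

The result says that, if $H_0$ were true, a statistic of 1.60 or larger would occur about 5.5% of the time. It says nothing about the probability that $H_0$ itself is true.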

When it applies

The p-value is appropriate in formal hypothesis tests where a well-specified null hypothesis is stated before data collection, ideally backed by preregistration or solid theory. It is useful for communicating the strength of evidence against a specific reference model, not for confirming an alternative hypothesis. Responsible interpretation recognizes that p-values are continuous: a p of 0.049 and a p of 0.051 carry essentially the same evidence, even though they fall on opposite sides of the 0.05 threshold.
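To make the continuity point concrete, the sketch below converts two-sided p-values back into the z statistic that would produce them. It assumes a standard normal test statistic, and the function name `z_for_two_sided_p` is hypothetical:

```python
from statistics import NormalDist


def z_for_two_sided_p(p: float) -> float:
    """|z| that yields two-sided p-value p, assuming T ~ N(0, 1) under H0."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


z_a = z_for_two_sided_p(0.049)  # just below the 0.05 threshold
z_b = z_for_two_sided_p(0.051)  # just above the 0.05 threshold
print(f"z for p=0.049: {z_a:.3f}")
print(f"z for p=0.051: {z_b:.3f}")
print(f"difference:    {z_a - z_b:.3f}")
```

Both statistics sit within about 0.01 of the familiar 1.96 cutoff and differ from each other by less than 0.02, so labeling one result "significant" and the other "not significant" draws a sharp line through nearly identical data.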

When it does not apply

It does not serve Bayesian inference, where the posterior probability is the quantity of interest. It does not serve large-scale exploratory analysis without correction for multiple comparisons: testing 100 hypotheses whose nulls are all true produces about 5 false positives on average at $\alpha = 0.05$. It does not replace the effect size or the confidence interval in results reporting. Studies with very large samples can yield tiny p-values for clinically irrelevant effects; studies with small samples can miss real effects and return a high p. The 2016 ASA statement cautions against basing scientific conclusions solely on whether a p-value crosses a threshold, and a 2019 editorial in The American Statistician went further, recommending that the “statistically significant / not significant” dichotomy be abandoned.
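The multiple-comparisons arithmetic can be checked with a small simulation. This is a sketch under stated assumptions: every null hypothesis is true, each test statistic is an independent standard normal draw, and the Bonferroni correction (dividing $\alpha$ by the number of tests) is used only as one familiar example of a correction, not as the recommended one. All names and constants are illustrative.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
N_TESTS = 100       # hypotheses tested per simulated study
N_REPEATS = 2000    # simulated studies
STD_NORMAL = NormalDist()


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic, assuming N(0, 1) under H0."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def count_false_positives(rng: random.Random, threshold: float) -> int:
    """Run N_TESTS tests where every null is true; count rejections."""
    return sum(
        two_sided_p(rng.gauss(0.0, 1.0)) < threshold for _ in range(N_TESTS)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
mean_uncorrected = sum(
    count_false_positives(rng, ALPHA) for _ in range(N_REPEATS)
) / N_REPEATS

rng = random.Random(0)
mean_bonferroni = sum(
    count_false_positives(rng, ALPHA / N_TESTS) for _ in range(N_REPEATS)
) / N_REPEATS

# Analytic chance of at least one false positive among independent tests.
prob_at_least_one = 1.0 - (1.0 - ALPHA) ** N_TESTS

print(f"mean false positives, uncorrected: {mean_uncorrected:.2f}")
print(f"mean false positives, Bonferroni:  {mean_bonferroni:.3f}")
print(f"P(at least one false positive):    {prob_at_least_one:.4f}")
```

The uncorrected average should land close to $100 \times 0.05 = 5$, while the chance of at least one false positive is above 99%. This is why an uncorrected exploratory scan almost always "finds" something.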

Applications by field

- Biomedical sciences: center of the evidence framework in clinical trials; primary target of post-reproducibility-crisis reforms.
- Social sciences and psychology: epicenter of the replication crisis; preregistration and correction for multiple tests now required by serious journals.
- Engineering and quality control: technical use in statistical process control, with established conventions.
- Machine learning: marginal use; performance metrics and cross-validation replace frequentist hypothesis testing.

Common pitfalls

The first pitfall is interpreting p as the “probability that $H_0$ is true”, which is the fallacy of the transposed conditional and formally incorrect. The second is p-hacking: testing multiple variables, models, or subgroups until p < 0.05 appears, without reporting the failed attempts. The third is HARKing (Hypothesizing After Results are Known): constructing a post-hoc narrative as if the hypothesis had been formulated in advance. The fourth is conflating statistical significance with practical relevance: at $n = 10{,}000$, even trivial differences can yield very small p-values. The fifth is treating the 0.05 threshold as ontological truth rather than arbitrary convention; Fisher proposed 0.05 as “convenient,” not as a sacred line. Instead of a binary verdict, the ASA recommends reporting the effect size, the confidence interval, and the context of the hypothesis.
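The fourth pitfall can be illustrated numerically. The sketch below assumes a two-sample z-test with a known common standard deviation and reads $n = 10{,}000$ as the size of each group. The 0.05-standard-deviation difference, the comparison group size of 100, and the function name `two_sample_z_p_value` are all hypothetical choices made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sample_z_p_value(mean_diff: float, sigma: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means with known common sigma."""
    standard_error = sigma * (2.0 / n_per_group) ** 0.5
    z = mean_diff / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


TRIVIAL_DIFF = 0.05  # observed difference: 5% of one standard deviation
SIGMA = 1.0

p_large = two_sample_z_p_value(TRIVIAL_DIFF, SIGMA, n_per_group=10_000)
p_small = two_sample_z_p_value(TRIVIAL_DIFF, SIGMA, n_per_group=100)

print(f"n = 10,000 per group: p = {p_large:.6f}")
print(f"n = 100 per group:    p = {p_small:.4f}")
```

The observed difference is identical in both runs. Only the sample size changes, yet the p-value moves from well above 0.5 to well below 0.001, which is why the effect size and its confidence interval need to be reported alongside p.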