Statistical power — Glossary Aria Research

Extended definition

Statistical power is the probability that a statistical test correctly rejects the null hypothesis $H_0$ when it is, in fact, false. In notation, Power = $1 - \beta$, where $\beta$ is the type II error probability (false negative: failing to reject a false $H_0$). The concept was formalized in the Neyman-Pearson framework in the 1930s and operationalized for the behavioral sciences by Jacob Cohen in the canonical manual Statistical Power Analysis for the Behavioral Sciences (2nd ed., 1988), which popularized the convention of $1 - \beta = 0.80$ as the minimum acceptable standard. Power depends on four interrelated factors: expected effect size, sample size, significance level $\alpha$ (typically 0.05), and data variance. Fixing any three determines the fourth. A priori analysis computes the $n$ needed to detect an expected effect with a desired minimum power. Post hoc analysis (computing power from the observed effect) is mathematically possible but widely criticized, because it provides no information beyond the $p$-value.
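
The a priori calculation described above can be sketched for the simplest case. This is a minimal illustration, not a replacement for dedicated software: it assumes a two-sided comparison of two independent group means, equal group sizes, an effect expressed as Cohen's $d$ (mean difference divided by the common standard deviation), and the normal approximation $n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ per group. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    d     : expected standardized effect size (Cohen's d), assumed nonzero
    alpha : significance level
    power : desired power, 1 - beta
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the two-sided test
    z_beta = z.inv_cdf(power)             # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Cohen's conventional small, medium, and large effects
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions the sketch gives about 393, 63, and 25 participants per group. Calculations based on the noncentral $t$ distribution, as used by dedicated tools, typically return slightly larger values (for example 64 rather than 63 for $d = 0.5$).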

When it applies

A priori power analysis is expected in any rigorous research design: sample size planning, funding proposals, study preregistration. In preregistration contexts (OSF, AsPredicted, ClinicalTrials.gov), researchers are asked to justify the planned sample size, and a power analysis is the standard way to do so. It is essential in clinical trials, where an underpowered study is ethically questionable (it exposes participants to procedures without a real chance of detecting an effect) and an oversized one wastes resources. Software such as G*Power, pwr (R), and statsmodels (Python) offers implementations for the most common scenarios. In meta-analysis, pooling several small studies can yield adequate aggregate power even where each individual study was underpowered, which is one justification for combining them.

When it does not apply

Post hoc power analysis (computed from the observed effect, not the expected one) has questionable value and is discouraged by contemporary statisticians: Hoenig and Heisey (2001) showed that observed power is a direct function of the $p$-value and so merely echoes it. In exploratory designs without a specific pre-formulated hypothesis, power analysis is not meaningful, because the aim is to discover effects, not to confirm them. In Bayesian designs, analogous concepts (Bayesian power, Bayes factor projection) replace classical frequentist power. In big data problems, power is rarely the limiting concern: with very large $n$, any non-null effect eventually reaches significance, so the discussion must shift to effect size and practical relevance.
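
The point about post hoc power can be made concrete with a small sketch. It assumes a two-sided $z$-test and treats the observed effect as if it were the true effect, which is exactly what post hoc power does. The function name `observed_power` is hypothetical.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post hoc ("observed") power of a two-sided z-test, computed
    from the p-value alone. Requires 0 < p_value <= 1.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)     # critical value of the test
    z_obs = z.inv_cdf(1 - p_value / 2)    # |z| statistic implied by the p-value
    # Probability of landing in either rejection region if the true
    # standardized effect equalled the observed one.
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

No other input is needed: once $\alpha$ is fixed, the $p$-value determines observed power completely. A result with $p = \alpha$ has observed power of about 0.5, and every non-significant result has observed power below that, so "low observed power" adds nothing to "non-significant".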

Applications by field

- Clinical trials: power analysis is a regulatory requirement (FDA, EMA, ANVISA); $n$ is calculated from the minimum clinically relevant effect.
- Experimental psychology: the replication crisis documented the prevalence of underpowered studies; APA reporting standards now call for an a priori justification of sample size.
- Neuroscience: Button et al. (2013) estimated median power of roughly 8% to 31% across the studies they surveyed and concluded that many findings in the literature are unlikely to replicate.
- Education research: pedagogical interventions require large samples to detect the small-to-moderate effects typical of the field.

Common pitfalls

The first pitfall is trusting “rules of thumb” for sample size instead of a formal calculation: $n = 30$ or $n = 100$ does not replace analysis. The second is running the a priori analysis with an overly optimistic expected effect (taken from a literature inflated by publication bias), which yields a sample that is too small when the true effect is more modest. The third is running a post hoc analysis and interpreting the result as “the study had low power, hence I did not reject $H_0$”, a documented logical error. The fourth is ignoring variance and design: power in factorial ANOVA, hierarchical models, and survival analysis requires design-specific calculations, not a single simple formula. The fifth is confusing power with replicability: a well-powered study has a higher chance of producing replicable results, but high power does not guarantee replicability, because systematic biases not captured by the analysis may exist.
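
The second pitfall can be illustrated numerically. The sketch below assumes a two-sided, two-sample comparison with equal group sizes and the normal approximation; the function name `power_two_sample` and the effect sizes are illustrative assumptions, not values from any particular study. It takes a study planned with 63 participants per group, roughly what the normal approximation gives for $d = 0.5$ at 80% power, and asks what power it actually has if the true effect is $d = 0.3$.

```python
import math
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means
    (normal approximation, equal group sizes).

    d           : true standardized effect size (Cohen's d)
    n_per_group : participants in each group
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)   # expected value of the z statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


n = 63  # planned under the optimistic assumption d = 0.5
print(f"power if d = 0.5: {power_two_sample(0.5, n):.2f}")
print(f"power if d = 0.3: {power_two_sample(0.3, n):.2f}")
```

Under these assumptions, power falls from about 0.80 to about 0.39: a study designed to the conventional standard would miss the more modest true effect more often than it detects it.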