Missing data and multiple imputation — Glossary Aria Research

Extended definition

Treatment of missing data in quantitative research first requires classifying the missingness mechanism, a distinction formalized by Rubin (1976). Under MCAR (Missing Completely At Random), the probability of missingness is independent of every variable, observed or unobserved. Under MAR (Missing At Random), the probability of missingness depends only on observed variables, not on unobserved ones. Under MNAR (Missing Not At Random), the probability of missingness depends on the unobserved values themselves (e.g., patients who drop out because they feel worse). Classical strategies have limits: complete-case analysis (listwise deletion) discards information and is generally unbiased only under MCAR, so it can be biased under MAR; single imputation (mean, last value) underestimates uncertainty because imputed values are treated as if they had been observed. Multiple Imputation (MI), formalized by Rubin (1987, Multiple Imputation for Nonresponse in Surveys, Wiley), generates $m$ complete datasets by sampling from the posterior predictive distribution of the missing values conditional on the observed data; the analysis is repeated on each of the $m$ datasets; and the results are combined via Rubin’s rules:

$$\bar{Q} = \frac{1}{m}\sum_{l=1}^m \hat{Q}_l, \quad T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$

where $\hat{Q}_l$ is the estimate from the $l$-th imputed dataset, $\bar{Q}$ is the combined point estimate, $\bar{U}$ the mean within-imputation variance, $B$ the between-imputation variance, and $T$ the total variance, which reflects both ordinary sampling variability and the extra uncertainty due to missingness. van Buuren (2018, Flexible Imputation of Missing Data, 2nd ed., Chapman & Hall/CRC) consolidated the modern practical reference; the mice package in R is the most widely used implementation.
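The pooling step can be sketched in a few lines of Python. This is a minimal illustration, not the code of mice or any other package. It assumes that each imputed dataset has already produced a point estimate $\hat{Q}_l$ and a squared standard error $U_l$, and it uses the standard definitions $\bar{U} = \frac{1}{m}\sum_l U_l$ and $B = \frac{1}{m-1}\sum_l (\hat{Q}_l - \bar{Q})^2$. The function name `pool_rubin` and the input numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates, each with its own variance")
    q_bar = mean(estimates)        # combined point estimate
    u_bar = mean(variances)        # mean within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, u_bar, b, t


# Hypothetical results from m = 5 imputed datasets (illustrative numbers only).
estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
variances = [0.5, 0.5, 0.5, 0.5, 0.5]

q_bar, u_bar, b, t = pool_rubin(estimates, variances)
print(f"Q_bar={q_bar:.2f}  U_bar={u_bar:.2f}  B={b:.2f}  T={t:.2f}")
```

With these inputs the between-imputation variance (2.5) dominates the within-imputation variance (0.5), so most of the total variance comes from disagreement between imputations, that is, from the missing data. The factor $1 + 1/m$ shrinks toward 1 as $m$ grows, which is one reason larger $m$ gives more stable variance estimates.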

When it applies

Multiple imputation applies in studies with a non-trivial proportion of missing data (as a rule of thumb, more than 5%) under a plausible MAR mechanism. It applies in clinical trials with loss to follow-up, surveys with item non-response, and administrative data with unfilled fields. It applies especially when the incomplete variable $X_{\text{missing}}$ correlates with observed variables, because mice exploits that structure to predict the missing values. It applies in meta-analyses where primary data have substantive gaps. CONSORT requires reporting how missingness was handled in trials, and STROBE does the same for observational studies; ICMJE likewise values transparency. Single imputation (mean, last observation carried forward) is accepted only in complementary sensitivity analyses, not as the primary analysis.
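The full workflow (impute $m$ times, analyse each completed dataset, pool) can be sketched in plain Python for the simplest case: one incomplete variable `y` whose missingness depends on a fully observed variable `x`. This is a sketch under loud assumptions, not the algorithm used by mice. The data are simulated, all function names are hypothetical, and the imputation model is a linear regression refitted on a bootstrap sample of the complete cases, used here as a simple stand-in for drawing the model parameters from their posterior.

```python
import random
from statistics import mean, variance


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


def impute_once(xs, ys, rng):
    """Return one completed copy of ys, filling each None with a random draw."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Assumption: a bootstrap refit approximates parameter uncertainty.
    boot = [rng.choice(observed) for _ in observed]
    a, b, sigma = fit_line([x for x, _ in boot], [y for _, y in boot])
    # Add residual noise so imputed values vary like real observations.
    return [y if y is not None else a + b * x + rng.gauss(0, sigma)
            for x, y in zip(xs, ys)]


def mi_mean(xs, ys, m=20, seed=0):
    """Estimate the mean of y by multiple imputation; return (Q_bar, T)."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(xs, ys, rng)
        estimates.append(mean(completed))
        variances.append(variance(completed) / len(completed))  # squared SE
    between = variance(estimates)
    return mean(estimates), mean(variances) + (1 + 1 / m) * between


# Simulated MAR data: y is more likely to be missing when x > 0.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys_full = [2 + x + rng.gauss(0, 1) for x in xs]  # true mean of y is 2
ys = [None if rng.random() < (0.8 if x > 0 else 0.1) else y
      for x, y in zip(xs, ys_full)]

cc_mean = mean(y for y in ys if y is not None)
q_bar, total_var = mi_mean(xs, ys)
print(f"complete-case mean: {cc_mean:.2f}")
print(f"MI mean: {q_bar:.2f} (SE {total_var ** 0.5:.3f})")
```

Because cases with high `x` (and therefore high `y`) are lost more often, the complete-case mean falls below the true value of 2, while the imputation model uses the observed `x` to restore the missing part of the distribution. In practice a dedicated package handles several incomplete variables at once, which this sketch does not attempt.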

When it does not apply

It does not apply under MNAR without specific modeling: standard MI assumes MAR, and under MNAR the alternatives are pattern-mixture models, selection models, or sensitivity analysis. It does not apply as a substitute for good data collection: preventing loss is better than imputing. It does not apply to datasets with an extreme proportion of missingness (>50%), where the observed structure is insufficient to inform imputation. It does not apply blindly to variables with restricted logic (e.g., derived variables, conditional indicators): imputation must respect data constraints. It does not replace sensitivity analysis when MAR is a doubtful assumption: modern practice is to report a primary analysis with MI alongside alternative analyses that assume different mechanisms.
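Why the MAR assumption matters can be seen in a small simulation. The sketch below is illustrative only: the data-generating model, the missingness probabilities, and the function names are assumptions made for the example. It compares the complete-case mean of `y` with a mean adjusted for the observed covariate `x`, which uses the same information an MAR-based imputation model would rely on.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def simulate(mechanism, n=20000, seed=0):
    """Return (complete-case mean, x-adjusted mean) of y; the true mean is 0."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    kept = []
    for x, y in zip(xs, ys):
        if mechanism == "MCAR":
            p_missing = 0.5                      # unrelated to x and y
        elif mechanism == "MAR":
            p_missing = 0.9 if x > 0 else 0.1    # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.9 if y > 0 else 0.1    # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            kept.append((x, y))
    obs_x = [x for x, _ in kept]
    obs_y = [y for _, y in kept]
    intercept, slope = fit_line(obs_x, obs_y)
    # Predict y at the mean of x over ALL cases, since x is fully observed.
    adjusted = intercept + slope * mean(xs)
    return mean(obs_y), adjusted


results = {name: simulate(name) for name in ("MCAR", "MAR", "MNAR")}
for name, (cc, adj) in results.items():
    print(f"{name}: complete-case mean {cc:+.2f}, x-adjusted mean {adj:+.2f}")
```

Under MCAR both estimates sit near the true value of 0. Under MAR the complete-case mean is biased, but adjusting for `x` removes the bias. Under MNAR the adjusted estimate stays biased, because no observed variable fully explains why values are missing. That is the situation in which pattern-mixture models, selection models, or sensitivity analyses are needed.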

Applications by field

- Health: clinical trials with loss to follow-up; FDA standards require transparent reporting.
- Survey research: item non-response; organizations such as Pew and Gallup use MI routinely.
- Economics and social sciences: panel data, where missingness is frequently non-random.
- Epidemiological research: electronic health records with missing fields; claims databases.

Common pitfalls

The first pitfall is assuming MCAR without testing: Little’s (1988) test and inspection of missingness patterns help, but confirmation requires domain knowledge. The second is using single imputation (mean, median, LOCF) as the primary analysis: it underestimates standard errors and inflates the false-positive rate. The third is choosing too small an $m$: $m = 5$ was the classical recommendation, but $m = 20$ or more is modern practice for confidence-interval precision, especially with substantive missingness. The fourth is failing to include the outcome variable in the imputation model: imputing covariates without the outcome introduces systematic bias toward the null. The fifth is imputing created variables (interactions, aggregates) rather than raw variables: imputation should generally occur at the most primitive level, with transformations derived afterward.
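The second pitfall can be checked with simple arithmetic. The sketch below uses made-up numbers (three observed values, two missing) to show how mean imputation shrinks the apparent standard error: the filled-in values add nothing to the spread but are counted in the sample size.

```python
from statistics import mean, variance

observed = [1.0, 2.0, 3.0]   # hypothetical observed values
n_missing = 2                # hypothetical number of missing values

# Mean imputation: every missing value is replaced by the observed mean.
fill = mean(observed)
completed = observed + [fill] * n_missing

# Standard error of the mean, before and after imputation.
se_observed = (variance(observed) / len(observed)) ** 0.5
se_imputed = (variance(completed) / len(completed)) ** 0.5

print(f"variance: {variance(observed):.2f} -> {variance(completed):.2f}")
print(f"standard error: {se_observed:.3f} -> {se_imputed:.3f}")
```

The sample variance halves (from 1.0 to 0.5) and the standard error falls from about 0.58 to about 0.32, although no new information was collected. Multiple imputation avoids this by drawing imputed values with random variation and adding the between-imputation variance $B$ to the total.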