DATA & STATISTICS

Linear regression

Statistical model estimating the linear relationship between a dependent variable and one or more independent variables. Methodological foundation of much of applied statistics and pedagogical entry point for more complex predictive models.

Extended definition

Linear regression is the statistical model that estimates the linear relationship between a dependent variable $y$ and one or more independent variables $x_1, x_2, \ldots, x_p$. The canonical formulation of the simple model (a single predictor) is:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where $\beta_0$ is the intercept, $\beta_1$ the slope, and $\epsilon_i$ the stochastic error, typically assumed to be independent and normally distributed with zero mean and constant variance. Classical estimation uses ordinary least squares (OLS), which minimizes the sum of squared residuals. The concept dates back to Galton (1886), in a study of hereditary stature: the word “regression” derives from the observation that children of tall parents tended to be closer to the population mean than their parents were (“regressing to mediocrity”). The modern multiple regression formulation extends the model to several predictors simultaneously, and is evaluated with $R^2$, $t$-tests for individual coefficients, the $F$-test for the overall model, and residual diagnostics.
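As a minimal sketch of OLS for the simple model, the slope and intercept that minimize the sum of squared residuals have a closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the fitted line passes through the point of means. The function name `fit_simple_ols` and the simulated data are illustrative assumptions, not part of the original entry.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for y = b0 + b1*x fitted by OLS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    residuals = y - (intercept + slope * x)
    # R^2: share of the variation in y accounted for by the fitted line.
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)
    return intercept, slope, r_squared

# Made-up data for illustration: true line y = 2 + 3x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)

b0, b1, r2 = fit_simple_ols(x, y)
print(f"intercept={b0:.2f}, slope={b1:.2f}, R^2={r2:.3f}")
```

In practice a statistics library would also report standard errors and the $t$ and $F$ statistics; this sketch covers only the point estimates and $R^2$.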

When it applies

Linear regression is appropriate when there is reason to expect an approximately linear relationship between predictors and response, with residuals that are approximately normal and of constant variance. It is the usual starting model for quantitative analysis of relationships among continuous variables, and the base on which more complex techniques (logistic regression, mixed models, regularized regression, hierarchical models, SEM) are built. It is used across practically all empirical sciences.
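The linearity and constant-variance conditions can be probed numerically as well as visually. The sketch below is a rough heuristic, not a formal test: it fits a line and reports (a) the correlation between the residuals and the squared, centred fitted values, which should be near zero when the relationship is linear, and (b) the ratio of residual variance between the upper and lower halves of the fitted values, which should be near 1 under constant variance. The function name `residual_diagnostics` and all simulated data are assumptions made for illustration.

```python
import numpy as np

def residual_diagnostics(x, y):
    """Fit y = b0 + b1*x and return (curvature, variance_ratio) as rough checks."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    fitted = intercept + slope * x
    residuals = y - fitted
    # Curvature check: leftover U-shaped structure shows up as correlation
    # between the residuals and the squared, centred fitted values.
    centred = fitted - fitted.mean()
    curvature = np.corrcoef(residuals, centred ** 2)[0, 1]
    # Spread check: compare residual variance for large vs small fitted values.
    order = np.argsort(fitted)
    half = x.size // 2
    lower, upper = residuals[order[:half]], residuals[order[half:]]
    variance_ratio = upper.var(ddof=1) / lower.var(ddof=1)
    return curvature, variance_ratio

# Three made-up data sets: well behaved, curved, and heteroscedastic.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
y_linear = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
y_curved = 1.0 + 0.5 * x ** 2 + rng.normal(0.0, 1.0, size=x.size)
y_hetero = 1.0 + 2.0 * x + rng.normal(0.0, 0.3 * x)

for name, y in [("linear", y_linear), ("curved", y_curved), ("hetero", y_hetero)]:
    curvature, ratio = residual_diagnostics(x, y)
    print(f"{name}: curvature={curvature:.2f}, variance_ratio={ratio:.2f}")
```

A plot of residuals against fitted values remains the primary diagnostic; these two numbers only summarize what such a plot would show.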

When it does not apply

Linear regression does not apply when the relationship is strongly non-linear and no transformation linearizes it, when the response variable is categorical (logistic regression is the alternative) or a count (Poisson or negative binomial regression), when residuals are severely autocorrelated (time series require specific models), or when other assumptions are seriously violated and no transformation remedies the violation. With many correlated predictors, OLS estimates become unstable; ridge, lasso, or elastic net are alternatives. For causal inference from observational data, linear regression alone is insufficient: quasi-experimental methods or causal DAGs are needed.
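The instability of OLS under correlated predictors, and the stabilizing effect of ridge, can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are made up, the predictors and response are centred so that no intercept is needed, and the penalty value of 1.0 is arbitrary rather than tuned. The helper names `ols_coefficients` and `ridge_coefficients` are hypothetical.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least-squares coefficients for centred data (no intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def ridge_coefficients(X, y, penalty):
    """Ridge solution (X'X + penalty * I)^-1 X'y for centred data."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

# Two nearly identical predictors; the true coefficients are both 1.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=1.0, size=n)

# Centre predictors and response so the intercept drops out.
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = ols_coefficients(X, y)
beta_ridge = ridge_coefficients(X, y, penalty=1.0)
print("OLS  :", beta_ols)
print("ridge:", beta_ridge)
```

The data determine the sum of the two coefficients well but say almost nothing about their difference, so the OLS estimates can land far from 1 in opposite directions, while the ridge penalty pulls the poorly determined difference toward zero.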

Applications by field

Health and biomedical sciences: analysis of factors associated with continuous outcomes (blood pressure, quality-of-life score, biomarkers).

Applied social sciences: models of wages, academic performance, satisfaction, with covariate control.

Engineering and physics: fitting linear physical models to experimental data; instrument calibration.

Economics and finance: risk factor models, time series regressions (with autocorrelation correction).

Common pitfalls

The first pitfall is assuming linearity without visual inspection: plots of residuals versus fitted values are essential. The second is ignoring multicollinearity among predictors: highly correlated variables inflate standard errors and make individual coefficients hard to interpret (the variance inflation factor, VIF, is the standard diagnostic). The third is interpreting a coefficient as a causal effect in observational data without controlling for confounders, an endemic risk in the social sciences. The fourth is trusting a high $R^2$ without checking assumptions: an $R^2$ of 0.90 with heteroscedastic residuals or systematic residual patterns indicates a misspecified model, not a good one. The fifth is extrapolating outside the observed range: a linear model fitted on $x \in [10, 50]$ has no guarantee of validity at $x = 100$.
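The VIF mentioned in the second pitfall can be computed directly from its definition: for predictor $j$, regress it on the remaining predictors (with an intercept), take the resulting $R_j^2$, and set $\mathrm{VIF}_j = 1 / (1 - R_j^2)$. The following is a minimal sketch of that definition; the function name `variance_inflation_factors` and the simulated predictors are illustrative assumptions.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Made-up predictors: a and b are independent, c is a noisy copy of a.
rng = np.random.default_rng(3)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + rng.normal(scale=0.1, size=200)
predictors = np.column_stack([a, b, c])

print(variance_inflation_factors(predictors))
```

A value near 1 indicates a predictor that is nearly uncorrelated with the others; thresholds such as 5 or 10 are common rules of thumb rather than strict cut-offs.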
