Logistic regression — Glossary Aria Research

Extended definition

Logistic regression is a statistical model for a categorical dependent variable that estimates the probability of belonging to a category as a logistic function of a linear combination of predictors. For a binary response ($Y \in \{0, 1\}$), the canonical form is:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

where $p = P(Y=1 \mid X)$ and $\log\left(\frac{p}{1-p}\right)$ is the logit (log-odds). Each coefficient $\beta_i$ is the change in log-odds per unit increase in $x_i$ with the other predictors held fixed; after exponentiation, $e^{\beta_i}$ is the corresponding odds ratio. Cox (1958, JRSS B) formalized logistic regression in the modern framework; Hosmer, Lemeshow, and Sturdivant (2013, Applied Logistic Regression) consolidated the standard technical reference. Variants include multinomial (more than two categories with no natural order, modeled with a softmax), ordinal (ordered categories, typically via the proportional-odds model), and conditional (matched case-control studies).
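The relationship between the logit, the probability, and the odds ratio can be sketched in a few lines of Python. This is a minimal illustration, not a fitting routine: the function names and the coefficient values are assumptions chosen for the example and do not come from any fitted model.

```python
import math
from typing import Sequence


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z: float) -> float:
    """Inverse of the logit, i.e. the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept: float, coefs: Sequence[float], x: Sequence[float]) -> float:
    """P(Y=1 | x) for a binary model: inv_logit(b0 + sum(b_i * x_i))."""
    z = intercept + sum(b_i * x_i for b_i, x_i in zip(coefs, x))
    return inv_logit(z)


# Hypothetical coefficients, chosen only for illustration.
b0, b = -2.0, [0.7, -0.3]

# Probability for one observation with x1 = 1, x2 = 2.
p = predict_proba(b0, b, [1.0, 2.0])

# Exponentiated coefficients are odds ratios per unit increase in each x_i.
odds_ratios = [math.exp(b_i) for b_i in b]
```

Raising $x_1$ by one unit while holding $x_2$ fixed multiplies the odds $p/(1-p)$ by `odds_ratios[0]`, whatever the starting point; the probability itself does not change by a fixed amount.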

When it applies

Logistic regression applies to any problem with a categorical outcome that must be modeled as a function of continuous or categorical predictors: disease presence or absence in epidemiology, success or failure of an intervention, vote or abstention, default or non-default in credit, binary classification in classical ML. It is the standard technique for association analysis in case-control studies (analytic epidemiology) and pairs well with odds-ratio confidence intervals. In ML, logistic regression serves as a strong baseline before more complex models (random forest, gradient boosting, neural networks), and it is often hard to beat on tabular problems with well-engineered features.
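For the simplest case-control setting, a single binary exposure, the odds ratio and a Wald-type confidence interval can be computed directly from the 2x2 table, which matches what a logistic regression with that one predictor would estimate. The sketch below assumes the usual large-sample standard error of the log odds ratio; the function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio with an approximate 95% Wald interval from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    Assumes all four cells are non-zero.
    """
    beta = math.log((a * d) / (b * c))        # log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # large-sample standard error
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return math.exp(beta), lower, upper


# Hypothetical counts, for illustration only.
or_hat, ci_low, ci_high = odds_ratio_ci(a=40, b=60, c=20, d=80)
print(f"OR = {or_hat:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is built on the log scale and then exponentiated, which is why it is asymmetric around the point estimate. With several covariates the same construction applies to each coefficient, using the standard errors reported by the fitting software.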

When it does not apply

Logistic regression is not appropriate for continuous dependent variables; use linear regression. For ordinal variables with more than 4-5 categories where the distance between categories is informative, ordinal models or linear regression may be more appropriate. With extreme class imbalance ($<5\%$ of one class), it should not be used without adjustments (Firth correction, downsampling, weighting). For data with a strong dependence structure (repeated measures, clustering), it requires extensions such as mixed models (GLMM), GEE, or hierarchical models. In modern ML with high-dimensional features and non-linear relationships, simple logistic regression is often suboptimal.
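Of the imbalance adjustments mentioned above, weighting is the easiest to sketch. The example below assumes an inverse-frequency heuristic, $n / (k \cdot n_c)$ for a class with $n_c$ of the $n$ observations and $k$ classes, which is one common choice and not the only one. Both function names are hypothetical, and the loss shown is what a weighted fit would minimize rather than a fitting routine itself.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(y: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights: n / (k * n_c) for each class c."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(y: Sequence[int], p: Sequence[float], weights: Dict[int, float]) -> float:
    """Mean negative log-likelihood, each observation weighted by its class."""
    total = 0.0
    for y_i, p_i in zip(y, p):
        log_lik = y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
        total -= weights[y_i] * log_lik
    return total / len(y)


# Hypothetical sample: 95 negatives and 5 positives.
y = [0] * 95 + [1] * 5
w = balanced_class_weights(y)

# Loss of a constant prediction equal to the base rate, with and without weights.
p_const = [0.05] * len(y)
loss_weighted = weighted_log_loss(y, p_const, w)
loss_unweighted = weighted_log_loss(y, p_const, {0: 1.0, 1: 1.0})
```

With these weights each class contributes the same total weight to the loss, so misclassifying the rare class becomes much more costly. Reweighting also shifts the fitted intercept, so predicted probabilities from a weighted fit should not be read as calibrated estimates of the original base rate without correction.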

Applications by field

- Epidemiology: standard for odds ratios in case-control studies; confounder adjustment via covariate inclusion.
- Finance: credit scoring, default prediction, fraud; logistic regression is a regulatory baseline in many contexts.
- Marketing: churn modeling, conversion, campaign response; coefficient interpretability is a differentiator.
- Applied ML: baseline for tabular classification problems before non-linear models.

Common pitfalls

The first pitfall is interpreting $\beta_i$ as a direct effect on $p$. It is an effect on the logit, and because the sigmoid curve is non-linear, the resulting change in $p$ depends on the starting value. The second is confusing the odds ratio ($e^{\beta}$) with relative risk: the two approximately coincide when the outcome is rare ($<10\%$) but diverge for common outcomes, and reporting an odds ratio as a relative risk is a frequent error in epidemiology. The third is not checking assumptions: linearity of the logit in continuous predictors, absence of severe multicollinearity (VIF), and independence of observations. The fourth is including variables based only on univariate $p < 0.05$; inclusion should follow a theoretical framework, not fishing. The fifth is interpreting pseudo-$R^2$ (Nagelkerke, McFadden) as if it were the linear-regression $R^2$. They are not equivalent, and typical Nagelkerke values are between 0.1 and 0.4 even in good models.
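The first two pitfalls can be made concrete with a short numerical sketch. It applies one fixed odds ratio to three different baseline risks and records the resulting change in $p$ and the implied relative risk. The coefficient value, the baselines, and the function name are hypothetical, chosen only for illustration.

```python
import math


def risk_after_odds_ratio(p0: float, odds_ratio: float) -> float:
    """Probability obtained by multiplying the odds of baseline risk p0 by odds_ratio."""
    odds1 = odds_ratio * p0 / (1.0 - p0)
    return odds1 / (1.0 + odds1)


beta = 0.7                   # hypothetical coefficient
odds_ratio = math.exp(beta)  # constant on the odds scale

rows = []
for p0 in (0.01, 0.10, 0.50):
    p1 = risk_after_odds_ratio(p0, odds_ratio)
    # (baseline risk, new risk, absolute change in p, relative risk)
    rows.append((p0, p1, p1 - p0, p1 / p0))

for p0, p1, delta, rr in rows:
    print(f"baseline={p0:.2f}  new p={p1:.3f}  change in p={delta:.3f}  "
          f"RR={rr:.2f}  OR={odds_ratio:.2f}")
```

The same coefficient produces a small absolute change in $p$ at a 1% baseline and a much larger one at 50%, which is the first pitfall. The relative risk stays close to the odds ratio only at the rare-outcome baseline and falls well below it as the outcome becomes common, which is the second.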