Extended definition
Logistic regression is a statistical model for a categorical dependent variable that estimates the probability of belonging to a category as a logistic function of predictors. For a binary response (), the canonical form is:
where and is the logit (log-odds). Coefficients are interpreted as change in log-odds per unit increase in , or — after exponentiation — as odds ratios ( = odds ratio). Cox (1958, JRSS B) formalized logistic regression in the modern framework; Hosmer, Lemeshow, and Sturdivant (2013, Applied Logistic Regression) consolidated the standard technical reference. Variants include multinomial (more than 2 categories with no natural order — softmax), ordinal (ordered categories — proportional odds), and conditional (matched case-control studies).
When it applies
Logistic regression applies in any problem with a categorical outcome that must be modeled as a function of continuous or categorical predictors: disease presence/absence in epidemiology, success/failure of intervention, vote/abstention, default/non-default in credit, binary classification in classical ML. It is the standard technique for association analysis in case-control studies (analytic epidemiology) and pairs well with odds-ratio confidence intervals. In ML, logistic regression serves as a strong baseline before more complex models (random forest, gradient boosting, neural networks) — often hard to beat on tabular problems with well-engineered features.
When it does not apply
It does not apply for continuous dependent variables — use linear regression. It does not apply for ordinal variables with more than 4-5 categories if the distance between categories is informative — ordinal models or linear regression may be more appropriate. It does not apply for outcomes with extreme class imbalance ( of one class) without adjustments (Firth correction, downsampling, weighting). It does not apply to data with strong dependence structure (repeated measures, clustering) without extensions: mixed models (GLMM), GEE, or hierarchical models are appropriate. In modern ML with high-dimensional features and non-linear relationships, simple logistic regression is often suboptimal.
Applications by field
— Epidemiology: standard for odds ratios in case-control studies; confounder adjustment via covariate inclusion. — Finance: credit scoring, default prediction, fraud — logistic regression is a regulatory baseline in many contexts. — Marketing: churn modeling, conversion, campaign response — coefficient interpretability is a differentiator. — Applied ML: baseline for tabular classification problems before non-linear models.
Common pitfalls
The first pitfall is interpreting as direct effect on — it is effect on the logit; change in depends on the starting value (sigmoid curve is non-linear). The second is confusing odds ratio () with relative risk — they coincide when the outcome is rare () but diverge in common outcomes; reporting as relative risk when it is odds ratio is a frequent error in epidemiology. The third is not checking assumptions: logit linearity in continuous predictors, absence of severe multicollinearity (VIF), independence of observations. The fourth is including variables based only on univariate : inclusion should follow theoretical framework, not fishing. The fifth is interpreting pseudo- (Nagelkerke, McFadden) as the linear-regression — they are not equivalent; typical Nagelkerke values are between 0.1 and 0.4 even in good models.