AI & MACHINE LEARNING

Overfitting

Phenomenon in which a machine learning model fits the training-set sampling noise excessively, losing generalization ability. Detected by the gap between training error (low) and test error (high). Underfitting is the opposite problem.

Extended definition

Overfitting is the phenomenon in which a machine learning model fits the sampling noise in the training set too closely, capturing spurious patterns that do not generalize to new data. Operationally, it is detected by the gap between training error and test error: $\text{error}_{\text{test}} \gg \text{error}_{\text{train}}$. The opposite problem, underfitting, occurs when the model is too simple to capture the real structure in the data, producing high error on both the training and test sets. The sweet spot is a region of moderate complexity where the model generalizes well. Hawkins (2004) synthesized the literature on the problem in computational chemistry and applied statistics, arguing that overfitting is a general phenomenon, not restricted to neural networks or modern ML: it affects polynomial regression, unpruned decision trees, and any model whose flexibility is disproportionate to the sample size. Hastie, Tibshirani, and Friedman (2009) formalize the bias-variance trade-off as a mathematical decomposition of error: increasing complexity reduces bias but increases variance, and overfitting is the regime where variance dominates.
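The train/test gap described above can be illustrated with polynomial regression. The following is a minimal sketch, not a reference implementation: the sine-wave ground truth, the noise level, the sample sizes, and the function name `train_test_mse` are all assumptions chosen for illustration.

```python
import numpy as np


def train_test_mse(degree, n_train=15, n_test=200, noise=0.3, seed=0):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: a sine wave observed with Gaussian noise.
    x_train = rng.uniform(-1, 1, n_train)
    x_test = rng.uniform(-1, 1, n_test)
    y_train = np.sin(np.pi * x_train) + rng.normal(0, noise, n_train)
    y_test = np.sin(np.pi * x_test) + rng.normal(0, noise, n_test)

    # Ordinary least-squares polynomial fit on the training set only.
    coeffs = np.polyfit(x_train, y_train, degree)

    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


if __name__ == "__main__":
    # Low, moderate, and high complexity relative to 15 training points.
    for degree in (1, 3, 12):
        train_mse, test_mse = train_test_mse(degree)
        print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Because the polynomial families are nested, training error can only fall as the degree rises. The test error is the quantity to watch: a high-degree fit on so few points typically shows a small training error alongside a much larger test error.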

When it applies

The overfitting concept applies in any supervised predictive modeling project: classification, regression, time series forecasting, recommender systems. It is particularly relevant in deep neural networks (whose representational capacity easily exceeds what the available data can constrain), gradient boosting with many boosting rounds, random forests built from deep, unpruned trees, high-degree polynomial regression, and models with many features on small datasets. The standard diagnosis is to monitor validation-set error during training: when validation error starts rising while training error continues to fall, the model has begun to overfit.
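The validation-monitoring diagnosis can be sketched as a simple early-stopping rule. This is an illustrative sketch under stated assumptions: the function name `early_stopping_epoch`, the `patience` parameter, and the error curve below are hypothetical and do not come from any particular library.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch index with the lowest validation error seen so far,
    scanning until `patience` epochs pass without a new best."""
    if not val_errors:
        raise ValueError("val_errors must contain at least one epoch")

    best_epoch = 0
    best_error = float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            # New best validation error: remember where it happened.
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            # Validation error has not improved for `patience` epochs.
            break
    return best_epoch


if __name__ == "__main__":
    # Hypothetical curve: validation error falls, then rises as the
    # model starts fitting noise in the training set.
    val_errors = [0.95, 0.70, 0.58, 0.52, 0.55, 0.59, 0.64, 0.70]
    print("restore weights from epoch", early_stopping_epoch(val_errors))
```

In practice the weights saved at the returned epoch would be restored. The `patience` window is a design choice that guards against stopping on a single noisy validation reading.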

When it does not apply

The concept does not apply in purely descriptive models without generalization goals. It does not apply directly in classical inferential statistics focused on estimating population parameters (overfitting is a predictive problem, not an inferential one, though including excess variables in a regression is an analogous issue). In big-data problems with $n \gg p$ (many observations, few features), overfitting is less urgent, though not impossible. In modern deep learning with adequate regularization and huge datasets, classical overfitting has been partially replaced by other problems (memorization, representational bias, distribution shift).

Applications by field

Computer vision: overfitting in deep networks is controlled with data augmentation, dropout, batch normalization, and early stopping.

NLP: large pre-trained models (BERT, GPT) fine-tuned on small datasets carry a high risk of overfitting; explicit regularization and a low learning rate are standard.

Health and biomedical sciences: small clinical datasets require strict cross-validation and external validation to rule out overfitting.

Quantitative finance: backtest overfitting is endemic; trading rules “that worked” historically frequently fail in production.

Common pitfalls

The first pitfall is diagnosing overfitting only by test performance: once the test set has been touched during development, it becomes an implicit validation set, and overfitting persists under the appearance of a “good model”. The second is confusing overfitting with underfitting: both produce poor production performance but require opposite actions (more regularization vs. more capacity). The third is assuming that regularization (L1, L2, dropout) eliminates overfitting: it reduces overfitting but does not eliminate it, and the gap can persist even with aggressive regularization. The fourth is ignoring hyperparameter overfitting: running many tuning experiments against the validation set turns it into implicit training data. The fifth is failing to distinguish overfitting from distribution shift: a well-fitted model may fail in production because the distribution changed, not because it overfit, and the fix is different.
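The fourth pitfall can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a real tuning workflow: each "model" is a coin flip with no predictive skill, and the function name `best_of_many_random_models` and all sizes are hypothetical. Selecting the best of many such models on a small validation set yields an inflated validation score that does not carry over to untouched test data.

```python
import numpy as np


def best_of_many_random_models(n_models=500, n_val=50, n_test=5000, seed=0):
    """Pick the best of many skill-free models on a validation set and
    return (validation accuracy, test accuracy) of the winner."""
    rng = np.random.default_rng(seed)

    # Binary labels with no learnable signal.
    y_val = rng.integers(0, 2, n_val)
    y_test = rng.integers(0, 2, n_test)

    # Each row is one "model": its predictions are independent coin flips.
    val_preds = rng.integers(0, 2, (n_models, n_val))
    test_preds = rng.integers(0, 2, (n_models, n_test))

    val_acc = (val_preds == y_val).mean(axis=1)
    test_acc = (test_preds == y_test).mean(axis=1)

    # "Hyperparameter tuning": keep whichever model scored best on validation.
    winner = int(np.argmax(val_acc))
    return float(val_acc[winner]), float(test_acc[winner])


if __name__ == "__main__":
    val_score, test_score = best_of_many_random_models()
    print(f"winner validation accuracy: {val_score:.3f}")
    print(f"winner test accuracy:       {test_score:.3f}")
```

The winner's validation accuracy sits well above chance only because it was selected on that set. Its accuracy on the held-out test set stays near 0.5, which is why the final estimate should come from data that played no part in model selection.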
