Train/validation/test split — Glossary Aria Research

Extended definition

A train/validation/test split partitions a dataset into three disjoint subsets, each with a distinct methodological role in a machine learning project. The training set is used to fit model parameters (weights, coefficients, rules). The validation set is used to select hyperparameters (learning rate, tree depth, number of layers), compare candidate models, and make architectural decisions without contaminating the final evaluation. The test set is used only once, at the end of the process, to estimate generalization performance; consulting it during development is one of the best-documented sources of methodological bias in applied ML. Typical proportions vary: 60/20/20, 70/15/15, or 80/10/10, with smaller held-out fractions becoming acceptable as the dataset grows, because even a small fraction then contains enough examples for a stable estimate. The paradigm is standard textbook material, for example in Hastie, Tibshirani, and Friedman (The Elements of Statistical Learning, 2nd ed., 2009); Kohavi (1995) reviewed the holdout (fixed-split) estimate and gave a widely cited empirical comparison of cross-validation and the bootstrap for accuracy estimation.
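The partitioning described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `three_way_split`, the 70/15/15 default, and the fixed seed are assumptions made for the example, not part of the definition.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must lie strictly between 0 and 1")
    shuffled = list(items)
    # A local, seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    n_train = n - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical dataset: 1000 example identifiers.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Fixing the seed matters in practice: if the split changes between runs, examples that were once in the test set can end up influencing later training runs.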

When it applies

A train/validation/test split is appropriate when the dataset is large enough for each subset to remain representative of the full distribution. Typical applications: supervised classification with thousands to millions of examples (computer vision, NLP, recommender systems), predictive regression on historical series, and any supervised task in which the goal is to estimate out-of-sample performance. For temporal data, the split must respect chronological order, with every training example earlier than every validation and test example; a random split lets the model train on information from the future of the examples it is later evaluated on. For grouped data (multiple observations per subject), all observations from a given subject must fall in the same subset to prevent leakage.
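For the temporal case, a cutoff-based split is one simple way to guarantee chronological order. The sketch below assumes each record is a `(timestamp, value)` tuple; the function name `chronological_split`, the cutoff dates, and the toy monthly series are hypothetical.

```python
from datetime import date


def chronological_split(records, train_end, val_end, key=lambda r: r[0]):
    """Train: t < train_end; validation: train_end <= t < val_end; test: t >= val_end."""
    if not train_end < val_end:
        raise ValueError("train_end must be earlier than val_end")
    ordered = sorted(records, key=key)
    train = [r for r in ordered if key(r) < train_end]
    val = [r for r in ordered if train_end <= key(r) < val_end]
    test = [r for r in ordered if key(r) >= val_end]
    return train, val, test


# Toy series: one observation on the first day of each month of 2023.
records = [(date(2023, month, 1), float(month)) for month in range(1, 13)]
train, val, test = chronological_split(
    records, train_end=date(2023, 9, 1), val_end=date(2023, 11, 1)
)
print(len(train), len(val), len(test))
```

No shuffling takes place anywhere, so the latest training timestamp is always earlier than the earliest validation timestamp, which in turn is earlier than the earliest test timestamp.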

When it does not apply

It does not apply to small datasets, where reserving 20-30% for testing would leave too few training examples to learn from; cross-validation (k-fold, leave-one-out) is the preferable alternative. It does not apply to purely exploratory projects without a predictive hypothesis. It does not replace external validation in a different domain: a model trained and tested on the same dataset may capture local patterns that do not generalize to other populations or contexts. In time series with severe non-stationarity, a single split does not capture regime change; moving-window or prospective validation approaches are needed.
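The two alternatives named above can be sketched as index generators. This is an illustrative standard-library version, not a reference implementation; the names `k_fold_indices` and `walk_forward_splits` and their parameters are assumptions for the example. The second function shows an expanding-window variant of moving-window validation and assumes the indices are already in time order.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs; each index is held out exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def walk_forward_splits(n, initial_train, horizon):
    """Yield expanding-window (train_idx, test_idx) pairs over time-ordered indices."""
    if initial_train < 1 or horizon < 1:
        raise ValueError("initial_train and horizon must be positive")
    start = initial_train
    while start + horizon <= n:
        # Train on everything before `start`, evaluate on the next `horizon` points.
        yield list(range(start)), list(range(start, start + horizon))
        start += horizon


for train_idx, held_out_idx in k_fold_indices(23, k=5):
    print(len(train_idx), len(held_out_idx))

for train_idx, test_idx in walk_forward_splits(10, initial_train=4, horizon=2):
    print(train_idx[-1], test_idx)
```

Setting `k` equal to `n` in `k_fold_indices` gives leave-one-out cross-validation.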

Applications by field

- Computer vision: benchmark datasets such as ImageNet, COCO, and MNIST are distributed with predefined splits that the community uses as is.
- NLP: datasets such as GLUE and SQuAD ship with official splits to ensure comparability across papers.
- Health: clinical ML requires a patient-level split (not a sample-level one), with external validation at a different hospital before the model can be trusted.
- Finance: strict temporal split for series forecasting; the test period always falls after the training period.
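The patient-level split mentioned for clinical data can be sketched by assigning whole groups, rather than individual rows, to each subset. The function name `group_split`, the `(patient_id, sample_no)` row layout, and the fractions below are assumptions for illustration; note that the fractions apply to the number of groups, not the number of rows.

```python
import random


def group_split(rows, group_of, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole groups to train/val/test so that no group spans two subsets."""
    groups = sorted({group_of(r) for r in rows})  # sort first for determinism
    random.Random(seed).shuffle(groups)
    n = len(groups)
    n_test = max(1, round(n * test_frac))
    n_val = max(1, round(n * val_frac))
    if n_test + n_val >= n:
        raise ValueError("not enough groups to leave any for training")
    test_groups = set(groups[:n_test])
    val_groups = set(groups[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for r in rows:
        g = group_of(r)
        name = "test" if g in test_groups else "val" if g in val_groups else "train"
        split[name].append(r)
    return split


# Hypothetical data: 20 patients with 3 samples each.
rows = [(f"patient_{p:02d}", s) for p in range(20) for s in range(3)]
split = group_split(rows, group_of=lambda r: r[0])
print({name: len(part) for name, part in split.items()})
```

The same function covers any grouping key (user, device, document), since only `group_of` needs to change.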

Common pitfalls

The first pitfall is touching the test set during development: tuning hyperparameters against test-set results invalidates the performance estimate. The second is data leakage, which takes several forms: features that encode future information (e.g., a variable computed from data that would not yet exist at prediction time), splits that do not group by subject, and normalization statistics computed over the full dataset (they should be computed on the training set only and then applied to the other subsets). The third is assuming that a standard ratio such as 80/10/10 always works; in small datasets, k-fold cross-validation is more robust. The fourth is ignoring stratification in class-imbalanced problems, where a random split can produce a test set with no minority-class examples. The fifth is treating test-set performance as a production guarantee: distribution shift between test data and real production data is a primary source of degradation in deployed models.
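Two of these pitfalls, missing stratification and normalization computed over the full dataset, have straightforward mechanical remedies. The sketch below is one possible standard-library version; the names `stratified_split` and `fit_standardizer`, the 95/5 class imbalance, and the toy feature values are assumptions for the example.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices class by class so every class with 2+ examples reaches the test set."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # A class with a single example stays in training; otherwise hold out at least one.
        n_test = max(1, round(len(idx) * test_frac)) if len(idx) > 1 else 0
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def fit_standardizer(train_values):
    """Compute mean and standard deviation on training values only; return a scaler."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant feature
    return lambda x: (x - mu) / sigma


# Hypothetical imbalanced problem: 95 negatives, 5 positives, one numeric feature.
labels = [0] * 95 + [1] * 5
values = [float(i) for i in range(100)]
train_idx, test_idx = stratified_split(labels)
scale = fit_standardizer([values[i] for i in train_idx])
test_scaled = [scale(values[i]) for i in test_idx]  # test data never influences mu or sigma
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The scaler is fitted before the test values are ever read, so the statistics it applies to validation or test data come from the training set alone.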