Extended definition
Feature engineering is the set of practices for transforming raw data into informative features for machine learning models. It includes: encoding categorical variables (one-hot, ordinal, target encoding, embeddings), normalization and standardization (min-max, z-score, robust scaling), creation of derived features (interactions, polynomial terms, temporal aggregations, calendar-aware decompositions), distribution-adjusting transformations (log, Box-Cox, Yeo-Johnson), missing value handling (simple or multiple imputation), feature selection (filter, wrapper, and embedded methods), and dimensionality reduction (PCA, autoencoders). Domingos (2012, CACM) argued in “A few useful things to know about machine learning” that the features used are often the most important factor in practical ML performance, more so than the choice of algorithm. Kuhn and Johnson (2019, Feature Engineering and Selection) consolidated the practical techniques into a book-length reference. In modern deep learning (computer vision and NLP), part of feature engineering has migrated to automatic representation learning via deep networks; on tabular data, manual engineering remains decisive.
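As an illustration of three of the practices listed above (one-hot encoding, a log transform, and z-score standardization), here is a minimal standard-library sketch. The rows, the column meanings (city, income), and all function names are hypothetical; a real project would more likely use a library such as scikit-learn (`OneHotEncoder`, `StandardScaler`).

```python
import math
from statistics import mean, pstdev


def fit_one_hot(values):
    """Learn the category vocabulary from training values only."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; categories unseen in training become all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


def fit_zscore(values):
    """Learn mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant columns
    return mu, sigma


def zscore(value, mu, sigma):
    return (value - mu) / sigma


# Hypothetical training rows: (city, income)
train = [("lisbon", 1200.0), ("porto", 800.0), ("lisbon", 1000.0)]

categories = fit_one_hot([city for city, _ in train])
# The log transform compresses the right tail before standardizing.
mu, sigma = fit_zscore([math.log1p(income) for _, income in train])


def transform(row):
    """Turn one raw row into a numeric feature vector."""
    city, income = row
    return one_hot(city, categories) + [zscore(math.log1p(income), mu, sigma)]


train_features = [transform(row) for row in train]
```

The fitted state (`categories`, `mu`, `sigma`) comes from the training rows only, so the same `transform` can later be applied to unseen rows without refitting.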
When it applies
Feature engineering applies to any supervised ML project on tabular data, from Kaggle competitions to production systems in health, finance, and marketing. It typically sits between exploratory analysis and modeling. It applies especially when the goal is interpretability: well-engineered features in simple models (logistic regression, GAMs) often compete with black-box models on performance and are easier to explain to regulators. It also applies to time series (lags, moving averages, seasonality), to text (TF-IDF before embeddings), and to signals and images when deep learning is not viable (computational cost, small data).
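For the time series case, lag and moving-average features could be sketched as follows. The function name, the feature names, and the sample series are illustrative assumptions. The rolling mean deliberately uses only values strictly before time t, so no feature depends on the value being predicted.

```python
def lag_features(series, lags=(1, 2), window=3):
    """Build lag and trailing-mean features for each time step.

    Features at step t use only series[:t]; entries without enough
    history are left as None.
    """
    rows = []
    for t in range(len(series)):
        row = {"y": series[t]}
        for k in lags:
            row[f"lag_{k}"] = series[t - k] if t - k >= 0 else None
        past = series[max(0, t - window):t]  # excludes series[t] itself
        row[f"roll_mean_{window}"] = (
            sum(past) / window if len(past) == window else None
        )
        rows.append(row)
    return rows


# Hypothetical daily sales
sales = [10, 12, 11, 15, 14]
lagged_rows = lag_features(sales)
```

Rows with `None` values (the first few time steps) are usually dropped or imputed before modeling; which is preferable depends on how much history the series has.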
When it does not apply
It does not apply to deep learning over images, audio, or unstructured text, where representation learning via CNNs, Transformers, and pretrained models replaces much of the manual work. It does not work as a crutch for a poorly chosen model: artificial features that try to compensate for a fundamental model limitation (for example, a linear model where the relation is non-linear) help little, and switching the model is more effective. It must not be applied carelessly before the train/test split: aggregation-based features (mean, count) must be computed on the training data ONLY and then applied to the test data, to avoid leakage. It does not replace data quality: sophisticated features built on noisy data amplify the noise.
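The train-only rule for aggregation features can be sketched as a fit/apply pair: per-category target means are learned from training rows and then looked up for test rows. All names and data here are hypothetical, and falling back to the global training mean for unseen categories is one common choice, not the only one.

```python
from collections import defaultdict


def fit_category_means(categories, targets):
    """Learn per-category target means from TRAINING rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    means = {c: sums[c] / counts[c] for c in sums}
    return means, global_mean


def apply_category_means(categories, means, global_mean):
    """Look up learned means; unseen categories get the training global mean."""
    return [means.get(c, global_mean) for c in categories]


# Hypothetical split: the test targets are never touched.
train_cat = ["a", "a", "b", "b"]
train_y = [1, 0, 1, 1]
test_cat = ["a", "b", "c"]

means, global_mean = fit_category_means(train_cat, train_y)
test_encoded = apply_category_means(test_cat, means, global_mean)
```

Applying the same lookup to the training rows themselves still lets each row see its own target; out-of-fold encoding or smoothing is commonly used to reduce that effect.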
Applications by field
- Health: derived clinical features (Charlson comorbidity index, APACHE score), temporal aggregations of vital signs.
- Finance: lags and moving averages, financial ratios, momentum and volatility indicators in risk and trading.
- Marketing: RFM features (Recency, Frequency, Monetary value), time×channel interactions in purchase propensity.
- Classical NLP: TF-IDF, n-grams, linguistic features (POS-tags, sentiment) before the era of dense embeddings.
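As one concrete case from the marketing item, RFM features can be derived from a transaction log roughly as below. The tuple layout, the `as_of` cutoff parameter, and the sample data are assumptions made for illustration; transactions after the cutoff are skipped so the features only reflect what was known at prediction time.

```python
from datetime import date


def rfm(transactions, as_of):
    """Compute Recency, Frequency, Monetary value per customer.

    transactions: iterable of (customer_id, purchase_date, amount).
    Purchases dated after `as_of` are ignored.
    """
    stats = {}
    for customer, day, amount in transactions:
        if day > as_of:
            continue  # information from after the prediction date
        rec = stats.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        rec["last"] = max(rec["last"], day)
        rec["frequency"] += 1
        rec["monetary"] += amount
    return {
        customer: {
            "recency_days": (as_of - rec["last"]).days,
            "frequency": rec["frequency"],
            "monetary": rec["monetary"],
        }
        for customer, rec in stats.items()
    }


# Hypothetical transaction log
tx = [
    ("c1", date(2024, 1, 5), 30.0),
    ("c1", date(2024, 2, 20), 45.0),
    ("c2", date(2024, 1, 15), 20.0),
    ("c2", date(2024, 3, 10), 99.0),  # after the cutoff, excluded
]
rfm_features = rfm(tx, as_of=date(2024, 3, 1))
```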
Common pitfalls
The first pitfall is data leakage: features based on information later than the prediction timestamp (temporal leakage) or on aggregations computed over the full dataset (including test) spuriously inflate performance. The second is one-hot encoding of very high-cardinality variables (hundreds to thousands of categories) without grouping or target encoding, which explodes dimensionality and degrades performance. The third is failing to normalize before scale-sensitive algorithms (k-means, SVM, neural networks), which harms convergence and stability. The fourth is selecting features on the full dataset (including validation/test), which biases both the selection and the resulting performance estimate; selection must occur within each CV fold or on training data only. The fifth is building features without understanding the domain: a complex feature derived from statistical intuition can capture irrelevant artifacts, while a simple feature based on domain knowledge is often more informative.
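The fourth pitfall can be avoided by repeating the selection inside each fold. The sketch below uses a simple filter criterion (absolute Pearson correlation with the target) and an interleaved fold assignment; both are illustrative assumptions, as are the function names and the synthetic data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_top_k(X, y, k):
    """Filter-style selection: keep the k features most correlated with y."""
    n_features = len(X[0])
    scores = [
        abs(pearson([row[j] for row in X], y)) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def selection_per_fold(X, y, n_folds, k):
    """Run the selection on each fold's training part only."""
    n = len(X)
    chosen = []
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Held-out rows (i % n_folds == fold) never influence the selection.
        chosen.append(select_top_k(X_train, y_train, k))
    return chosen


# Synthetic data: feature 0 tracks the target, feature 1 is constant,
# feature 2 alternates in sign.
X = [[float(i), 1.0, float((-1) ** i)] for i in range(12)]
y = [2.0 * i + 1.0 for i in range(12)]
selected = selection_per_fold(X, y, n_folds=3, k=1)
```

In a full workflow the model would then be trained and scored per fold using that fold's selected features, so the validation score reflects the whole procedure, selection included.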