‘Data leakage’ is a ubiquitous term in predictive modeling and a fixture of most Kagglers’ vocabulary.
If your model is performing too well, reflect on your methods before popping open the champagne.
Predictive modeling & Cross-validation
Predictive modeling focuses on making predictions on novel data using a model that learns the pattern from the training data.
This is a challenging problem because the model cannot be evaluated on data that is not yet available.
Hence, the existing training data is leveraged for learning the patterns and, at the same time, testing the capabilities of the model to accurately predict an unseen dataset. This is the principle that underlies cross-validation and more sophisticated techniques that try to reduce the variance in this estimate.
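As a rough illustration of the principle, here is a minimal k-fold cross-validation sketch in plain Python. The helper names (k_fold_indices, cross_validate) and the toy "predict the training mean" model are assumptions made for this example, not part of any library.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean absolute error measured on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        # The model only ever sees the training rows of this fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in held_out]
        scores.append(mean(errors))
    return scores

# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print(fold_scores, mean(fold_scores))
```

Each row is used for evaluation exactly once and is never part of the training data for the fold that scores it, which is what makes the averaged score an estimate of performance on unseen data.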
Data is said to have leaked when, during the model training, the model unintentionally or mistakenly has access to data which it would not have in real-life scenarios. This would cause the predictive scores (metrics) to overestimate the model’s utility when run in a production environment.
“Data leakage happens when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict.” (Daniel Gutierrez, Ask a Data Scientist: Data Leakage)
This additional information can allow the model to learn or know something that it otherwise would not know and, in turn, invalidate the estimated performance of the model being constructed.
How to detect leakage
“Any deal which is too good to be true is a questionable deal.” An easy way to probe for data leakage is to check whether the model is achieving performance that seems a little too good to be true.
As a general rule, if the model is too good to be true, we should get suspicious. The model might be somehow memorizing the feature-target relations instead of learning and generalizing.
For example, a model that predicts lottery numbers or picks stocks with very high accuracy (beyond anything rationally plausible) is almost certainly benefiting from leaked information.
Genesis of Data leakage
- Leaky Predictors: During feature engineering, any features updated (or created) after the target value is realized should be excluded, because that data won’t be available when the model is used to make new predictions.
- Pre-processing: A pervasive error that people make is to leak information in the data pre-processing step of machine learning. It is essential that these transformations only have knowledge of the training set, even though they are applied to the test set as well.
- Example 1 (Normalization): Many models, especially neural networks, require normalized input data. Commonly, data is normalized by dividing it by its average or maximum. If this is done using the average or maximum of the overall data set, then information from the test set will influence the training set. For this reason, the normalization statistics should be computed on the training subset only and then applied unchanged to the test subset.
- Example 2 (PCA): The PCA model should be fit only on the training set. Then, to apply it to your test set, the transform method of PCA should be called (in the case of a scikit-learn model) on the test set. If, instead, the pre-processor is fit on the entire dataset, information from the test set will be leaked, since the parameters of the pre-processing model will be fitted with knowledge of the test set.
- Example 3 (Missing Value Imputation): If the imputer for missing values is run before calling train_test_split, the data will leak subtly, because the test data will be used for imputing the training data. The end result? The model will get outstanding validation scores, giving great confidence in it, but perform poorly when it is deployed to make decisions.
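The pre-processing examples above share one pattern: learn the parameters from the training split, then apply them unchanged to the test split. The sketch below shows that pattern in plain Python for mean imputation and max scaling. The function names fit_preprocessor and transform are hypothetical stand-ins for this example; they only mirror the fit/transform split described above.

```python
from statistics import mean

def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from the training split only."""
    observed = [v for v in train_values if v is not None]
    fill_value = mean(observed)             # value used for mean imputation
    scale = max(abs(v) for v in observed)   # divisor used for max scaling
    return {"fill_value": fill_value, "scale": scale or 1.0}

def transform(values, params):
    """Apply already-learned parameters; nothing is re-estimated here."""
    filled = [params["fill_value"] if v is None else v for v in values]
    return [v / params["scale"] for v in filled]

train = [2.0, None, 4.0, 6.0]
test = [None, 100.0]  # the extreme test value must not affect the parameters

params = fit_preprocessor(train)        # fit on the training split only
train_ready = transform(train, params)
test_ready = transform(test, params)    # reuse the training parameters
print(params, train_ready, test_ready)
```

Had the parameters been computed on train and test together, the test value of 100.0 would have changed both the imputed value and the scale applied to the training rows, which is exactly the leak the three examples describe.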
How to preclude data leakage
1. Preventing Leaky Predictors
There is no one-size-fits-all solution that prevents leaky predictors. It requires knowledge about data, case-specific inspection, and common sense. However, leaky predictors frequently have high statistical correlations to the target. So, to screen for possible leaky predictors, look for columns that are statistically very highly correlated to the target.
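A minimal sketch of such a screen, assuming numeric columns and using the Pearson correlation. The column names, the toy data, and the 0.95 threshold are all made up for illustration; a flagged column is a prompt for manual inspection, not proof of leakage.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)

def flag_suspects(columns, target, threshold=0.95):
    """Return the columns whose absolute correlation with the target is high."""
    suspects = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            suspects[name] = r
    return suspects

# Hypothetical data: "refund_issued" is only recorded after the outcome.
target = [0, 1, 0, 1, 1, 0, 1, 0]
columns = {
    "age": [34, 45, 29, 52, 41, 38, 47, 50],
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}
suspects = flag_suspects(columns, target)
print(suspects)
```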
2. Improve Validation Strategies
If the validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of pre-processing steps. This is easier if scikit-learn Pipelines are used. When using cross-validation, it is even more critical that pipelines are used and that pre-processing is done inside the pipeline, so that each transformation is refit on the training folds only.
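One way this might look with scikit-learn is sketched below. The synthetic data, the choice of steps, and the parameter values are assumptions for the example only. The point is that the imputer, scaler, and PCA sit inside the Pipeline, so cross_val_score refits them on the training folds of each split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 200 rows, 6 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the values

# Split first, so the test rows never take part in any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("model", LogisticRegression()),
])

# Every pre-processing step is refit inside each cross-validation split.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit on the full training split, scored once on the untouched test split.
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(cv_scores.mean(), test_score)
```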
3. Hold Back a Validation Dataset
Split the training dataset into train and validation sets and store away the validation dataset. Once you have completed your modeling process, and actually created your final model, evaluate it on the validation dataset.
This gives you a sanity check on whether your estimate of performance has been overly optimistic because of leakage.
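A small sketch of the idea in plain Python. The helper names, the 20% holdout fraction, the 0.05 tolerance, and the scores in the usage lines are all hypothetical values chosen for illustration, not measurements.

```python
import random

def split_off_holdout(rows, holdout_fraction=0.2, seed=42):
    """Shuffle once and set aside a holdout set before any modeling starts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def looks_optimistic(dev_score, holdout_score, tolerance=0.05):
    """Flag a development score that exceeds the holdout score by a wide margin."""
    return (dev_score - holdout_score) > tolerance

rows = list(range(100))
working, holdout = split_off_holdout(rows)
print(len(working), len(holdout))

# Made-up scores: a large gap suggests the development estimate may have leaked.
print(looks_optimistic(dev_score=0.98, holdout_score=0.71))
print(looks_optimistic(dev_score=0.90, holdout_score=0.88))
```

The holdout rows are touched exactly once, after the final model exists; a big drop from the development score to the holdout score is the signal to go back and look for leaks.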
In addition to the above, machinelearningmastery mentions the following couple of tips to combat data leakage:
- Temporal Cutoff. For time-series data, remove all data just prior to the event of interest, focusing on the time at which a fact or observation was learned rather than the time the observation occurred.
- Add Noise. Add random noise to input data to try and smooth out the effects of possibly leaking variables.
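The temporal cutoff can be sketched as a filter on the time a fact became known. The record fields (occurred_at, learned_at) and the example events below are hypothetical, invented for this illustration.

```python
from datetime import datetime

def available_at(records, prediction_time):
    """Keep only records that were already known at prediction time."""
    return [r for r in records if r["learned_at"] < prediction_time]

records = [
    {
        "name": "payment_missed",
        "occurred_at": datetime(2023, 1, 5),
        "learned_at": datetime(2023, 1, 6),
    },
    {
        # Happened before the cutoff but was only recorded afterwards.
        "name": "account_closed",
        "occurred_at": datetime(2023, 1, 8),
        "learned_at": datetime(2023, 2, 1),
    },
]

prediction_time = datetime(2023, 1, 10)
usable = available_at(records, prediction_time)
print([r["name"] for r in usable])
```

Filtering on occurred_at would have kept account_closed, even though nobody could have known about it when the prediction had to be made; filtering on learned_at removes it.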
Data leakage can be a multi-million dollar mistake in many data science applications. Careful separation of training and validation data is the first step, and pipelines can help implement this separation. Leaky predictors are a more frequent issue and are harder to track down.
What data leakage means in practice depends on your role:
- Business Owners: Don’t be afraid to hold your engineers accountable for performance on out-of-sample data. Out-of-sample performance is ultimately what determines the value of a model. Hold back some of your own data from engineers and be the impartial adjudicator on the performance of the model.
- Engineers: If your model is performing too well, it is very likely that some form of data leakage is occurring. Make sure you understand your data, rather than view it as a homogeneous pile of numbers. Some features may only be known after the target is realized (ex-post indicators) and should be excluded altogether.
Once you have checked the above methods, making sure there is no leakage and your model is still performing well, you can open your champagne now 😀