Do your models seem too accurate? They might be.
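Feature leakage, also known as data leakage or target leakage, causes predictive models to appear more accurate than they really are. The resulting accuracy estimates range from overly optimistic to completely invalid. Leakage occurs when the training data contains information about the value you are trying to predict that would not be available at prediction time. It often shows up as features that are very highly correlated with the target.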
How to Minimize Feature Leakage:
- Remove data that could not be known at the time of prediction.
- Cross-validate your models.
- If you suspect a variable is leaky, remove it and run again.
- Hold back a validation data set.
- Consider near-perfect model accuracy a warning sign.
- Check variables of importance for overly predictive features.
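As a rough illustration of the "suspect it, remove it, run again" steps above, here is a minimal Python sketch that screens features for near-perfect correlation with the target. Everything in it is an assumption made for the example: the column names, the synthetic data, the `flag_leaky_features` helper, and the 0.95 threshold are all hypothetical, and this is not a description of how Squark or any other tool detects leakage.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot correlate with anything
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_features(columns, target, threshold=0.95):
    """Return names of features whose |correlation| with the target
    meets or exceeds the (hypothetical) threshold."""
    return sorted(
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    )


# Synthetic churn data; all names and values are made up for illustration.
rng = random.Random(0)
churned = [rng.randint(0, 1) for _ in range(500)]
columns = {
    # Unrelated to the target.
    "age": [rng.randint(18, 80) for _ in churned],
    # A legitimate, moderately predictive feature.
    "support_calls": [2 * y + rng.randint(0, 5) for y in churned],
    # Recorded only after a customer churns, so it leaks the target.
    "exit_survey_sent": list(churned),
}

suspects = flag_leaky_features(columns, churned)
print("Suspected leaky features:", suspects)

# Drop the suspects, then retrain and compare accuracy on held-back data.
clean_columns = {k: v for k, v in columns.items() if k not in suspects}
```

A screen like this only catches leaks that appear as a strong linear relationship in a single feature. Subtler leaks still call for the other steps in the list, such as asking whether each field could actually be known at prediction time.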
If you are a Squark user, you’ll be happy to know that our AutoML identifies and removes highly correlated data before building models. Squark uses cross-validation and holds back a validation data set as well. Squark always displays accuracy and variables of importance for each model.