Feature Leakage – Causes and Remedies

Do your models seem too accurate? They might be.

Feature leakage, a.k.a. data leakage or target leakage, causes predictive models to appear more accurate than they really are, ranging from overly optimistic to completely invalid. The cause is training data that contains information about the outcome you are trying to predict, typically features that are highly correlated with the target but would not be available at the time of prediction. For example, a "refund issued" column in a churn model is only filled in after a customer has churned, so it gives the answer away.

How to Minimize Feature Leakage:

  1. Remove data that could not be known at the time of prediction.
  2. Use cross-validation when training and evaluating models.
  3. If you suspect a variable is leaky, remove it and run again.
  4. Hold back a validation data set.
  5. Consider near-perfect model accuracy a warning sign.
  6. Check variables of importance for overly predictive features (several of these steps are sketched in the code below).
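
To make the steps concrete, here is a minimal sketch using pandas and scikit-learn (not Squark itself). The churn scenario, the file name, and column names such as refund_issued are hypothetical placeholders.

```python
# Hypothetical churn-prediction example illustrating several leakage checks.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("customers.csv")  # hypothetical file

# 1. Remove data that could not be known at prediction time.
#    'refund_issued' is only set after a customer churns, so it leaks the target.
leaky_columns = ["refund_issued"]
X = df.drop(columns=["churned"] + leaky_columns)
y = df["churned"]

# 4. Hold back a validation set the model never sees during training.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)

# 2. Cross-validate on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validation accuracy:", cv_scores.mean())

model.fit(X_train, y_train)

# 5. Near-perfect validation accuracy is a warning sign, not a victory.
print("Validation accuracy:", model.score(X_valid, y_valid))

# 6. Inspect variables of importance; a single overwhelmingly dominant feature
#    is often a leaky one (step 3: remove it and run again).
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```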

If you are a Squark user, you’ll be happy to know that our AutoML identifies and removes highly correlated data before building models. Squark uses cross-validation and holds back a validation data set as well. Squark always displays accuracy and variables of importance for each model.

How Are Supervised and Unsupervised Learning Different?

Showing the way vs. stumbling in the dark – there are applications for both.

Supervised

Supervised Learning shows AutoML algorithms sets of known outcomes from which to learn. Think of classroom drills, or giving a bloodhound the scent.

Supervised learning relies on training data that includes a labeled outcome column alongside the feature columns. Those known outcomes are what the model learns from, so that the same column can be predicted on fresh data.
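
For instance, labeled training data for a churn model might look like the hypothetical table below (the column names and values are illustrative, not a Squark schema).

```python
import pandas as pd

# Hypothetical labeled training data: the 'churned' column is the known outcome
# (label) that a supervised model learns to predict for fresh rows.
train = pd.DataFrame({
    "monthly_spend": [42.0, 18.5, 73.2, 12.0],
    "support_tickets": [0, 3, 1, 5],
    "tenure_months": [24, 3, 36, 2],
    "churned": ["no", "yes", "no", "yes"],  # label / known outcome
})
print(train)
```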

Use Supervised Learning for…

  • Performance-based predictions
  • Scoring the likelihood that events will happen
  • Forecasting outcomes

Classification problems, where there are two (binary) or more (multi-class) possible outcomes, are use cases for supervised learning. Regression problems, which predict scalar numerical values such as forecasts, are also well suited to supervised learning.
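
As a rough illustration (using scikit-learn and its built-in toy datasets, not Squark), a binary classifier and a regressor are trained the same way: on rows whose outcomes are already known.

```python
# Supervised learning: the training data includes the known outcome (label).
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification: two possible outcomes.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Regression: predicting a scalar numerical value.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("Regression R^2:", reg.score(X_test, y_test))
```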

Unsupervised

Unsupervised learning happens when AutoML algorithms are given unlabeled training data and must make sense of it without any instruction. Such machines “teach themselves” what result to produce.

Unsupervised learning algorithms look for structure in the training data, like finding which examples are similar to each other and grouping them into clusters.
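
A minimal clustering sketch, again using scikit-learn on synthetic data: the algorithm receives no labels and simply groups similar rows together.

```python
# Unsupervised learning: no labels, only structure discovered in the data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data (the true groupings are discarded).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-means groups similar examples into clusters without being told what they mean.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 examples
print(kmeans.cluster_centers_)   # discovered cluster centers
```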

Use Unsupervised Learning for…

  • Understanding co-occurrence
  • Detecting hidden data relationships
  • Extracting structure from raw data

Clustering, market basket analyses, and anomaly detection are common use cases for unsupervised learning.
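
As one more illustrative sketch (scikit-learn on synthetic data), anomaly detection can be framed as finding the rows that do not fit the structure the algorithm discovers in the rest of the data.

```python
# Anomaly detection, another unsupervised use case: flag examples that
# do not fit the structure of the rest of the data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # typical observations
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # a few unusual ones
X = np.vstack([normal, outliers])

detector = IsolationForest(random_state=0).fit(X)
labels = detector.predict(X)     # 1 = normal, -1 = anomaly
print("Flagged as anomalies:", int((labels == -1).sum()))
```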