#01 Your Data Does Not Have to Be Big


In fact, certain algorithms work well with smaller datasets.

Some models do require big datasets to deliver significant predictive power. But don’t assume that you need hundreds of feature columns or millions of rows. We’ve seen surprisingly usable accuracy from as few as a hundred rows and a dozen columns.

The point of using machine learning is that it often finds patterns in data that legacy methods miss. Try several algorithms and see whether any of them reaches reasonable prediction accuracy with the data you already have at hand. The models may or may not produce actionable predictions. Either way, you’ll learn a great deal about how much data, and which features, you will ultimately need to succeed.
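Here is a minimal sketch of that kind of experiment, using only the Python standard library. Everything in it is an assumption made for illustration: the dataset is synthetic (100 rows, 12 columns, with a label that depends on the first two columns), and the two "algorithms" are deliberately simple stand-ins, a majority-class baseline and a nearest-centroid classifier. It is not how any particular AutoML product works; it only shows the habit of scoring several models on held-out folds of a small dataset.

```python
import random
from statistics import mean

def make_dataset(n_rows=100, n_cols=12, seed=0):
    """Synthetic stand-in for a small table: 100 rows, 12 numeric columns."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    # Assumed pattern: the label depends on the first two columns only.
    y = [1 if row[0] + row[1] > 0 else 0 for row in X]
    return X, y

def fit_majority(train_X, train_y):
    """Baseline: always predict the most common training label."""
    majority = 1 if 2 * sum(train_y) >= len(train_y) else 0
    return lambda row: majority

def fit_nearest_centroid(train_X, train_y):
    """Predict the class whose mean feature vector is closest to the row."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def predict(row):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
        return min(centroids, key=sq_dist)

    return predict

def cross_val_accuracy(fit, X, y, k=5):
    """Mean accuracy over k folds; each row is held out exactly once."""
    n = len(X)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        predict = fit(train_X, train_y)
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores)

X, y = make_dataset()
results = {
    "majority_baseline": cross_val_accuracy(fit_majority, X, y),
    "nearest_centroid": cross_val_accuracy(fit_nearest_centroid, X, y),
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

If a real model cannot beat the baseline on your own data, that is useful information too: it suggests you need more rows, better features, or a different algorithm.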

Get started now. Waiting for your fantasy, all-encompassing datasets will leave you permanently behind the curve.

#02 Feature Leakage – Causes and Remedies

Do your models seem too accurate? They might be.

Feature leakage, a.k.a. data leakage or target leakage, makes predictive models appear more accurate than they really are, with results ranging from overly optimistic to completely invalid. It happens when the training data contains information about the value you are trying to predict, typically a feature that is highly correlated with the target and would not be available when a real prediction is made.

How to Minimize Feature Leakage:

  1. Remove data that could not be known at the time of prediction.
  2. Use cross-validation.
  3. If you suspect a variable is leaky, remove it, retrain, and compare the accuracy.
  4. Hold back a validation data set that the model never sees during training.
  5. Consider near-perfect model accuracy a warning sign.
  6. Check variables of importance for overly predictive features.
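One simple way to start on the checklist above is to screen for features that track the target almost perfectly. The sketch below is a rough heuristic, not a complete leakage detector and not a description of any product's internals: it computes the Pearson correlation between each numeric column and the target, then flags anything at or above a threshold. The churn data, the column names, and the 0.95 threshold are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)

def suspect_leaks(columns, target, threshold=0.95):
    """Return {column name: correlation} for columns that look too predictive."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged

# Hypothetical churn table. "refund_issued" is only recorded after a
# customer cancels, so it could not be known at prediction time.
columns = {
    "monthly_spend": [20, 35, 50, 40, 25, 60, 30, 45],
    "support_calls": [1, 2, 0, 3, 2, 1, 4, 3],
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}
churned = [0, 1, 0, 1, 1, 0, 1, 0]

flagged = suspect_leaks(columns, churned)
for name, r in flagged.items():
    print(f"possible leak: {name} (r = {r:.2f})")
```

A flagged column is a suspect, not a verdict. Ask whether its value would really be known at the time of prediction, then remove it, retrain, and see how much the accuracy drops.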

If you are a Squark user, you’ll be happy to know that our AutoML identifies and removes highly correlated data before building models. Squark uses cross-validation and holds back a validation data set as well. Squark always displays accuracy and variables of importance for each model.