What Is Feature Engineering?

Data sets can be made more predictive with a little help.

“Features” are the properties or characteristics of something you want to predict. Machine learning predictions can often be improved by “engineering”— adjusting features that are already there or adding features that are missing. For instance, date/time fields may appear as nearly unique values that are not predictive. Breaking the single date/time feature into year, month, day, day of the week, and time of day may allow the machine learning model to reveal patterns otherwise hidden.
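
The date/time expansion described above can be sketched in a few lines of Python. This is a minimal illustration (the function name is hypothetical, not a Squark API):

```python
from datetime import datetime

def expand_datetime(ts: str) -> dict:
    """Split one raw timestamp string into several model-friendly features."""
    dt = datetime.fromisoformat(ts)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "day_of_week": dt.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour": dt.hour,
    }

print(expand_datetime("2023-11-24 14:30:00"))
# {'year': 2023, 'month': 11, 'day': 24, 'day_of_week': 4, 'hour': 14}
```

Each derived column now repeats across rows, so a model can learn patterns like "orders spike on Fridays" that a nearly unique timestamp string would hide.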

Automated ML to the Rescue

You can engineer features on your own data sets, but automated feature engineering is now built into advanced machine learning systems. By sensing opportunities to improve data set values and acting on them automatically, these systems deliver substantially better performance without the tedium of doing it by hand.

Common kinds of feature engineering include:

  • Expansion, as in the date/time example
  • Binning – grouping variables with minor variations into fewer values, as in putting all heights from 5’7.51” to 5’8.49” into category 5’8”
  • Imputing values – filling in missing data with plausible substitutes, such as a column’s mean
  • Interactions – adding, subtracting, or multiplying features that interact
  • Removing unused or redundant features
  • Text vectorization – deriving commonalities from repeated terms in otherwise unique strings
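
Two of these kinds, binning and interactions, are simple enough to sketch directly. The functions below are illustrative only, using the height example from the list (assume heights are expressed in total inches, so 5’8” is 68):

```python
# Binning: collapse minor height variations into one category.
def bin_height(total_inches: float) -> int:
    """Round a height to the nearest whole inch, so 67.51"..68.49" all become 68"."""
    return round(total_inches)

# Interaction: combine two features that interact into one derived feature.
def add_area(row: dict) -> dict:
    """Append a length x width interaction column to a row."""
    return {**row, "area": row["length"] * row["width"]}

print(bin_height(67.51))                              # 68
print(add_area({"length": 3.0, "width": 4.0})["area"])  # 12.0
```

Binning trades a little precision for far fewer distinct values, which often helps tree- and frequency-based models generalize.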

Ever wonder why Variables of Importance lists don’t exactly match your source data set variables? That’s Squark using automatic feature engineering to improve prediction by reducing or expanding the number of variable columns.

As always, get in touch if you have questions about Feature Engineering or any other Machine Learning topic. We’re happy to help.

Why 50% Isn’t Always The Prediction Cut-off

There Is This “F1” Thing

Why 50% probability isn’t always the prediction cut-off.

Say you are classifying 100 examples of fruit and there are 99 oranges and one lime. If your model predicted all 100 are oranges, then it is 99% accurate. But the high accuracy veils the model’s inability to detect the difference between oranges and limes. Changing the break point for prediction confidence is a way to improve a model’s usefulness when column values have imbalanced frequencies like this.

Classification models compute a score called “F1” to account for this behavior. F1 is the harmonic mean of precision and recall, ranging from zero (worst) to 1 (best). In short, it measures how well the model performed at telling the groups apart.
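
The orange/lime example above can be checked with a short, self-contained sketch. The helper below computes F1 from first principles (precision and recall per class); it is illustrative, not Squark’s implementation:

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["orange"] * 99 + ["lime"]
y_pred = ["orange"] * 100  # the "everything is an orange" model

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                               # 0.99
print(f1_for_class(y_true, y_pred, "lime"))   # 0.0
```

Accuracy looks excellent at 99%, but the F1 score for the lime class is zero, exposing the model’s inability to find a single lime.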

How Does It Work?

Knowing why models classify the way they do is at the heart of explainability. Fortunately, you don’t need to understand the equations to see what F1 reveals about classification performance. If you are really curious, the Max F1 post in Squark’s AI Glossary explains the math and how the maximum F1 value is used in Squark classification models.

Feature Leakage – Causes and Remedies

Do your models seem too accurate? They might be.

Feature leakage, a.k.a. data leakage or target leakage, causes predictive models to appear more accurate than they really are, ranging from overly optimistic to completely invalid. The cause is highly correlated data: training features that contain information about the very outcome you are trying to predict.

How to Minimize Feature Leakage:

  1. Remove data that could not be known at the time of prediction.
  2. Perform data cross-validation.
  3. If you suspect a variable is leaky, remove it and run again.
  4. Hold back a validation data set.
  5. Consider near-perfect model accuracy a warning sign.
  6. Check variables of importance for overly predictive features.
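
Step 6 above can be approximated with a quick correlation screen. This is a minimal sketch under simplifying assumptions (numeric features, a numeric target, and an illustrative 0.99 threshold); the function names and sample data are hypothetical:

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_leaky(features, target, threshold=0.99):
    """Return names of features suspiciously correlated with the target."""
    return [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]

target = [0, 1, 0, 1, 1, 0]  # e.g. 1 = customer churned
features = {
    "refund_amount": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],  # mirrors the target: leaky
    "age": [34, 51, 29, 42, 38, 45],
}
print(flag_leaky(features, target))  # ['refund_amount']
```

A feature that tracks the target this closely usually encodes something recorded after the outcome occurred, which is exactly what step 1 says to remove.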

If you are a Squark user, you’ll be happy to know that our AutoML identifies and removes highly correlated data before building models. Squark uses cross-validation and holds back a validation data set as well. Squark always displays accuracy and variables of importance for each model.

Your Data Does Not Have to Be Big

In fact, certain algorithms work well with smaller datasets.

Some models do require big datasets to deliver significant predictive power. But don’t assume that you need hundreds of feature columns or millions of rows. We’ve seen surprisingly usable accuracy from as few as a hundred rows and a dozen columns.

Data is Different. AI Must Be Too.

The whole point of using machine learning is that AI is better at finding patterns in data than legacy methods. Try different algorithms and see whether you converge on reasonable prediction accuracy with the data you already have at hand. AI may or may not produce actionable predictions. Regardless, you’ll learn a great deal about how much data, and which features, will ultimately make you most successful.

Get started now. Waiting for your fantasy, all-encompassing datasets will leave you permanently behind the curve.