#5 What Is Overfitting and How Do I Avoid It?

What Is Overfitting?

Telltale Super-Accuracy on Training Data

When machine learning models show exceptional accuracy on training data but perform poorly on new, unseen data, they are overfitting. Overfitting happens when a model “learns” the noise in its training data instead of the true signal, so the patterns it has memorized do not carry over to new data.
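
The gap between training and test accuracy can be illustrated with a small sketch. It is not how any particular product works. It assumes a made-up data set in which the true signal is simply "x is greater than 0.5" and 20% of the labels are flipped at random as noise. The function names are hypothetical. The model is a 1-nearest-neighbor rule, which memorizes every training point, noise included.

```python
import random

def make_noisy_data(n, rng, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # label noise: carries no real signal
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Share of points in `data` whose label the memorizing model gets right."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)

def overfitting_demo(seed=0):
    rng = random.Random(seed)
    train = make_noisy_data(200, rng)
    test = make_noisy_data(200, rng)
    # Score the same model on the data it memorized and on unseen data.
    return accuracy(train, train), accuracy(train, test)

train_acc, test_acc = overfitting_demo()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because every training point is its own nearest neighbor, training accuracy is perfect, while accuracy on the unseen points is noticeably lower. That difference is the telltale sign described above.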

How to Avoid Overfitting

Detecting overfitting is the first step. Comparing accuracy on the training data against accuracy on a portion of the data that was set aside for testing reveals when models are overfitting, because an overfit model scores much higher on the data it trained on. Techniques to minimize overfitting include:

  • Tuning Hyperparameters – Hyperparameters are settings that control how a learning algorithm trains, such as tree depth or learning rate. They are chosen before training rather than learned from the data. Tuning them for each family of machine learning algorithms helps models perform well without overfitting.
  • Cross-Validation – Cross-validation splits training data into additional train-test sets so that hyperparameters can be tuned iteratively without touching the test data that was initially set aside.
  • Early Stopping – Training generally improves model performance with more iterations, but only up to a point. Comparing performance on held-out data after each iteration and stopping when accuracy no longer improves prevents overfitting.
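
The three techniques above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a description of Squark's implementation. The data set is synthetic (signal "x is greater than 0.5" with 20% label noise), the model is a k-nearest-neighbor vote whose hyperparameter is k, and every function name is hypothetical.

```python
import random

def make_noisy_data(n, rng, noise=0.2):
    """Synthetic data: label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = (x > 0.5) != (rng.random() < noise)
        data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (use odd k)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(1 for _, label in neighbors if label)
    return votes * 2 > len(neighbors)

def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def cross_val_accuracy(data, k, n_folds=5):
    """Average validation accuracy over n_folds train/validation splits."""
    scores = []
    for fold in range(n_folds):
        validation = data[fold::n_folds]
        training = [p for i, p in enumerate(data) if i % n_folds != fold]
        scores.append(accuracy(training, validation, k))
    return sum(scores) / n_folds

def pick_k(data, candidates, n_folds=5):
    """Hyperparameter tuning: keep the candidate with the best cross-validated score."""
    return max(candidates, key=lambda k: cross_val_accuracy(data, k, n_folds))

def early_stopping_round(validation_scores, patience=2):
    """Index of the best round, stopping after `patience` rounds without improvement."""
    best_round = 0
    for i, score in enumerate(validation_scores):
        if score > validation_scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            break  # no recent improvement: stop training here
    return best_round

rng = random.Random(0)
data = make_noisy_data(500, rng)
train, test = data[:400], data[400:]  # the test set is never used for tuning
best_k = pick_k(train, candidates=[1, 5, 15])
print("chosen k:", best_k, "test accuracy:", accuracy(train, test, best_k))
```

Note that `pick_k` sees only the training portion. The set-aside test data is scored once, at the end, so it remains an honest estimate of performance on unseen data. For early stopping, a call such as `early_stopping_round([0.60, 0.70, 0.75, 0.74, 0.73, 0.80])` halts after two rounds without improvement and reports round 2 as the best one seen.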

Squark Seer automatically employs these and other approaches to minimize overfitting. As always, get in touch if you have questions about Overfitting or any other Machine Learning topic. We’re happy to help.

#6 How Should Dates and Times Be Modeled?

Factoring and Time Series

Why date and time features are important in models.

Many data sets contain date-time fields that we hope will provide predictive value in our models. But raw date-time fields in the form MM-DD-YYYY HH:MM:SS are essentially unique data points, so a model cannot find repeating patterns in them directly. In addition, the order in which events occur may have a bearing on outcomes.

Date Factoring

Date-time fields can be separated into component variables of Month, Date, Year, Time, Day of Month, Day of Week, and Day of Year. Pre-processing data sets to add columns for these individual variables may add predictive value when building models. (Squark automatically factors date-time fields before model building and ranking.)
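
As a rough illustration of date factoring, the sketch below splits one timestamp into component variables using Python's standard datetime module. It assumes timestamps are formatted as MM-DD-YYYY HH:MM:SS. The function name and the column names are hypothetical, and this is not meant to show how Squark performs the step internally.

```python
from datetime import datetime

def factor_datetime(text, fmt="%m-%d-%Y %H:%M:%S"):
    """Split one date-time string into separate component features."""
    moment = datetime.strptime(text, fmt)
    return {
        "month": moment.month,
        "day_of_month": moment.day,
        "year": moment.year,
        "hour": moment.hour,
        "day_of_week": moment.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": moment.timetuple().tm_yday,  # 1 through 366
    }

features = factor_datetime("03-15-2024 14:30:00")
print(features)
```

Unlike the raw timestamp, values such as day of week or month repeat across many rows, which gives a model something to generalize from.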

Time Series Forecasting

Models that consider the sequence in which events occur are called time series models. This technique is essential when outcomes depend on time-varying factors such as seasonality, weather conditions, and economic indicators. Sales forecasts and marketing projections are classic use cases for time series forecasting.
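
Two simple ways to bring event order into a model are sketched below, using invented sales figures. The first builds lag features, so each row carries earlier values from the same series. The second is a seasonal naive forecast, a common baseline that repeats the value observed one season earlier. Both function names are hypothetical, and real forecasting tools use far richer methods.

```python
def make_lag_features(series, lags):
    """Build rows of earlier values: each row holds series[t - lag] for every lag."""
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return rows, targets

def seasonal_naive_forecast(series, season_length, horizon):
    """Forecast each future step as the value seen one season earlier."""
    history = list(series)
    forecast = []
    for _ in range(horizon):
        value = history[-season_length]
        forecast.append(value)
        history.append(value)  # lets forecasts extend beyond one season
    return forecast

# Invented quarterly sales with a repeating four-quarter pattern.
sales = [10, 20, 30, 40, 12, 22, 32, 42]
rows, targets = make_lag_features(sales, lags=[1, 4])
next_year = seasonal_naive_forecast(sales, season_length=4, horizon=4)
print(rows[0], targets[0])
print(next_year)
```

When evaluating a time series model, the test period should come after the training period in time. Shuffling rows before splitting would let the model see the future.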

As always, get in touch if you have questions about using date-time in your predictions, or any other Machine Learning topic. We’re happy to help.