Feature Engineering

“Features” are the properties or characteristics of something you want to predict. Machine learning predictions can often be improved by “engineering” – adjusting features that are already there or adding features that are missing. For instance, date/time fields may appear as nearly unique values that are not predictive. Breaking the single date/time feature into year, month, day, day of the week, and time of day may allow the machine learning model to reveal patterns otherwise hidden.

You can engineer features on your own data sets, but automated feature engineering is now part of advanced machine learning systems. The ability to sense opportunities to improve data set values, and to act on them automatically, delivers vastly improved performance without the tedium of doing it manually.

Common kinds of feature engineering include (see the code sketch after this list):

  • Expansion, as in the date/time example
  • Binning – grouping variables with minor variations into fewer values, as in putting all heights from 5’7.51” to 5’8.49” into category 5’8”
  • Imputing values – filling in missing values with substitutes such as the mean, median, or most frequent value
  • Removing unused or redundant features
  • Text vectorizing – deriving commonalities from repeated terms in otherwise unique strings
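As a minimal illustration of binning and text vectorizing, here is a sketch in Python using pandas and scikit-learn; the library choices and the height_in and comment columns are assumptions for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data set: a raw height measurement and a free-text field
df = pd.DataFrame({
    "height_in": [67.6, 68.2, 67.9, 71.4],
    "comment": ["late delivery", "on-time delivery",
                "late payment", "on-time payment"],
})

# Binning: round heights to the nearest inch, so 67.51-68.49 all become 68
df["height_bin"] = df["height_in"].round().astype(int)

# Text vectorizing: turn repeated terms in otherwise unique strings into counts
vec = CountVectorizer()
counts = vec.fit_transform(df["comment"])
text_features = pd.DataFrame(counts.toarray(),
                             columns=vec.get_feature_names_out())

print(pd.concat([df, text_features], axis=1))
```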

Automated Machine Learning (AutoML)

Automated Machine Learning (AutoML) refers to systems that build machine learning models with less manual coding than a data science programmer would need to build models from scratch.

At Squark, AutoML means absolutely no coding or scripting of any kind. This is the strongest definition of AutoML. All of the steps in making predictions with machine learning models – import of training and production data, variable identification, feature engineering, classification or regression algorithm selection, hyperparameter tuning, leaderboard explanation, variable importance listing, and export of the prediction data set – are handled through a SaaS, point-and-click interface.

Various other implementations of machine learning are dubbed AutoML but actually require extensive knowledge of data science and programming. For example, you may need to select the algorithm type, pick hyperparameter ranges, launch from a Jupyter notebook, know Python, or follow other processes that are unfamiliar to non-programmers.

Hyperparameter

Hyperparameters are variables that are external to, and not directly derived from, the data sets of known outcomes used to train machine learning models. A hyperparameter is a configuration variable used to optimize model performance.

Automated Machine Learning (AutoML) systems such as Squark tune hyperparameters automatically. Data scientists who build models manually can write code that controls hyperparameters to seek ways to improve model performance, as in the sketch after the list below.

Examples of hyperparameters are:

  • Learning rate and duration
  • Latent factors in matrix factorization
  • Leaves, or depth, of a tree
  • Hidden layers in a deep neural network
  • Clusters in a k-means clustering
  • The k in k-nearest-neighbors
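A minimal sketch of manual tuning in Python with scikit-learn, searching over k for a k-nearest-neighbors classifier; the Iris data set and the grid of k values are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of k are hyperparameters: chosen before training,
# not learned from the training data itself
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# Cross-validated grid search tries each candidate and keeps the best
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```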


Bias

Bias is the tendency of a model to learn from some variables and not others. Some bias is essential, since a machine learning model must predict based on the data features that carry more predictive signal than others.

High bias occurs when model training uses too few variables, due either to limited features in the training data or to restrictions on the number of variables an algorithm is able to consider. High bias results in underfitting.

Low bias is desirable, but it is a trade-off with variance in algorithm performance.

Variance

Variance is a measure of a model’s sensitivity to fluctuations in training data. Models with high variance predict based on noise in training data instead of the true signal. The result is overfitting – the model appears very accurate on training data but is unable to predict well on new data.

Low variance is desirable, but it is a trade-off with bias in algorithm performance.
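One way to see the trade-off is to vary model complexity and compare performance on training and test data. A minimal sketch in Python with scikit-learn; the synthetic data, polynomial degrees, and split are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine wave as a stand-in for real data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```

The degree-1 model scores poorly on both sets (high bias, underfitting), while the degree-15 model scores nearly perfectly on training data but poorly on test data (high variance, overfitting).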

Underfitting

Underfitting occurs when models do not learn enough from training data to be useful for generalized predictions. Under-fit models fail to detect the underlying patterns in the data, often due to over-simplification of features or over-regularization during training.

Overfitting

Overfitting happens when models perform well – with high apparent accuracy – on training data, but perform poorly on new data. This is often the result of learning from noise or fluctuations in training data. Comparing results on hold-out data reveals the extent of a model’s ability to make generalized predictions and is a good barometer for detecting overfitting.
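A minimal sketch of that barometer in Python with scikit-learn, assuming the built-in breast-cancer data set and an unconstrained decision tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# An unconstrained tree is free to memorize noise in the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(f"training accuracy: {tree.score(X_train, y_train):.2f}")  # typically 1.00
print(f"hold-out accuracy: {tree.score(X_hold, y_hold):.2f}")    # noticeably lower
# A large gap between the two numbers is the classic symptom of overfitting.
```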

Confirmation Bias

Confirmation bias is a human tendency to find answers that match preconceived beliefs. It may manifest through selective gathering of evidence that supports desired conclusions and/or by interpreting results in ways that reinforce beliefs.

Confirmation bias can enter data analysis through unbalanced selection of the data to be analyzed and/or by filtering the resulting analyses in ways that support preconceived notions.

Date Factoring

Date factoring is a feature engineering technique that splits date-time data into its component parts. For instance, a date-time field with a format of MM-DD-YYYY HH:MM can be separated into variables for Year, Month, Day of Month, Day of Week, Day of Year, and Time. Pre-processing data sets to add columns for these individual variables may add predictive value when building models.
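A minimal sketch of date factoring in Python with pandas; the event_time column and its sample values are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"event_time": ["03-15-2024 09:30", "11-02-2024 17:45"]})
df["event_time"] = pd.to_datetime(df["event_time"], format="%m-%d-%Y %H:%M")

# Split the single date-time column into component variables
df["year"] = df["event_time"].dt.year
df["month"] = df["event_time"].dt.month
df["day_of_month"] = df["event_time"].dt.day
df["day_of_week"] = df["event_time"].dt.dayofweek  # Monday = 0
df["day_of_year"] = df["event_time"].dt.dayofyear
df["hour"] = df["event_time"].dt.hour

print(df)
```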

Where the sequence in which events occur is important, regression models that forecast values based solely on discrete date/time factors may not provide useful predictions. Sales forecasting and market projections are classic examples. See Time Series Forecasting.

Time Series Forecasting

Time series forecasting is a particular way of handling date-time information in model building. It takes into account the sequence in which events occur. This technique is essential when modeling regressions where factors such as seasonality, weather conditions, and economic indicators may be predictive of future outcomes. Consequently, sales forecasts and marketing projections are classic use cases for time series forecasting. Time series analysis utilizes algorithms that are specially tuned to predict using relative date-time information.
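As one common approach, the sequence can be encoded with lag features; a minimal sketch in Python with pandas and scikit-learn, using a hypothetical monthly sales series:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales figures
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)

# Lag features preserve the sequence in which events occurred
df = pd.DataFrame({"sales": sales})
df["lag_1"] = df["sales"].shift(1)  # last month's sales
df["lag_2"] = df["sales"].shift(2)  # sales two months ago
df = df.dropna()

model = LinearRegression().fit(df[["lag_1", "lag_2"]], df["sales"])

# One-step-ahead forecast from the two most recent observations
next_month = model.predict(
    pd.DataFrame({"lag_1": [sales.iloc[-1]], "lag_2": [sales.iloc[-2]]})
)
print(f"forecast for next month: {next_month[0]:.1f}")
```

Dedicated time-series algorithms (for example, the ARIMA family, or models with richer lag and seasonality features) build on this same idea of predicting from relative date-time information.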