Bias

Bias is the tendency of a model to learn from some variables and not others. Some bias is essential, since machine learning must make predictions based on the data features that are more predictive than others.

High bias occurs when model training uses too few variables, due either to limited training data features or to restrictions on the number of variables an algorithm is able to consider. High bias results in underfitting.

Low bias is desirable, but is a trade-off with variance in algorithm performance.

Variance

Variance is a measure of a model’s sensitivity to fluctuations in training data. Models with high variance predict based on noise in training data instead of the true signal. The result is overfitting – the model appears very accurate on training data but cannot make useful predictions on new data.

Low variance is desirable, but is a trade-off with bias in algorithm performance.

Underfitting

Underfitting occurs when models do not learn enough from training data to make useful generalized predictions. Under-fit models fail to detect the underlying patterns in the data, often due to over-simplification of features or over-regularization of the model.

Overfitting

Overfitting happens when models perform well – with high apparent accuracy – on training data but perform poorly on new data. This is often the result of learning from noise or fluctuations in training data. Comparing results on training data with results on hold-out data reveals the extent of a model’s ability to make generalized predictions and is a good barometer for detecting overfitting.
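
A minimal sketch of that hold-out comparison, using scikit-learn on a synthetic dataset (the data, polynomial degrees, and noise level are purely illustrative): a very simple model underfits, and a very flexible one tends to show a gap between training and hold-out error, the signature of overfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Synthetic data: a noisy sine wave stands in for "true signal plus noise".
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Low degree = high bias (underfitting); very high degree = high variance (overfitting).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, hold-out MSE {test_err:.3f}")
```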

Confirmation Bias

Confirmation bias is a human tendency to find answers that match preconceived beliefs. It may manifest through selective gathering of evidence that supports desired conclusions and/or by interpreting results in ways that reinforce beliefs.

Confirmation bias can enter data analysis through unbalanced selection of the data to be analyzed and/or by filtering the resulting analyses in ways that support preconceived notions.

Monte Carlo Simulation vs. Machine Learning

Simulation uses models constructed by experts to predict probabilities. Machine Learning builds its own models to predict future outcomes.

Monte Carlo (the place) is the iconic capital of gambling—an endeavor that relies exclusively on chance probabilities to determine winners and losers. Monte Carlo (the method) employs random inputs to models to make predictions on how a system will behave.

When subject matter experts create good Simulation models, they can be valuable in revealing probabilities in complex systems with large numbers of variables—such as predicting human behaviors in markets. “What if?” scenarios can be tested because individual data points or sets of data points can be manipulated to show their effects on the entirety.
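
The mechanic is simple to sketch: feed random draws into an expert-built model and tally the outcomes. In this toy example the demand distribution, cost range, and price lever are assumptions invented for illustration, not taken from any real analysis.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Expert-built model: profit = demand * margin - fixed cost, with uncertain inputs.
demand = rng.normal(loc=1000, scale=200, size=n_trials)   # assumed demand distribution
unit_cost = rng.uniform(4.0, 6.0, size=n_trials)          # assumed cost range
price = 8.0                                                # "what if?" lever to experiment with
fixed_cost = 3000.0

profit = demand * (price - unit_cost) - fixed_cost

print(f"Mean profit:        {profit.mean():,.0f}")
print(f"Probability of loss: {(profit < 0).mean():.3f}")
```

Changing `price` and re-running is the “what if?” experiment: the random inputs stay the same in character, but the model reveals how the distribution of outcomes shifts.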

Machine Learning builds its own models based on data sets of known outcomes. Predictions are done automatically by applying these models to new sets of data. This methodology is perfect for business analyses such as identifying customers who will churn or predicting customer lifetime value. No human input or modelling skill is required. “The cards call themselves,” as you might say for hands at the Baccarat table.
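
Here is a minimal sketch of that “known outcomes in, predictions out” workflow with scikit-learn; the synthetic features standing in for customer records are invented for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for historical customer records with a known churn label.
X, y = make_classification(n_samples=5000, n_features=12, n_informative=6, random_state=0)
X_hist, X_new, y_hist, y_new = train_test_split(X, y, test_size=0.2, random_state=0)

# The model builds itself from known outcomes...
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_hist, y_hist)

# ...and is then applied automatically to new records.
churn_probability = model.predict_proba(X_new)[:, 1]
print("AUC on new data:", round(roc_auc_score(y_new, churn_probability), 3))
```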

The take-away: Simulation excels where domain expertise can be captured to build accurate models to enable experimentation—even creating data inputs to see what happens. Machine Learning is best for fast, automatic predictions on new data based on observations of known outcomes. They are not mutually exclusive. In fact, Machine Learning can be handy to test and refine Simulation models.

Data Mining vs. Machine Learning

Data Mining describes patterns, correlations, and anomalies in data.

Mines are not the best analogies for the processes referred to as Data Mining. Never mind that we call data storage places bases, warehouses, and lakes. The goal of data mining is not to extract raw material from data, but to identify characteristics within data sets that can be used to make decisions and predictions.

Think of Data Mining as applying statistics to make it easier for humans to understand past events recorded in data. By making assumptions and testing them, insights may be generated to help make decisions or predict general behavior in the future. Because it works only with known, static variables, data mining by itself cannot predict specific outcomes for new data.

Data Mining Processes
Here are some of the commonly used terms for tasks in data mining:

  • Anomaly Detection – Identifying records that are different enough from others to be checked as errors or outliers.
  • Dependency Modelling – Identifying relationships among variables, such as market basket analysis for items frequently bought together.
  • Clustering – Identifying characteristics of groups of records that are more similar to each other than to other groups.
  • Classification – Calculating the probability that a record belongs to one or more predefined categories.
  • Regression – Estimating the relationship between a dependent variable and one or more independent variables.
  • Summarization – Creating a shortened example set of data, including reports and graphical representations.
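
For concreteness, here is a small sketch of two of these tasks – clustering and anomaly detection – using scikit-learn. The two synthetic “customer segments” and the parameter choices are illustrative assumptions, not part of any standard recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two synthetic customer segments plus a couple of unusual records.
segment_a = rng.normal(loc=[20, 500], scale=[3, 50], size=(200, 2))
segment_b = rng.normal(loc=[60, 150], scale=[5, 30], size=(200, 2))
oddballs = np.array([[90.0, 900.0], [5.0, 5.0]])
records = np.vstack([segment_a, segment_b, oddballs])

# Clustering: group records that are more similar to each other than to other groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(records)

# Anomaly detection: flag records different enough to be checked as errors or outliers.
flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(records)

print("cluster sizes:", np.bincount(labels))
print("flagged as anomalies (row indices):", np.where(flags == -1)[0])
```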

Data Mining is good for preparing data and understanding variables that may be useful for predictions. The constraints of time and human analytical capacity to query, join, parse, and process large data sets make Data Mining ill-suited to production predictive analysis.

Machine Learning to the Rescue
Automated Machine Learning (AutoML) automatically makes assumptions and iterates on models until it captures the underlying patterns—without the need for human intervention. This means that programming to account for every possible data relationship is unnecessary. The speed of results—even for large data sets—is remarkable. Best of all, the AI models can be applied to fresh data automatically, which is the essence of prediction.
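
One simplified way to picture the idea – not any particular AutoML product’s API – is a loop that tries candidate models and keeps whichever generalizes best under cross-validation. The candidate list and dataset below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic training data with known outcomes.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8, random_state=1)

# Candidate models are tried automatically; no per-relationship programming is needed.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

# Cross-validation scores decide which candidate is kept for use on fresh data.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```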

The take-away: Data mining is useful to gain insights and to prepare data for predictive analytics, including AutoML. Machine Learning uses data patterns to predict future outcomes for new records.