What is Bias in Machine Learning?

Bias occurs when a machine learning (ML) model fails to separate the true signal from the noise in its training data.

Biases in AI systems make headlines for results such as favoring one gender in hiring, recommending loans based on ethnicity, or recognizing faces differently depending on race. Some of these cases were due to biases baked into the algorithms written by (human) data scientists, but the majority merely learned from data that was itself biased.

How do you know if your business predictions are biased? Testing against broader sets of known outcomes is the best way. Since you don’t necessarily know which factors may be introducing bias, examining the predictive importance placed on each data feature can help reveal them. Squark shows lists of Variable Importance for the models it generates. Click on the model name link in the Squark Leaderboard to see them. Different algorithms can rank variable importance differently, and comparing those rankings may lend insight.
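
Squark reports Variable Importance for you, and this article does not say how it is computed. Purely to illustrate what such a ranking measures, here is a minimal Python sketch of permutation importance, one common technique: shuffle a single column, re-score the model against known outcomes, and see how much accuracy drops. The model (toy_model), the column names, and the data are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows where the prediction matches the known outcome."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Average accuracy drop when one column's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for col in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, col: v} for row, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances[col] = sum(drops) / n_repeats
    return importances

# Hypothetical model: approves when income is high enough AND zip_group is "A".
def toy_model(row):
    return 1 if row["income"] >= 50 and row["zip_group"] == "A" else 0

# Hypothetical test rows with known outcomes (1 = approved).
rows = [
    {"income": 72, "zip_group": "A", "years_at_job": 3},
    {"income": 65, "zip_group": "B", "years_at_job": 8},
    {"income": 58, "zip_group": "A", "years_at_job": 1},
    {"income": 90, "zip_group": "B", "years_at_job": 5},
    {"income": 41, "zip_group": "A", "years_at_job": 7},
    {"income": 83, "zip_group": "A", "years_at_job": 2},
    {"income": 77, "zip_group": "B", "years_at_job": 6},
    {"income": 38, "zip_group": "B", "years_at_job": 4},
]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

importances = permutation_importance(toy_model, rows, labels)
for col, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {score:.3f}")
```

A column the model never uses (years_at_job here) scores zero, while a column it leans on scores above zero. If a proxy such as zip_group ranks high, that is a cue to look more closely at what the model has learned.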

Bias in Training Data
Selecting training data wisely is the best way to reduce bias. For instance, if the training data set you select is dominated by outcomes that you expect, it should be no surprise that the model will exhibit confirmation bias.
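
One quick way to spot a training set dominated by a single outcome is to tally the outcome column before training. The sketch below is only an illustration: the labels are made up, and the 80% threshold is an arbitrary assumption, not a recommended cutoff.

```python
from collections import Counter

def outcome_shares(labels):
    """Fraction of training rows belonging to each outcome."""
    counts = Counter(labels)
    total = len(labels)
    return {outcome: n / total for outcome, n in counts.items()}

def dominant_outcomes(labels, threshold=0.8):
    """Outcomes whose share meets the (assumed) dominance threshold."""
    shares = outcome_shares(labels)
    return [outcome for outcome, share in shares.items() if share >= threshold]

# Hypothetical outcome column: 18 of 20 rows share the same result.
training_labels = ["renewed"] * 18 + ["churned"] * 2

shares = outcome_shares(training_labels)
flagged = dominant_outcomes(training_labels)
print(shares)
print("Dominant outcomes:", flagged)
```

A flagged outcome does not prove the data is biased, since some outcomes really are rare, but it is a prompt to ask whether the sample reflects reality or only the cases you expected.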

Bias in Algorithms
Algorithmic bias occurs when model building takes too few training variables into account. In data sets with large numbers of features (columns), algorithms that can handle only fixed or limited numbers of training variables show high bias and result in underfitting. Certain algorithms such as Linear Regression, Linear Discriminant Analysis, and Logistic Regression are prone to high bias.
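
To make underfitting concrete, here is a small Python sketch with made-up data in which the outcome depends on the square of a variable. A straight-line fit on the raw variable cannot capture the curve, which is the high-bias behavior described above. Giving the same linear method a squared feature removes the error. The helper names fit_line and mse are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return mean_y - slope * mean_x, slope

def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on the given data."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: the true relationship is y = x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Straight line on the raw variable: underfits the curve.
a1, b1 = fit_line(xs, ys)
mse_raw = mse(xs, ys, a1, b1)

# Same method, but given the squared variable as its input feature.
squared = [x ** 2 for x in xs]
a2, b2 = fit_line(squared, ys)
mse_squared = mse(squared, ys, a2, b2)

print(f"raw feature:     slope={b1:.2f}, MSE={mse_raw:.2f}")
print(f"squared feature: slope={b2:.2f}, MSE={mse_squared:.2f}")
```

On this data the straight line comes out flat at the average outcome, so it misses the pattern entirely even on its own training rows. That large training error, rather than a gap between training and test error, is the signature of high bias.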

The takeaway: If you think your predictions may show bias, experiment. Go back to the variable selection and select/deselect suspicious columns. Iterate as many times as you need to understand your data. At that point you may decide to revise the training and production files to reflect reality with less of a “thumb on the scale.”
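
The select/deselect experiment can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Squark builds models: the "model" simply predicts the average known outcome for each combination of the selected columns, and the column names and rows are hypothetical. It compares the gap in average predictions between two groups with and without a suspicious column.

```python
from collections import defaultdict

def train_rate_table(rows, columns, target):
    """Toy model: mean target value for each combination of the chosen columns."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of target, row count]
    for row in rows:
        key = tuple(row[c] for c in columns)
        totals[key][0] += row[target]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

def group_gap(rows, columns, target, group_col):
    """Largest difference in mean prediction between groups in group_col."""
    table = train_rate_table(rows, columns, target)
    by_group = defaultdict(list)
    for row in rows:
        prediction = table[tuple(row[c] for c in columns)]
        by_group[row[group_col]].append(prediction)
    means = [sum(v) / len(v) for v in by_group.values()]
    return max(means) - min(means)

# Hypothetical training rows (approved: 1 = yes, 0 = no).
rows = [
    {"income_band": "high", "zip_group": "A", "approved": 1},
    {"income_band": "high", "zip_group": "A", "approved": 1},
    {"income_band": "high", "zip_group": "B", "approved": 0},
    {"income_band": "high", "zip_group": "B", "approved": 1},
    {"income_band": "low", "zip_group": "A", "approved": 1},
    {"income_band": "low", "zip_group": "A", "approved": 0},
    {"income_band": "low", "zip_group": "B", "approved": 0},
    {"income_band": "low", "zip_group": "B", "approved": 0},
]

gap_with = group_gap(rows, ["income_band", "zip_group"], "approved", "zip_group")
gap_without = group_gap(rows, ["income_band"], "approved", "zip_group")
print(f"gap with zip_group selected:   {gap_with:.2f}")
print(f"gap with zip_group deselected: {gap_without:.2f}")
```

In this made-up data the gap between groups disappears once zip_group is deselected. Real data is rarely that clean, because other columns can act as proxies for the one you removed, which is why the experiment is worth repeating across several column choices.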