Squark Glossary of AI Terms

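AUC, or Area Under the Curve (in Binary Classification Only), is used to evaluate how well a binary classification model is able to distinguish between the positive and negative classes. The curve in question is the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds. An AUC of 1 indicates a perfect classifier, while an AUC of .5 indicates a poor classifier, whose performance is no better than random guessing.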
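One way to make the AUC value concrete is to read it as the fraction of (positive, negative) row pairs in which the positive row receives the higher score. The sketch below is a minimal illustration of that reading, not how Squark computes the metric; the function name `auc` and the sample scores are assumptions made for the example.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.3, 0.1])        # every positive outranks every negative
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no ranking information
print(perfect, uninformative)
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small illustrations like this one.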

Artificial Intelligence (AI) refers to computer systems patterned after human intelligence in their ability to learn and recognize, so that previously unseen information can be acted on in ways that produce useful results. The foundations of AI include logic, mathematics, probability, decision theory, neuroscience, and linguistics.

Artificial Neural Networks are algorithms loosely modeled after the human brain, with layers of connected elements that send information to each other in the way human neurons interact.

Automated Machine Learning (AutoML) refers to systems that build machine learning models with less manual coding than a data science programmer would need when building models from scratch.

At Squark, AutoML means absolutely no coding or scripting of any kind. This is the strongest definition of AutoML. All of the steps in making predictions with machine learning models – import of training and production data, variable identification, feature engineering, classification or regression algorithm selection, hyperparameter tuning, leaderboard explanation, variable importance listing, and export of prediction data set – are performed through a SaaS, point-and-click interface.

Various other implementations of machine learning are dubbed AutoML, but actually require extensive knowledge of data science and programming. For example, you may need to select algorithm type, pick hyperparameter ranges, launch from a Jupyter notebook, know Python, or use other processes that are not familiar.

Bias is the characteristic of models to learn from some variables and not others. Some bias is essential, since machine learning must predict based on data features that are more predictive than others.

High bias occurs when model training uses too few variables, due either to limited training data features or restrictions on the number of variables an algorithm is able to consider. High bias results in underfitting.

Low bias is desirable, but is a trade-off with variance in algorithm performance.

Big Data refers to data sets that are so large or complex that traditional data processing applications are inadequate to deal with them.

Black Box algorithms are algorithms with output and decision-making processes that cannot readily be explained by developers or the computer itself.

Classification in statistical or machine learning models refers to description of the relationship between a dependent variable (outcome variable) and independent variables (features) in data sets when comparing discrete values (integers, enumerations, strings, text vectors, etc.), as opposed to scalar (continuously variable) real numbers.

Machine learning classification algorithms assign categories to data set members based on the models built from training data. Binary classification models predict “yes-no” or “in-out” for each row when there are only two choices (classes) of dependent variable. Multi-class, or multinomial, classification models predict the probability that a data set member is in one of three or more classes.

Clustering algorithms let machines group data points or items into groups with similar characteristics.

Coefficients indicate the relationship of independent variables to the dependent variable in a model. Positive coefficients show that as the independent variable moves upwards, so does the dependent variable. Negative coefficients indicate that as the independent variable goes up, the dependent variable goes down.
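The sign of a coefficient can be seen in a one-variable least-squares fit. This is a minimal sketch with made-up numbers; `fit_line` is a hypothetical helper written for this example and does not represent any particular product's model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
rising_slope, _ = fit_line(xs, [3, 5, 7, 9])     # y grows as x grows
falling_slope, _ = fit_line(xs, [10, 8, 6, 4])   # y shrinks as x grows
print(rising_slope, falling_slope)
```

The first data set yields a positive coefficient and the second a negative one, matching the two cases described above.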

Computer Vision is the use of AI to examine and interpret images in order to define or recognize them, similar to the way humans see.

Confirmation bias is a human tendency to find answers that match preconceived beliefs. It may manifest through selective gathering of evidence that supports desired conclusions and/or by interpreting results in ways that reinforce beliefs.

Confirmation bias can enter data analysis through unbalanced selection of the data to be analyzed and/or by filtering the resulting analyses in ways that support preconceived notions.

A Confusion Matrix, if calculated, is a table depicting performance of prediction models on false positives, false negatives, true positives, and true negatives. It is so named because it shows how often the model confuses the two labels. The matrix is generated by cross-validation – comparing predictions against a benchmark hold-out of data.
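The four cells of a binary confusion matrix can be tallied directly from actual and predicted labels. The sketch below assumes labels are coded as 1 (positive) and 0 (negative); the function name and the sample data are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for labels coded as 1 and 0."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1
        elif p == 1 and a == 0:
            counts["FP"] += 1
        elif p == 0 and a == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_counts(actual, predicted)
print(matrix)
```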

Data Science is an interdisciplinary field encompassing scientific processes and systems that extract knowledge or insights from data in various forms, either structured or unstructured. It is an extension of data analysis fields such as statistics, machine learning, data mining, and predictive analytics.

Date factoring is a feature engineering technique that splits date-time data into its component parts. For instance, a date-time field with a format of MM-DD-YYYY HH:MM can be separated into variables of Month, Date, Year, Time, Day of Month, Day of Week, and Day of Year. Pre-processing data sets to add columns for these individual variables may add predictive value when building models.
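As a rough sketch of date factoring, the snippet below splits one date-time string into separate columns using Python's standard `datetime` module. The input format `%m-%d-%Y %H:%M` and the output column names are assumptions chosen for the example.

```python
from datetime import datetime


def factor_date(text, fmt="%m-%d-%Y %H:%M"):
    """Split a date-time string into component features."""
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,
        "month": moment.month,
        "day_of_month": moment.day,
        "day_of_week": moment.weekday(),            # 0 = Monday ... 6 = Sunday
        "day_of_year": moment.timetuple().tm_yday,
        "hour": moment.hour,
        "minute": moment.minute,
    }


parts = factor_date("03-15-2024 14:30")
print(parts)
```

Each key would become its own column in the data set, so a model can pick up patterns such as day-of-week effects that a single raw timestamp hides.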

Where the sequence in which events occur is important, regression models that forecast values based solely on discrete date/time factors may not provide useful predictions. Sales forecasting or market projections are classic examples. See Time-series Forecasting.

A Decision Tree is a tree and branch-based model used to map decisions and their possible consequences, similar to a flow chart.

Deep Learning is a machine learning technique where the system learns by example, similar to human learning. Deep Learning is often used where the size and complexity of data sets overwhelm more structured techniques. The ability of deep learners to extract features automatically from unstructured data enables applications such as image and voice processing.

The “deep” refers to the algorithms’ passing data from one layer of analysis to another – up to hundreds of layers. Each layer adds progressive refinement to classifications.

Robots that are equipped with AI functionality.

Ethical AI uses artificial intelligence to enhance the human condition by performing tasks that are menial or impractically slow to accomplish manually. Key factors in maintaining ethical AI include:

  • Explainability – the algorithms used can be understood by humans, with calculations that can be explained in plain language. This is essential to verifying that AI is serving its intended purpose.
  • Boundedness – Ethical AI is set to operate within pre-determined boundaries, and does not have the ability to create its own pathways to exploring or learning unintended information.
  • Purpose – Ethical AI is modeled to produce only specific answers to well-defined problems.

Explainable AI is AI that reveals to human users how it arrived at its conclusions.

“Features” are the properties or characteristics of something you want to predict. Machine learning predictions can often be improved by “engineering”— adjusting features that are already there or adding features that are missing. For instance, date/time fields may appear as nearly unique values that are not predictive. Breaking the single date/time feature into year, month, day, day of the week, and time of day may allow the machine learning model to reveal patterns otherwise hidden.

You can engineer features on your own data sets, but automated feature engineering is now part of advanced machine learning systems. This ability to sense opportunities to improve data set values and do so automatically contributes vastly improved performance without the tedium of doing it manually.

Common kinds of feature engineering include:

  • Expansion, as in the date/time example
  • Binning – grouping variables with minor variations into fewer values, as in putting all heights from 5’7.51” to 5’8.49” into category 5’8”
  • Interaction features – adding, subtracting, or multiplying features that interact
  • Removing unused or redundant features
  • Text vectoring – deriving commonalities from repeated terms in otherwise unique strings
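The binning item in the list above can be sketched as rounding each height to the nearest whole inch. This is only an illustration of the idea; `bin_height` is a hypothetical helper and heights are assumed to be given in inches.

```python
def bin_height(inches):
    """Map a height in inches to the nearest whole-inch category, e.g. 5'8"."""
    nearest = int(inches + 0.5)          # round half up for positive values
    feet, remainder = divmod(nearest, 12)
    return f"{feet}'{remainder}\""


low = bin_height(67.51)    # 5 ft 7.51 in
high = bin_height(68.49)   # 5 ft 8.49 in
print(low, high)
```

Both heights land in the same category, so the model sees one value instead of many near-unique measurements.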

Computer vision systems typically need vast numbers of examples to learn how to do something. Few-shot learning tries to build systems that can be taught with minimal training examples.

In Generative Adversarial Networks (GANs), two neural networks are trained on the same data sets. One of them then creates similar content while the other tries to determine how that result compares to the original data set. Feedback between the two can improve results. Realistic, but wholly new, media and artworks can be produced this way.

Hyperparameters are variables external to and not directly related to the data sets of known outcomes that are used to train Machine Learning models. A hyperparameter is a configuration variable that is used to optimize model performance.

Automated Machine Learning (AutoML) systems such as Squark tune hyperparameters automatically. Data scientists who build models manually can write code that controls hyperparameters to seek ways to improve model performance.

Examples of hyperparameters are:

  • Learning rate and duration
  • Latent factors in matrix factorization
  • Leaves, or depth, of a tree
  • Hidden layers in a deep neural network
  • Clusters in a k-means clustering
  • The k in k-nearest-neighbors
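To illustrate the first item in the list above, the sketch below treats the learning rate and the number of steps as hyperparameters of a tiny gradient-descent "model" and tries several rates. The toy loss `(w - 3)^2`, the candidate rates, and the function names are all assumptions made for the example; real tuning systems search far larger spaces.

```python
def train(learning_rate, steps, target=3.0):
    """Gradient descent on the loss (w - target)**2; returns the learned weight w."""
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - target)
        w -= learning_rate * gradient
    return w


def loss(w, target=3.0):
    return (w - target) ** 2


# The learned value w comes from training; the rate and step count are set from outside.
candidate_rates = [0.01, 0.1, 0.5, 1.1]
results = {rate: loss(train(rate, steps=50)) for rate in candidate_rates}
best_rate = min(results, key=results.get)
print(results, best_rate)
```

A rate that is too small leaves the model short of the target after 50 steps, while a rate that is too large makes training diverge, which is why the setting has to be searched rather than learned from the data.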

Squark Seer produces a Leaderboard that lists the best-performing models that were trained on your specific data from Squark’s set of powerful codeless AI algorithms.  While Squark Seer may have built thousands of models while you waited for results, the Leaderboard only contains the most accurate model for each algorithm we used. For example, if a Deep Learner displays in the Leaderboard, then it is the most accurate Deep Learning model we created, out of perhaps thousands of Deep Learners built for your data.  Since Squark’s Leaderboard only contains the most accurate instances of models and their underlying algorithms, if a model algorithm is absent it is because it was not used.

Squark cross-validates more than 15 algorithms that are automatically applied to your data. These algorithms include fixed, specific, and dynamic grids and multiple instances of algorithms, including XGBoost Gradient Boosting Machines, other Gradient Boosting Machines, generalized linear models (GLMs), multiple “Tree” methods such as Distributed Random Forests, Extreme Trees, and Isolation Trees, multiple Deep Neural Networks, and multiple types of Ensemble Models.

Each model/algorithm is listed in order of accuracy using a default metric.  Squark uses the metric “Area Under the Curve” (AUC) for binary classification, the metric “Mean per Class Error” for multi-class classification and the metric “Residual Deviance” for Regression.

How does Squark rank the Leaderboard?
Squark ranks the best model for your data in the Leaderboard on your results page. The ranking metric is different based on the model class. For binary classification, Squark uses Area Under the Curve (AUC).  For multi-class classification, Squark uses the Average or Mean Error per Class.  For regression, Squark uses Deviance. For all model classes, the best performing algorithm and the resultant model is identified on the top row of the Leaderboard based on the ranking metric.  This best in class model is used to determine the predictions. Squark provides a full listing of Leaderboard metrics, which may be helpful for advanced users and data scientists, including:

  • Area Under the Curve or AUC (in Binary Classification Only) is used to evaluate how well a binary classification model is able to distinguish between the positive and negative classes. The curve in question is the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds. An AUC of 1 indicates a perfect classifier, while an AUC of .5 indicates a poor classifier, whose performance is no better than random guessing.
  • Mean Per Class Error (in Multi-class Classification only) is the average of the errors of each class in your multi-class dataset. This metric speaks toward misclassification of the data across the classes. The lower this metric, the better.
  • Residual Deviance (in Regression Only) is short for Mean Residual Deviance and measures the goodness of the model’s fit. In a perfect world, this metric would be zero. Deviance is equal to MSE in Gaussian distributions. If Deviance doesn’t equal MSE, then it gives a more useful estimate of error, which is why Squark uses it as the default metric to rank for regression models.
  • Logloss (or Logarithmic Loss) measures classification performance; specifically, uncertainty. This metric evaluates how closely a model’s predicted values are to the actual target value. For example, does a model tend to assign a high predicted value like .90 for the positive class, or does it show a poor ability to identify the positive class and assign a lower predicted value like .40? Logloss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%. Logloss heavily penalizes confident predictions that turn out to be wrong, such as a very low probability assigned to a row that is actually in the positive class.
  • MAE or the Mean Absolute Error is an average of the absolute errors. The smaller the MAE, the better the model’s performance. The MAE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. MAE is not sensitive to outliers. If your data has a lot of outliers, then examine the Root Mean Square Error (RMSE), which is sensitive to outliers.
  • MSE is the Mean Square Error and is a model quality metric. Closer to zero is better.  The MSE metric measures the average of the squares of the errors or deviations. MSE takes the distances from the points to the regression line (these distances are the “errors”) and then squares them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor. MSE gives more weight to larger differences in errors than MAE.
  • RMSE is the Root Mean Square Error. The RMSE will always be larger than or equal to the MAE. The RMSE metric evaluates how well a model can predict a continuous value. The RMSE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. The smaller the RMSE, the better the model’s performance. RMSE is sensitive to outliers. If your data does not have outliers, then examine the Mean Absolute Error (MAE), which is not as sensitive to outliers.
  • RMSLE is the Root Mean Square Logarithmic Error. It is the ratio (the log) between the actual values in your data and predicted values in the model. Use RMSLE instead of RMSE if an under-prediction is worse than an over-prediction – where underestimating is more of a problem than overestimating. For example, is it worse to forecast too much sales revenue or too little? Use RMSLE when your data has large numbers to predict and you don’t want to penalize large differences between the actual and predicted values (because both of the values are large numbers).
  • Confusion Matrix, if calculated, is a table depicting performance of the model used for predictions in the context of the false positives, false negatives, true positives, and true negatives, generated via cross-validation.
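The regression metrics in the list above can be compared on a small made-up data set. The sketch below uses the common textbook definitions (RMSLE is computed with `log(1 + value)`, which is an assumption about the exact variant); the function names and numbers are illustrative and are not taken from Squark.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))


def rmsle(actual, predicted):
    """Root mean square of the differences between log(1 + value) terms."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 300]   # the last prediction is far off

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))

# Same absolute miss of 50, in opposite directions:
under = rmsle([100], [50])
over = rmsle([100], [150])
print(under, over)
```

The single large miss pulls RMSE well above MAE, and the under-prediction receives the larger RMSLE of the two equal-sized misses.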

Linear Algebra is a branch of mathematics concerning vector spaces and linear mappings between them. It includes the study of lines, planes, and subspaces, but is also concerned with properties common to all vector spaces.

Logloss (or Logarithmic Loss) measures classification performance; specifically, uncertainty. This metric evaluates how closely a model’s predicted values are to the actual target value. For example, does a model tend to assign a high predicted value like .90 for the positive class, or does it show a poor ability to identify the positive class and assign a lower predicted value like .40? Logloss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%. Logloss heavily penalizes confident predictions that turn out to be wrong, such as a very low probability assigned to a row that is actually in the positive class.
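A short sketch of binary logloss shows how the penalty grows as the predicted probability for the true class falls. The clipping constant `eps` and the function name are assumptions for the example; libraries differ in how they guard against `log(0)`.

```python
import math


def logloss(actual, p1, eps=1e-15):
    """Average negative log-likelihood for labels coded 1/0 and predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(actual, p1):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(actual)


confident_right = logloss([1], [0.90])
hesitant = logloss([1], [0.40])
confident_wrong = logloss([1], [0.01])
print(confident_right, hesitant, confident_wrong)
```

The last case exceeds 1, which illustrates that the metric has no fixed upper bound.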

“Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.” This definition, often attributed to computer pioneer Arthur L. Samuel, is actually a paraphrase of his work from a 1959 paper, “Some Studies in Machine Learning Using the Game of Checkers” in IBM Journal of Research and Development.

This notion that computers could learn from data and outcomes does hold up as a useful description of Machine Learning today. Samuel correctly predicted, “Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.”

F1 is a score between 1 (best) and zero (worst) that shows how well a classification model performed on your dataset. It is a check different from accuracy that measures how well the model performed at identifying the differences among groups. For instance, if you are classifying 100 wines – 99 red and one white – and your model predicted that all 100 are red, then it is 99% accurate. But the high accuracy veils the model’s inability to detect the difference between red and white wines.

F1 is particularly revelatory when there are imbalances in class frequency, as in the wine example. F1 calculations consider both Precision and Recall in the model:

Precision = How likely is a positive classification to be correct? = True Positives/(True Positives + False Positives)

Recall = How likely is the classifier to detect a positive? = True Positives/(True Positives + False Negatives)

F1 = 2 * ((Precision * Recall) / (Precision + Recall))
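The wine example can be worked through with the formulas above. In this sketch, white wine is treated as the positive class, and a 0/0 precision or recall is reported as 0, which is a common convention but an assumption here; the function name is hypothetical.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall, and F1 for one class; undefined ratios are reported as 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if p != positive and a == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * (precision * recall) / (precision + recall)


actual = ["red"] * 99 + ["white"]
predicted = ["red"] * 100          # the model never predicts white

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
_, _, f1_white = precision_recall_f1(actual, predicted, positive="white")
print(accuracy, f1_white)
```

Accuracy stays at 99% while the F1 score for the white class drops to zero, exposing the failure that accuracy hides.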

Max F1 is the probability threshold at which the F1 score is highest, and it is used as the cut-off point for probabilities in predictions. When a row’s P1 (will occur) value is at or above the Max F1 threshold, the outcome will be predicted to happen in the future. If a row’s P1 value is below the Max F1 threshold, the outcome will be predicted not to happen. This explains why the cutoff point is not always 50% as you might expect.
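One way to picture a cut-off chosen by F1 is to try each predicted probability as a candidate threshold and keep the one with the highest F1. This is a simplified sketch on invented data, not a description of Squark's internal procedure; `f1_at` and the sample values are assumptions.

```python
def f1_at(threshold, labels, p1):
    """F1 score when rows with P1 >= threshold are predicted as positive."""
    predicted = [1 if p >= threshold else 0 for p in p1]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [1, 1, 0, 1, 0, 0, 0, 0]
p1 = [0.9, 0.6, 0.55, 0.35, 0.3, 0.2, 0.1, 0.05]

best_threshold = max(sorted(set(p1)), key=lambda t: f1_at(t, labels, p1))
best_f1 = f1_at(best_threshold, labels, p1)
print(best_threshold, best_f1)
```

On this data the best cut-off falls below 0.5, because lowering the threshold recovers a positive row without adding many false positives.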

MAE or the Mean Absolute Error is an average of the absolute errors. The smaller the MAE the better the model’s performance. The MAE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. MAE is not sensitive to outliers. If your data has a lot of outliers, then examine the Root Mean Square Error (RMSE), which is sensitive to outliers.

Mean Per Class Error (in Multi-class Classification only) is the average of the errors of each class in your multi-class data set. This metric speaks toward misclassification of the data across the classes. The lower this metric, the better.
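A minimal sketch of the calculation: compute the error rate inside each class, then average those rates so that every class counts equally regardless of its size. The function name and sample labels are made up for the example.

```python
from collections import defaultdict


def mean_per_class_error(actual, predicted):
    """Return (mean of per-class error rates, dict of error rate by class)."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a != p:
            wrong[a] += 1
    per_class = {label: wrong[label] / totals[label] for label in totals}
    return sum(per_class.values()) / len(per_class), per_class


actual = ["a", "a", "a", "a", "b", "b", "c", "c"]
predicted = ["a", "a", "a", "b", "b", "b", "c", "a"]

mean_error, per_class = mean_per_class_error(actual, predicted)
print(mean_error, per_class)
```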

MSE is the Mean Square Error and is a model quality metric.  Closer to zero is better.  The MSE metric measures the average of the squares of the errors or deviations. MSE takes the distances from the points to the regression line (these distances are the “errors”) and then squares them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor. MSE gives more weight to larger differences in errors than MAE.

Natural Language Processing (NLP) is the discipline within A.I. that deals with written and spoken language.

Overfitting happens when models perform well – with high apparent accuracy – on training data, but perform poorly on new data. This is often the result of learning from noise or fluctuations in training data. Comparing results to hold-out data reveals the extent of a model’s ability to be useful for generalized predictions, and is a good barometer for detecting overfitting.

Pragmatic AI is designed to solve well-defined problems, as opposed to being allowed to seek its own purpose.

Predictive Analytics comprises statistical techniques gathered from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.

Regression in statistical or machine learning models refers to description of the relationship between a dependent variable (outcome variable) and independent variables (features) in data sets when the values are scalar (continuously variable) real numbers, as opposed to discrete values (integers, enumerations, strings, text vectors, etc.).

Reinforcement Learning is learning from unlabeled data based on reward-punishment feedback with successive tries at stochastic (random) solutions to problems. Reinforcement Learning is useful when there are rules, but no pre-defined methods to approach problems, such as in games or autonomous navigation.

Residual Deviance (in Regression Only) is short for Mean Residual Deviance and measures the goodness of the model’s fit. In a perfect world this metric would be zero. Deviance is equal to MSE in Gaussian distributions. If Deviance doesn’t equal MSE, then it gives a more useful estimate of error, which is why Squark uses it as the default metric to rank for regression models.

RMSE is the Root Mean Square Error. The RMSE will always be larger than or equal to the MAE. The RMSE metric evaluates how well a model can predict a continuous value. The RMSE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. The smaller the RMSE, the better the model’s performance. RMSE is sensitive to outliers. If your data does not have outliers, then examine the Mean Absolute Error (MAE), which is not as sensitive to outliers.

RMSLE, or the Root Mean Square Logarithmic Error, is the ratio (the log) between the actual values in your data and predicted values in the model. Use RMSLE instead of RMSE if an under-prediction is worse than an over-prediction – where underestimating is more problematic than overestimating. For example, is it worse to forecast too much sales revenue or too little?  Use RMSLE when your data has large numbers to predict and you don’t want to penalize large differences between the actual and predicted values (because both of the values are large numbers).

SMOTE stands for Synthetic Minority Over-sampling Technique. Oversampling is a technique used to manage class imbalance in data sets. Data set imbalance occurs when the category you are targeting is very rare in the population, or where the data might simply be difficult to collect. SMOTE is helpful when the class you want to analyze is under-represented.

SMOTE works by generating new instances from existing minority cases that you supply as input. SMOTE does not change the number of majority cases.

New instances are not just copies of existing minority class instances. SMOTE synthesizes new minority instances between existing (real) minority instances. The algorithm takes samples of the feature space for each target class and its nearest neighbors, and generates new examples combining the features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general.
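The interpolation idea described above can be sketched in a few lines. This is a simplified, SMOTE-like illustration rather than the full published algorithm: it picks a minority point, picks one of its `k` nearest minority neighbors, and places a new point somewhere on the line between them. The function name, the parameters, and the sample points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, nearest first (squared Euclidean distance).
        others = [point for point in minority if point is not base]
        others.sort(key=lambda point: sum((a - b) ** 2 for a, b in zip(point, base)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()   # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies, and the majority class is left untouched.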

1.) The company that produces Squark Seer, the most powerful AI predictive tool available, distinguished by its use of automated machine learning (AutoML) to achieve completely codeless operation. See www.squarkai.com.

2.) In particle physics, the hypothetical supersymmetric boson counterpart of a quark, with spin 0.

Learning from data sets containing labels or known outcomes, where the algorithms build models based on the patterns in that “training” data. The resulting models are generalized and can be applied to new, never-before-seen data. Supervised Learning is used for classification and regression problems.

Training Data contains labels for data columns (features) and known outcomes for the columns (features) to be predicted. Known outcomes may include classifications of two (binary) or more (multi-class) possibilities. Known outcomes that are scalar values (numbers) are used for regression predictions such as forecasts.

When the machine learning process is completed, the Machine Learning system uses models built from Training Data to add predicted values to the Production Data. Production data with the appended prediction values are output as Predictions data sets.

Time series forecasting is a particular way of handling date-time information in model building. It takes into account the sequence in which events occur. This technique is essential when modeling regressions where factors such as seasonality, weather conditions, and economic indicators may be predictive of future outcomes. Consequently, sales forecasts and marketing projections are classic use cases for time series forecasting. Time series analysis utilizes algorithms that are specially tuned to predict using relative date-time information.

Transfer Learning tries to take what a system learned from training for one task and reuse it for a new set of tasks, without having to retrain the system from scratch.

Underfitting occurs when models do not learn sufficiently from training data to be useful for generalized predictions. Under-fit models do not detect the underlying patterns in data, often due to over-simplification of features or over-regularization of training data.

Learning is unsupervised when AI algorithms are given unlabeled data and must make sense of it without any instruction. Such machines “teach themselves” what result to produce. The algorithm looks for structure in the training data, like finding which examples are similar to each other and grouping them into clusters.

Unsupervised Learning is used for clustering, association, anomaly detection, and recommendation engines.

Variable importance is a metric that indicates how much an independent variable contributes to predictions in a model. The higher the value shown for a variable in its ranking, the more important it is to the model generated.

Understanding the significance of predictors provides insights for interpreting results, and also may be useful for improving model quality. For instance, editing data sets to rationalize incorrect or incomplete columns — or removing irrelevant ones — can make models faster and more accurate.

Variance is a measure of a model’s sensitivity to fluctuations in training data. Models with high variance predict based on noise in training data instead of the true signal. The result is overfitting – a characteristic that shows its inability to be predictive on new data while apparently being very accurate on training data.

Low variance is desirable, but is a trade-off with bias in algorithm performance.

Weak AI, also called Narrow AI, is the current state of AI, which does single tasks like playing games, recognizing images, or predicting outcomes. This is as opposed to Strong AI, also known as Artificial General Intelligence (AGI), which could do anything that humans do.

Copyright © All Rights Reserved. Squark Is A Unit Of Vizadata, LLC