Data Science

An interdisciplinary field encompassing scientific processes and systems that extract knowledge or insights from data in various forms, either structured or unstructured. It is an extension of data analysis fields such as statistics, machine learning, data mining, and predictive analytics.

Date Factoring

Date factoring is a feature engineering technique that splits date-time data into its component parts. For instance, a date-time field with a format of MM-DD-YYYY HH:MM:SS can be separated into variables of Month, Date, Year, Time, Day of Month, Day of Week, and Day of Year. Pre-processing data sets to add columns for these individual variables may add predictive value when building models.
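As a rough illustration of the idea, the sketch below splits one date-time string into separate features using only Python's standard library. The function name, the feature names, and the exact input format (with seconds included) are assumptions made for this example, not part of any particular product.

```python
from datetime import datetime

def factor_datetime(value: str, fmt: str = "%m-%d-%Y %H:%M:%S") -> dict:
    """Split one date-time string into separate numeric features.

    The default format is an assumption for this example (MM-DD-YYYY HH:MM:SS).
    """
    dt = datetime.strptime(value, fmt)
    return {
        "month": dt.month,
        "day_of_month": dt.day,
        "year": dt.year,
        "hour": dt.hour,
        "minute": dt.minute,
        "day_of_week": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": dt.timetuple().tm_yday,  # 1 ... 366
    }

# Each key would become its own column in the pre-processed data set.
features = factor_datetime("03-15-2021 14:30:00")
```

Each returned key can be added as a new column, so a model can pick up patterns such as weekday or seasonal effects that are hidden inside a single timestamp.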

Where the sequence in which events occur is important, regression models that forecast values based solely on discrete date/time factors may not provide useful predictions. Sales forecasting or market projections are classic examples. See Time-series Forecasting.

Decision Tree

A tree and branch-based model used to map decisions and their possible consequences, similar to a flow chart.

Deep Learning

Deep Learning is a machine learning technique where the system learns by example, similar to human learning. Deep Learning is often used where the size and complexity of data sets overwhelm more structured techniques. The ability of deep learners to extract features automatically from unstructured data enables applications such as image and voice processing.

The “deep” refers to the algorithms’ passing data from one layer of analysis to another – up to hundreds of layers. Each layer adds progressive refinement to classifications.
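The layer-to-layer flow can be pictured with a minimal sketch, assuming a tiny fully connected network with a tanh activation. The weights below are hand-picked, hypothetical values; a real deep learner would learn them from data and would typically have far more layers and units.

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer followed by a tanh activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(math.tanh(total))
    return outputs

def forward(inputs, layers):
    """Pass data through each layer in order; each output feeds the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
prediction = forward([1.0, 2.0], layers)
```

Adding more entries to `layers` makes the network deeper without changing `forward`, which is the sense in which data is handed from one layer of analysis to the next.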

Embodied AI

Robots that are equipped with AI functionality.

Ethical AI

Ethical AI uses artificial intelligence to enhance the human condition by performing tasks that are menial or impractically slow to accomplish manually. Key factors in maintaining ethical AI include:

  • Explainability – the algorithms used can be understood by humans, with calculations that can be explained in plain language. This is essential to verifying that AI is serving its intended purpose.
  • Boundedness – Ethical AI is set to operate within pre-determined boundaries, and does not have the ability to create its own pathways to exploring or learning unintended information.
  • Purpose – Ethical AI is modeled to produce only specific answers to well-defined problems.

Few-shot Learning

Computer vision systems typically need vast numbers of examples to learn how to do something. Few-shot learning tries to build systems that can be taught with minimal training examples.

Generative Adversarial Networks (GANs)

Two neural networks are trained on the same data sets. One of them then creates similar content while the other tries to determine how that result compares to the original data set. Feedback between the two can improve results. Realistic, but wholly new, media and artworks can be produced this way.



Squark Seer produces a Leaderboard that lists the best-performing models that were trained on your specific data from Squark’s set of powerful codeless AI algorithms.  While Squark Seer may have built thousands of models while you waited for results, the Leaderboard only contains the most accurate model for each algorithm we used. For example, if a Deep Learner displays in the Leaderboard, then it is the most accurate Deep Learning model we created, out of perhaps thousands of Deep Learners built for your data.  Since Squark’s Leaderboard only contains the most accurate instances of models and their underlying algorithms, if a model algorithm is absent it is because it was not used.

Squark cross-validates more than 15 algorithms that are automatically applied to your data. These include fixed, specific, and dynamic grids and multiple instances of algorithms, including XGBoost Gradient Boosting Machines, other Gradient Boosting Machines, general linear models (GLMs), multiple “Tree” methods such as Distributed Random Forests, Extreme Trees, and Isolation Trees, multiple Deep Neural Networks, and multiple types of Ensemble Models.

Each model/algorithm is listed in order of accuracy using a default metric.  Squark uses the metric “Area Under the Curve” (AUC) for binary classification, the metric “Mean per Class Error” for multi-class classification and the metric “Residual Deviance” for Regression.
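The selection and ordering described above can be sketched as follows. This is only an illustration of the logic, not Squark's implementation: the record layout, the dictionary keys, and the function name are all hypothetical, and the scores are made up.

```python
# Default ranking metric per model class, as described in the text, and
# whether a higher value is better. Key names are illustrative assumptions.
DEFAULT_METRIC = {
    "binary": ("auc", True),
    "multiclass": ("mean_per_class_error", False),
    "regression": ("residual_deviance", False),
}

def build_leaderboard(models, model_class):
    """Keep the best model per algorithm, then sort best-first."""
    metric, higher_is_better = DEFAULT_METRIC[model_class]

    def better(a, b):
        if higher_is_better:
            return a[metric] > b[metric]
        return a[metric] < b[metric]

    best = {}
    for model in models:
        current = best.get(model["algorithm"])
        if current is None or better(model, current):
            best[model["algorithm"]] = model
    return sorted(best.values(), key=lambda m: m[metric], reverse=higher_is_better)

# Made-up results for a binary classification run.
models = [
    {"algorithm": "DeepLearning", "model_id": "dl_1", "auc": 0.81},
    {"algorithm": "DeepLearning", "model_id": "dl_2", "auc": 0.86},
    {"algorithm": "GBM", "model_id": "gbm_1", "auc": 0.90},
    {"algorithm": "GLM", "model_id": "glm_1", "auc": 0.78},
]
leaderboard = build_leaderboard(models, "binary")
```

In this toy run only one Deep Learning model survives (`dl_2`), and the top row is the GBM because it has the highest AUC.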

How does Squark rank the Leaderboard?
Squark ranks the best model for your data in the Leaderboard on your results page. The ranking metric is different based on the model class. For binary classification, Squark uses Area Under the Curve (AUC).  For multi-class classification, Squark uses the Average or Mean Error per Class.  For regression, Squark uses Deviance. For all model classes, the best performing algorithm and the resultant model is identified on the top row of the Leaderboard based on the ranking metric.  This best in class model is used to determine the predictions. Squark provides a full listing of Leaderboard metrics, which may be helpful for advanced users and data scientists, including:

  • Area Under the Curve or AUC (in Binary Classification Only) is used to evaluate how well a binary classification model is able to distinguish between true positives and false positives. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a poor classifier, whose performance is no better than random guessing.
  • Mean Per Class Error (in Multi-class Classification Only) is the average of the errors of each class in your multi-class dataset. This metric speaks toward mis-classification of the data across the classes. The lower this metric, the better.
  • Residual Deviance (in Regression Only) is short for Mean Residual Deviance and measures the goodness of the model’s fit. In a perfect world, this metric would be zero. Deviance is equal to MSE in Gaussian distributions. If Deviance doesn’t equal MSE, then it gives a more useful estimate of error, which is why Squark uses it as the default metric to rank for regression models.
  • Logloss (or Logarithmic Loss) measures classification performance; specifically, uncertainty. This metric evaluates how close a model’s predicted values are to the actual target value. For example, does a model tend to assign a high predicted value like 0.90 for the positive class, or does it show a poor ability to identify the positive class and assign a lower predicted value like 0.40? Logloss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%. Logloss is sensitive to erroneous low probabilities: a very low predicted probability for the class that actually occurs is penalized heavily.
  • MAE or the Mean Absolute Error is an average of the absolute errors. The smaller the MAE, the better the model’s performance. The MAE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. MAE is not sensitive to outliers. If your data has a lot of outliers, then examine the Root Mean Square Error (RMSE), which is sensitive to outliers.
  • MSE is the Mean Square Error and is a model quality metric. Closer to zero is better.  The MSE metric measures the average of the squares of the errors or deviations. MSE takes the distances from the points to the regression line (these distances are the “errors”) and then squares them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor. MSE gives more weight to larger differences in errors than MAE.
  • RMSE is the Root Mean Square Error. The RMSE will always be larger than or equal to the MAE. The RMSE metric evaluates how well a model can predict a continuous value. The RMSE units are the same units as your data’s dependent variable/target (so if that’s dollars, this is in dollars), which is useful for understanding whether the size of the error is meaningful or not. The smaller the RMSE, the better the model’s performance. RMSE is sensitive to outliers. If your data has outliers, also examine the Mean Absolute Error (MAE), which is not as sensitive to them.
  • RMSLE is the Root Mean Square Logarithmic Error. It measures the ratio between the actual values in your data and the values predicted by the model, on a log scale. Use RMSLE instead of RMSE if an under-prediction is worse than an over-prediction, that is, where underestimating is more of a problem than overestimating. For example, is it worse to forecast too much sales revenue or too little? Also use RMSLE when your data has large numbers to predict and you don’t want to penalize large differences between the actual and predicted values (because both of the values are large numbers).
  • Confusion Matrix, if calculated, is a table depicting performance of the model used for predictions in the context of the false positives, false negatives, true positives, and true negatives, generated via cross-validation.
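
For readers who want to see the arithmetic, the metrics above can be sketched in plain Python. These are textbook definitions written for illustration, with made-up toy data; they are not Squark's code, and Residual Deviance is omitted because it depends on the assumed distribution. RMSLE is shown in its common log(1 + value) form, which is an assumption about the exact variant used.

```python
import math

def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative.

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; 0 is perfect, with no upper bound."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def mean_per_class_error(y_true, y_pred):
    """Average of the per-class misclassification rates."""
    rates = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        rates.append(wrong / len(idx))
    return sum(rates) / len(rates)

def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": pairs.count((1, 1)), "fn": pairs.count((1, 0)),
        "fp": pairs.count((0, 1)), "tn": pairs.count((0, 0)),
    }

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def rmsle(actual, predicted):
    """RMSE on log(1 + value), so it reflects relative differences."""
    return rmse([math.log1p(a) for a in actual],
                [math.log1p(p) for p in predicted])

# Toy data, made up for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
```

With this toy data, `auc(labels, probs)` is 1.0 because every positive example is scored above every negative one, and `rmse(actual, predicted)` is larger than `mae(actual, predicted)` because the single 30-unit miss is weighted more heavily once squared. Comparing `rmsle([100.0], [50.0])` with `rmsle([100.0], [150.0])` shows the asymmetry: the under-prediction gets the larger error.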