Root Mean Square Logarithmic Error or RMSLE

RMSLE, or the Root Mean Square Logarithmic Error, is the square root of the mean squared difference between the logarithm of the predicted values and the logarithm of the actual values (typically computed as log(1 + value) so that zeros are allowed). Because a difference of logs is the log of a ratio, RMSLE measures relative rather than absolute error, and it penalizes an under-prediction more heavily than an over-prediction of the same absolute size. Use RMSLE instead of RMSE when underestimating is more problematic than overestimating. For example, is it worse to forecast too much sales revenue or too little? Also use RMSLE when your data has large numbers to predict and you don’t want to penalize large absolute differences between the actual and predicted values when both values are large.
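As a rough illustration of the asymmetry described above, the sketch below computes RMSLE and RMSE by hand with the Python standard library. The function names rmsle and rmse and the sample numbers are assumptions made for this example; the log(1 + x) form is the common convention, not something a particular tool is confirmed to use.

```python
import math

def rmsle(actual, predicted):
    """Root Mean Square Logarithmic Error, using log(1 + x) so zeros are allowed."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

def rmse(actual, predicted):
    """Plain Root Mean Square Error, for comparison."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative numbers: the same absolute miss of 50 in each direction.
under_rmsle = rmsle([100.0], [50.0])    # under-prediction
over_rmsle = rmsle([100.0], [150.0])    # over-prediction
under_rmse = rmse([100.0], [50.0])
over_rmse = rmse([100.0], [150.0])

print(under_rmsle > over_rmsle)   # RMSLE treats the under-prediction as worse
print(under_rmse == over_rmse)    # RMSE treats both misses the same
```

RMSE scores both misses identically, while RMSLE scores the under-prediction as the larger error.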

SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique. Oversampling is a technique used to manage class imbalance in data sets. Data set imbalance occurs when the category you are targeting is very rare in the population, or where the data might simply be difficult to collect. SMOTE is helpful when the class you want to analyze is under-represented.

SMOTE works by generating new instances from existing minority cases that you supply as input. SMOTE does not change the number of majority cases.

New instances are not just copies of existing minority class instances. SMOTE synthesizes new minority instances between existing (real) minority instances. For each minority case it selects, the algorithm finds that case's nearest minority-class neighbors in feature space and generates new examples that combine the features of the case with the features of a neighbor. This approach increases the number of examples available for the minority class and makes the samples more general.
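The interpolation idea described above can be sketched in a few lines. This is a simplified illustration and not a full SMOTE implementation: the function name smote_sketch, the parameter names, and the sample points are all assumptions made for the example.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line segment between
    a minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # how far to move from base towards the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class rows with two numeric features.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))
```

Each synthetic point falls between two real minority points, so it is new data and not a duplicate, and the majority class is never touched.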

Squark

1.) The company that produces Squark Seer, an AI predictive tool distinguished by its use of automated machine learning (AutoML) to achieve completely codeless operation. See www.squarkai.com.

2.) In particle physics, the hypothetical supersymmetric boson counterpart of a quark, with spin 0.

Supervised Learning

Learning from data sets containing labels or known outcomes, where the algorithms build models based on the patterns in that “training” data. The resulting models are generalized and can be applied to new, never-before-seen data. Supervised Learning is used for classification and regression problems.

Training Data contains labeled data columns (features) together with known outcomes for the column to be predicted. Known outcomes may include classifications of two (binary) or more (multi-class) possibilities. Known outcomes that are scalar values (numbers) are used for regression predictions such as forecasts.

When the machine learning process is completed, the Machine Learning system uses models built from Training Data to add predicted values to the Production Data. Production data with the appended prediction values are output as Predictions data sets.
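The train-then-predict flow described above might look like the following minimal sketch. It uses a nearest-centroid classifier only because it is short; the function names, labels, and data rows are hypothetical and do not represent how any particular product builds its models.

```python
def train_nearest_centroid(rows, labels):
    """'Training': compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        totals = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }

def predict(model, row):
    """Return the label whose centroid is closest to the row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Training Data: features plus a known outcome for every row.
training_rows = [[1.0, 1.2], [0.8, 1.0], [4.0, 4.2], [4.5, 3.9]]
training_labels = ["low", "low", "high", "high"]
model = train_nearest_centroid(training_rows, training_labels)

# Production Data: features only; the predicted value is appended to each row.
production_rows = [[0.9, 1.1], [4.2, 4.0]]
predictions = [row + [predict(model, row)] for row in production_rows]
print(predictions)
```

The output rows are the production rows with one extra column holding the predicted label, mirroring the Predictions data set described above.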

Time Series Forecasting

Time series forecasting is a particular way of handling date-time information in model building. It takes into account the sequence in which events occur. This technique is essential when modeling regressions where factors such as seasonality, weather conditions, and economic indicators may be predictive of future outcomes. Consequently, sales forecasts and marketing projections are classic use cases for time series forecasting. Time series analysis utilizes algorithms that are specially tuned to predict using relative date-time information.
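To make the role of sequence and seasonality concrete, here is a sketch of a seasonal naive forecast, a simple baseline that predicts each future period with the value observed one season earlier. It is an illustration only, far simpler than the specially tuned algorithms mentioned above; the function name and the quarterly sales figures are made up for the example.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value seen one season earlier.

    history must be in chronological order; the order is what carries
    the seasonal pattern.
    """
    if season_length <= 0 or len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season_start = len(history) - season_length
    return [
        history[last_season_start + step % season_length]
        for step in range(horizon)
    ]

# Hypothetical quarterly sales for two years, oldest first.
quarterly_sales = [10, 20, 30, 40, 12, 22, 33, 41]
next_year = seasonal_naive_forecast(quarterly_sales, season_length=4, horizon=4)
print(next_year)  # [12, 22, 33, 41]
```

Shuffling the rows of quarterly_sales would destroy the forecast, which is why time series methods treat date-time order as part of the data.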

Transfer Learning

This method takes what a model learned while training for one task and reuses it as the starting point for a new, related set of tasks, without having to retrain the system from scratch.

Underfitting

Underfitting occurs when models do not learn sufficiently from training data to be useful for generalized predictions. Under-fit models do not detect the underlying patterns in data, often due to over-simplification of features or over-regularization of the model. An under-fit model performs poorly even on the data it was trained on.
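A small sketch can show what failing to detect the underlying pattern looks like. Below, a model that always predicts the average (too simple for the data) is compared with a least-squares straight line on data that follows a clear linear trend. The data and function names are assumptions chosen for illustration.

```python
def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    return slope, mean_y - slope * mean_x

# Illustrative training data with an obvious linear trend: y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

# Under-fit model: ignores x entirely and always predicts the average outcome.
average = sum(ys) / len(ys)
underfit_error = mean_squared_error(ys, [average] * len(ys))

# Adequately flexible model: a straight line.
slope, intercept = fit_line(xs, ys)
line_error = mean_squared_error(ys, [slope * x + intercept for x in xs])

print(underfit_error > line_error)
```

The constant model has a large error on its own training data, which is the usual signature of underfitting.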

Unsupervised Learning

Learning is unsupervised when AI algorithms are given unlabeled data and must make sense of it without any instruction. Such machines “teach themselves” what result to produce. The algorithm looks for structure in the training data, like finding which examples are similar to each other and grouping them into clusters.

Unsupervised Learning is used for clustering, association, anomaly detection, and recommendation engines.
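Clustering, the first use listed above, can be sketched with a bare-bones k-means loop. No labels are supplied; the algorithm groups points purely by similarity. The naive initialization (using the first k points as starting centroids), the function name, and the sample points are assumptions made to keep the example short and deterministic.

```python
def kmeans_sketch(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, assignments)."""
    def nearest(point, centroids):
        return min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
        )

    # Naive initialization (an assumption): the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    assignments = [nearest(point, centroids) for point in points]
    return centroids, assignments

# Unlabeled, made-up data with two visible groups.
points = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.1], [0.9, 1.1], [9.2, 8.9]]
centroids, assignments = kmeans_sketch(points, k=2)
print(assignments)
```

Points that end up with the same assignment number were judged similar to each other, without any known outcome guiding the grouping.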

Variable Importance

Variable importance is a metric that indicates how much an independent variable contributes to predictions in a model. The higher the value shown for a variable in its ranking, the more important it is to the model generated.

Understanding the significance of predictors provides insights for interpreting results, and also may be useful for improving model quality. For instance, editing data sets to rationalize incorrect or incomplete columns — or removing irrelevant ones — can make models faster and more accurate.
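One common way to estimate variable importance is permutation importance: shuffle one column at a time and measure how much the model's error grows. The sketch below assumes this technique for illustration only; the source does not say which importance method any given tool uses, and the stand-in model, function names, and data are hypothetical.

```python
import random

def model(row):
    # Hypothetical already-trained model: relies on feature 0, ignores feature 1.
    return 3.0 * row[0]

def mean_squared_error(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, seed=0):
    """Error increase caused by shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(rows, targets)
    importances = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this one column permuted.
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        importances.append(mean_squared_error(permuted, targets) - baseline)
    return importances

# Made-up data: the target depends only on the first feature.
rows = [[float(i), float((i * 7) % 5)] for i in range(8)]
targets = [3.0 * r[0] for r in rows]
importances = permutation_importance(rows, targets)
print(importances[0] > importances[1])
```

A column whose shuffling barely changes the error, like the second feature here, is a candidate for removal from the data set.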

Variance

Variance is a measure of a model’s sensitivity to fluctuations in training data. Models with high variance predict based on noise in training data instead of the true signal. The result is overfitting: the model appears very accurate on training data but is not predictive on new data.

Low variance is desirable, but it trades off against bias: making a model less sensitive to its training data tends to increase bias, and vice versa.
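The trade-off can be observed by training two very different models on many noisy versions of the same data and comparing how much their predictions move. In this sketch, a nearest-point model (very flexible) is compared with an always-predict-the-average model (very rigid). The data-generating rule, noise level, and all names are assumptions chosen for illustration.

```python
import random
import statistics

def make_training_set(rng, n=20, noise=1.0):
    """Noisy samples of an assumed true signal y = 0.5 * x."""
    xs = list(range(n))
    ys = [0.5 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def predict_mean(xs, ys, query):
    # Rigid model: ignores the query and returns the average outcome.
    return statistics.mean(ys)

def predict_nearest(xs, ys, query):
    # Flexible model: copies the outcome of the closest training point.
    nearest_index = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return ys[nearest_index]

rng = random.Random(0)
query, true_value = 18, 0.5 * 18
mean_preds, nearest_preds = [], []
for _ in range(200):  # 200 independently drawn training sets
    xs, ys = make_training_set(rng)
    mean_preds.append(predict_mean(xs, ys, query))
    nearest_preds.append(predict_nearest(xs, ys, query))

variance_mean = statistics.pvariance(mean_preds)
variance_nearest = statistics.pvariance(nearest_preds)
bias_mean = statistics.mean(mean_preds) - true_value
bias_nearest = statistics.mean(nearest_preds) - true_value

print(variance_nearest > variance_mean)     # flexible model: higher variance
print(abs(bias_mean) > abs(bias_nearest))   # rigid model: higher bias
```

The flexible model follows the noise in each training set (high variance, low bias), while the rigid model is stable but consistently wrong at this query point (low variance, high bias).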