Data Mining vs. Machine Learning

Data Mining describes patterns, correlations, and anomalies in data.

A mine is not the best analogy for the processes we call Data Mining. Never mind that we call data storage places bases, warehouses, and lakes. The goal of data mining is not to extract raw material, but to identify characteristics within data sets that can be used to make decisions and predictions.

Think of Data Mining as applying statistics to make it easier for humans to understand past events recorded in data. By making assumptions and testing them, analysts can generate insights that help make decisions or predict general future behavior. Because data mining works only with known, static variables, it cannot by itself predict specific outcomes for new records.

Data Mining Processes
Here are some of the commonly used terms for tasks in data mining:

  • Anomaly Detection – Identifying records that differ enough from the rest to be checked as errors or outliers.
  • Dependency Modelling – Identifying relationships among variables, such as market basket analysis for items frequently bought together.
  • Clustering – Identifying groups of records that are more similar to each other than to records in other groups.
  • Classification – Calculating the probability that a record belongs to one or more predefined categories.
  • Regression – Estimating the relationship between a dependent variable and one or more independent variables.
  • Summarization – Creating a compact representation of a data set, including reports and graphical representations.
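To make one of these tasks concrete, here is a minimal anomaly-detection sketch in plain Python that flags values whose z-score exceeds a threshold. The data, function name, and threshold are illustrative assumptions, not part of any particular data mining tool:

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean, in
    standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_anomalies(readings))  # the reading of 50 stands out
```

Real data mining tools use more robust methods (median-based scores, isolation forests, and so on), but the idea is the same: quantify how far a record sits from the rest.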

Data Mining is good for preparing data and understanding variables that may be useful for predictions. The constraints of time and human analytical capacity to query, join, parse, and process large data sets make Data Mining ill-suited to production predictive analysis.

Machine Learning to the Rescue
Automated Machine Learning (AutoML) automatically makes assumptions and iterates the models until it understands patterns—without the need for human intervention. This means that programming to account for every possible data relationship is unnecessary. The speed of results—even for large data sets—is remarkable. Best of all, the AI models can be applied to fresh data automatically, which is the essence of prediction.

The take-away: Data mining is useful to gain insights and to prepare data for predictive analytics, including AutoML. Machine Learning uses data patterns to predict future outcomes for new records.

Deep Learning vs. Machine Learning

Deep Learning is a category of machine learning with special advantages for some tasks and disadvantages for others.

Machine learning workflows begin by identifying features within data sets. For structured information with relatively few columns and rows, this is straightforward. Most practical business predictions such as classification and regression fall into this category.

Unstructured data, such as images and voice, has vast numbers of “features” in the form of individual pixels or waveforms. Identifying those features for structured AI algorithms is tedious or impossible. Deep Learning is a technique in which the algorithm itself extracts progressively higher levels of features, passing information through potentially hundreds of neural network layers. Deep learning algorithms power image and speech recognition for driverless cars and hands-free speakers.
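The idea of passing information through successive layers can be sketched in a few lines of plain Python. This is a hand-wired forward pass with made-up weights, purely to show how each layer re-represents the previous layer's output; real deep learning frameworks learn these weights from data across many more layers:

```python
def relu(values):
    # Rectified linear activation: negative signals are zeroed out.
    return [max(0.0, v) for v in values]

def layer(inputs, weights, biases):
    # Each unit computes a weighted sum of the inputs plus a bias,
    # then applies the activation function.
    return relu([
        sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        for unit_weights, bias in zip(weights, biases)
    ])

# Hand-picked (not learned) weights for a tiny two-layer network.
hidden = layer([1.0, 2.0],
               weights=[[0.5, -0.25], [1.0, 1.0]],
               biases=[0.0, -1.0])
output = layer(hidden, weights=[[1.0, 0.5]], biases=[0.1])
print(output)  # the final layer's view of the original input
```

Stacking more such layers is what makes the learner “deep”: each layer's output becomes the next layer's input, building higher-level features step by step.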

Plusses of Deep Learning

  • Scale – Deep learners can handle vast amounts of data, and they continue to improve as more data is added. Shallow learners converge and stop improving with additional data.
  • Dimensions – Deep learners can move past the limitations of a few hundred columns to perform well on very wide structured data sets.
  • Non-Numeric – Deep learning brings AI into the human realm of speech and vision, which serve people in new and valuable ways.

Minuses of Deep Learning

  • Training Data – Deep learners need labeled data from which to learn. Amassing enough labeled examples to achieve good recognition accuracy can be daunting.
  • Not for Small Data Sets – Data sets that are too simple or too small cause deep learners to fail by overfitting.
  • Resource Consumption – Deep learning on vast data stores can require days or weeks of processing on a single problem.
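The overfitting point can be made concrete with a deliberately extreme “model” that memorizes its tiny training set: perfect on the data it has seen, useless on anything new. The data and function names here are invented for illustration:

```python
def memorizing_model(training_data):
    # "Learns" by storing every example verbatim -- the extreme of overfitting.
    lookup = {tuple(features): label for features, label in training_data}
    def predict(features):
        # Perfect recall on training examples; no answer for unseen ones.
        return lookup.get(tuple(features))
    return predict

train = [([1, 2], "A"), ([3, 4], "B")]   # far too small to generalize
predict = memorizing_model(train)
print(predict([1, 2]))  # "A" -- memorized from training
print(predict([5, 6]))  # None -- fails on new data
```

A deep learner on too little data drifts toward this behavior: with enough parameters to memorize every example, it scores perfectly in training yet generalizes poorly.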

The take-away: Deep learners are great for unstructured data and may be useful for classification with large and detailed structured data sets. Squark includes deep learners in the stack of algorithms it uses for AutoML. You will know from the Squark Leaderboard whether deep learning was a winner.