The Nine Data Mining Algorithms in SSAS

SQL Server Analysis Services (SSAS) includes nine data mining algorithms. In addition, SQL Server Integration Services (SSIS) includes two text mining transformations. The list below summarizes the nine SSAS algorithms and their common uses.

Decision Tree: Decision Trees is a popular data mining algorithm used to predict both discrete and continuous variables. Its results are comparatively easy to understand, which is one reason the algorithm is so popular. The algorithm uses the discrete input variables to split the tree into nodes. If you predict continuous variables, you get a piecewise multiple linear regression formula, with a separate formula in each node of the tree. A tree that predicts continuous variables is called a Regression Tree.

Linear Regression: Linear Regression predicts continuous variables only, using a single multiple linear regression formula. The input variables must be continuous as well. Linear Regression is the simplest case of a Regression Tree: a tree with no splits.
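The relationship between a Regression Tree and Linear Regression can be illustrated with a minimal Python sketch. This is not the SSAS implementation; the attribute names (`region`, `income`) and all coefficients are made up for illustration. A discrete input selects the leaf, and each leaf applies its own linear formula to the continuous input. With no splits, only one formula remains.

```python
from typing import Callable, Dict

LinearFormula = Callable[[float], float]


def make_linear(intercept: float, slope: float) -> LinearFormula:
    """Build a one-variable linear formula: intercept + slope * x."""
    return lambda x: intercept + slope * x


# Hypothetical Regression Tree: one split on the discrete input "region",
# with a separate linear formula in each leaf (coefficients are invented).
regression_tree: Dict[str, LinearFormula] = {
    "North": make_linear(100.0, 2.0),
    "South": make_linear(50.0, 3.5),
}

# Hypothetical Linear Regression: a "tree" with no splits, so one formula.
linear_regression: LinearFormula = make_linear(80.0, 2.5)


def predict_tree(region: str, income: float) -> float:
    """Pick the leaf by the discrete input, then apply that leaf's formula."""
    return regression_tree[region](income)


def predict_linear(income: float) -> float:
    """Apply the single formula to every case."""
    return linear_regression(income)
```

The tree's prediction is piecewise linear: the slope and intercept change depending on which leaf a case falls into, while the no-split model uses the same formula for all cases.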

Naive Bayes: For each state of the predictable attribute, the Naive Bayes algorithm calculates the probability of each possible state of every input attribute. You can later use those probabilities to predict the outcome of the target attribute based on the known input attributes. Because this algorithm is quite simple, it builds models very quickly. Therefore, you can use this algorithm as a starting point in your prediction task. The Naive Bayes algorithm does not support continuous attributes directly, though Analysis Services provides options to discretize continuous variables. The Excel add-in will also discretize continuous variables for use with Naive Bayes, through the "Analyze Key Influencers" option on the Analyze tab.
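The counting behind Naive Bayes can be sketched in a few lines of Python. This is a simplified illustration, not the SSAS implementation: the attribute names and training cases are hypothetical, all attributes are discrete, and no smoothing is applied, so an unseen value yields a probability of zero.

```python
from collections import Counter


def train(rows, target):
    """Count target states and (attribute, value, target state) combinations."""
    state_counts = Counter(row[target] for row in rows)
    pair_counts = Counter()
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                pair_counts[(attribute, value, row[target])] += 1
    return state_counts, pair_counts


def predict(state_counts, pair_counts, case):
    """Return normalized probabilities of each target state for one case."""
    total = sum(state_counts.values())
    scores = {}
    for state, n in state_counts.items():
        score = n / total  # prior probability of the target state
        for attribute, value in case.items():
            # P(input value | target state), estimated from raw counts
            score *= pair_counts[(attribute, value, state)] / n
        scores[state] = score
    norm = sum(scores.values())
    return {s: v / norm for s, v in scores.items()} if norm else scores


# Hypothetical training cases (discrete attributes only)
rows = [
    {"Commute": "Short", "OwnsCar": "No", "BuysBike": "Yes"},
    {"Commute": "Short", "OwnsCar": "Yes", "BuysBike": "Yes"},
    {"Commute": "Long", "OwnsCar": "Yes", "BuysBike": "No"},
    {"Commute": "Long", "OwnsCar": "No", "BuysBike": "No"},
    {"Commute": "Short", "OwnsCar": "Yes", "BuysBike": "No"},
]

model = train(rows, target="BuysBike")
probabilities = predict(*model, {"Commute": "Short", "OwnsCar": "No"})
```

Training is a single pass over the cases that only increments counters, which is why models of this kind build so quickly.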

Neural Network: The Neural Network algorithm comes from artificial intelligence research. You can use this algorithm for predictions as well. Neural networks search for nonlinear functional dependencies. They perform nonlinear transformations on the data in layers, from the input layer through a hidden layer to the output layer. Because they are harder to interpret than simpler algorithms such as Decision Trees, and because they typically take more processing time, Neural Networks have not been as popular. However, once processed, Neural Network models may compare favorably with other models when you evaluate them using the lift charts in Analysis Services.

Logistic Regression: Just as Linear Regression is a simple Regression Tree, Logistic Regression is a Neural Network without any hidden layers. The Logistic Regression algorithm can predict either continuous or discrete outputs.
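The structural difference between the two algorithms can be shown with a small forward-pass sketch in Python. It covers only a single probability-style output, and the activation functions (tanh for hidden units, sigmoid for the output) and all weights are assumptions chosen for illustration, not a statement of how SSAS trains or scores its models.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(values, weights, bias):
    """Bias plus the dot product of values and weights."""
    return bias + sum(w * v for w, v in zip(weights, values))


def neural_network(inputs, hidden_layer, output_weights, output_bias):
    """Input layer -> one hidden layer -> output unit.

    hidden_layer is a list of (weights, bias) pairs, one per hidden unit.
    """
    hidden = [math.tanh(weighted_sum(inputs, w, b)) for w, b in hidden_layer]
    return sigmoid(weighted_sum(hidden, output_weights, output_bias))


def logistic_regression(inputs, weights, bias):
    """Same output unit, but the inputs feed it directly: no hidden layer."""
    return sigmoid(weighted_sum(inputs, weights, bias))
```

For example, `logistic_regression([1.0, 2.0], [0.3, 0.4], -0.5)` returns a probability above 0.5 because the weighted sum is positive. Removing the hidden layer removes the intermediate nonlinear transformation, so the output is a sigmoid applied to a single linear combination of the inputs.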

Clustering: The Clustering algorithm groups cases from a dataset into clusters of cases with similar characteristics. Using these clusters, you can explore the data and learn about relationships among your cases. Additionally, you can create predictions from the clustering model created by the algorithm. For example, you can use the Clustering method to group your customers for a Customer Relationship Management (CRM) application. In addition, you can use Clustering to search for anomalies in your data using the default Expectation Maximization processing. A case that does not fit well into any cluster may be worth further inspection. This ability is useful for fraud detection; a transaction that does not fit in any discovered cluster might be a fraudulent transaction.
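The anomaly-detection idea can be sketched in Python under strong simplifying assumptions: a single continuous attribute (a transaction amount), clusters that have already been fitted and are described by a weight, mean, and standard deviation, and an arbitrary density threshold. The cluster parameters and the threshold below are hypothetical, and the Expectation Maximization fitting step itself is not shown.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, clusters):
    """Weighted sum of each cluster's density; clusters are (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in clusters)


def is_anomaly(x, clusters, threshold=1e-4):
    """Flag a case whose density under every cluster is very low."""
    return mixture_density(x, clusters) < threshold


# Hypothetical fitted clusters of transaction amounts:
# many small purchases around 50, fewer large purchases around 500.
clusters = [(0.6, 50.0, 10.0), (0.4, 500.0, 50.0)]

typical = is_anomaly(55.0, clusters)    # close to the first cluster
unusual = is_anomaly(250.0, clusters)   # far from both clusters
```

A flagged case is only a candidate for inspection; the threshold controls how many cases are flagged and would need tuning against real data.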

Sequence Clustering: Sequence Clustering searches for clusters based on a model, rather than on similarity of cases. It builds models from sequences of events by using Markov Chains. You can use this algorithm on any sequential data. A typical use is analyzing how visitors navigate your company’s website.
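The Markov Chain part of this description can be illustrated with a short Python sketch that estimates first-order transition probabilities from click paths. The page names are hypothetical, and the sketch covers only the transition model; it does not group the sequences into clusters.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each event with the event that follows it
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


# Hypothetical website visits, each a sequence of pages viewed
visits = [
    ["Home", "Products", "Cart"],
    ["Home", "Products", "Home"],
    ["Home", "Support"],
]

probabilities = transition_probabilities(visits)
```

Here two of the three transitions out of "Home" lead to "Products", so that transition gets a probability of two thirds. Groups of visitors whose paths are best described by different transition tables would correspond to different sequence clusters.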

Association Rules: The Association Rules algorithm is designed for market basket analysis and recommendation engines. The algorithm defines an itemset as a combination of items in a single transaction. It scans the dataset and counts the number of times each itemset appears in transactions. You should use this algorithm to detect cross-selling opportunities. Association Rules typically needs many cases to train the model well.
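The itemset counting described above can be sketched in Python as follows. This is a brute-force illustration with hypothetical transactions, limited to itemsets of one or two items; it does not reflect how SSAS prunes candidate itemsets. The `rule_probability` helper is an added illustration of how counts turn into a rule such as "customers who buy a bike also buy a helmet".

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Count how many transactions contain each itemset up to max_size items."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore duplicates, fix the item order
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return counts


def rule_probability(counts, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return counts[both] / counts[tuple(sorted(antecedent))]


# Hypothetical market baskets
transactions = [
    ["bike", "helmet"],
    ["bike", "helmet", "gloves"],
    ["bike"],
    ["helmet", "gloves"],
]

counts = count_itemsets(transactions)
bike_to_helmet = rule_probability(counts, ["bike"], ["helmet"])
```

With only four baskets the estimates are unstable, which illustrates why the algorithm needs many cases to produce reliable rules.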

Time Series: The Time Series algorithm is designed for forecasting continuous variables. Internally, the algorithm uses Regression Trees on automatically transformed data, an approach also called Auto-Regression Trees (ART). The algorithm can also blend ARTXP (ART with cross prediction) with ARIMA.
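One common way to make a time series usable by a regression method is to turn it into cases whose inputs are the preceding values. The Python sketch below shows that lag transformation under the assumption that it resembles the automatic transformation mentioned above; the function name, the number of lags, and the sample values are hypothetical.

```python
def make_lagged_cases(series, lags=2):
    """Turn a series into (previous values, current value) training cases."""
    cases = []
    for t in range(lags, len(series)):
        inputs = series[t - lags:t]  # the `lags` values just before time t
        target = series[t]           # the value to predict at time t
        cases.append((inputs, target))
    return cases


# Hypothetical monthly sales figures
sales = [10, 12, 13, 15, 18]
cases = make_lagged_cases(sales, lags=2)
```

Each resulting case has continuous inputs and a continuous target, so a Regression Tree can be trained on the cases and then applied to the most recent values to forecast the next one.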

This article provides only an introductory outline. For details on the implementation and customization options of each of these powerful algorithms, read the latest MSDN documentation. For SQL Server 2014, see http://technet.microsoft.com/en-us/library/ms175595.aspx