SSAS: Explaining the GetClusterCharacteristics Stored Procedure for Sequence Clustering Models
Original content by Allan Mitchell, SQL Server MVP and principal consultant at Konesans Limited in the UK. You can find Allan’s website at www.SQLIS.com*.
*NOTE: Alternatively, you can download a Word file of this whitepaper at TechNet Gallery. This is for several potential reasons: (1) To easily print it. (2) To see the original version. (3) To customize it for personal use. (4) To share an offline copy.
Introduction
Beginning with SQL Server 2005 Analysis Services, Microsoft introduced a new range of modeling algorithms to be used for [[data mining]], including Sequence Clustering. Along with these algorithms, Microsoft also provides viewers that allow you to see exactly what is going on inside the models. If you run SQL Server Profiler while using the viewers, you will discover that a number of stored procedures are being called behind the scenes. In this article, I take a closer look at one of those procedures, GetClusterCharacteristics, as used against a Sequence Clustering model. This procedure provides a lot of useful information about a clustering model that I can leverage to better understand the data and my model.
In this article:
The Problem
The stored procedures that are called when using the viewers are internal to Microsoft, and documentation is sparse. The procedures provide a lot of useful information and offer insight into what is happening behind the scenes in the viewers. To help you understand this information, I’ll step through the use of the GetClusterCharacteristics stored procedure.
For this article, I am using SQL Server 2008, but the behaviors of the stored procedure, the viewer, and of SQL Server Profiler are also valid for SQL Server 2005. The model used is part of the sample Analysis Services project, which can be found here:
http://msftasprodsamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=34035
Sequence Clustering Models
Before we go any further, I want to give an overview of Sequence Clustering models. This will help in understanding how things fit together and what Analysis Services Data Mining is doing.
Sequence Clustering is employed when we want to analyze events that have an inherent order to them, examples being webpage page navigation and protein structures, and then we cluster/group them based on thresholds within the sequence patterns.
Sequence clustering is really a mash up of two techniques. The first thing that happens is the sequences in the data are analyzed for patterns. This is done through the use of nth order Markov chains, and you can follow the URL below for a fuller explanation.
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf
After the chains in our data have been analyzed, a clustering technique is applied to those chains in order to group them together. The two choices of clustering techniques you have in SQL Server 2008 are K Means and Expectation Maximization (the default). A further discussion of clustering can be found here:
http://gauss.nmsu.edu/~lludeman/video/ch6pr.html
Although many Data Mining software providers provide sequence clustering models, the implementation of them differs slightly. Here is the technical reference of Microsoft’s implementation:
http://msdn.microsoft.com/en-us/library/cc645866.aspx
The Procedure
The GetClusterCharacteristics procedure takes 3 parameters:
- Model Name. The name of the mining model.
- Node Name. The name of the node you want to query. For help in understanding nodes, see http://msdn.microsoft.com/en-us/library/aa178243%28SQL.80%29.aspx
- Probability Threshold. A numeric value representing the minimum acceptable probability. Rows that do not meet this minimum are filtered out.
We can call the stored procedure from the DMX query editor in SQL Server 2008 Management Studio like this:
CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('Sequence Clustering','14',0.0005) |
Query 1: Calling the stored procedure
The only parameter that will need a little explaining is the second parameter, Node Name.
The Node Name parameter can be retrieved by executing the following query in the DMX query editor in Management Studio.
SELECT NODE_CAPTION, NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENT |
Query 2: Getting the unique node name
Because the results from Query 2 are potentially very large, I am going to show only a few rows, but this should give us enough information to understand what the query returns.
NODE_CAPTION | NODE_UNIQUE_NAME |
All | 0 |
Cluster 1 | 1 |
Cluster 2 | 2 |
Cluster 3 | 3 |
Cluster 4 | 4 |
Cluster 5 | 5 |
Cluster 6 | 6 |
Cluster 7 | 7 |
Cluster 8 | 8 |
Cluster 9 | 9 |
Cluster 10 | 10 |
Cluster 11 | 11 |
Cluster 12 | 12 |
Cluster 13 | 13 |
Cluster 14 | 14 |
Cluster 15 | 15 |
Sequence level for cluster 16 | 1081327 |
Transition row for sequence state 0 | 1081328 |
Transition row for sequence state 1 | 1081329 |
Transition row for sequence state 2 | 1081330 |
Transition row for sequence state 3 | 1081331 |
Transition row for sequence state 4 | 1081332 |
Table 1: Results of Query 2
For this article, I am interested in looking at the characteristics of cluster 14 from the Sequence Clustering model; so I can refine my previous query or simply look through the results to find the row in which I am interested. Here is the refined query:
SELECT NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENTWHERE NODE_CAPTION = 'Cluster 14' |
Query 3: Using the node name in the system stored procedure
Now that I have shown you how to retrieve the Node Name parameter, we can have a look at what happens when we run the stored procedure. Below are the results after running the stored procedure.
Explanation Grouping | Attributes | Values | Frequency | Support |
1 | Model.Transitions | [Start] -> Women's Mountain Shorts | 1 | 447 |
2 | Model | Women's Mountain Shorts | 0.657371 | 294 |
3 | Model.Transitions | 0.342629 | 233 | |
2 | Model | Long-Sleeve Logo Jersey | 0.169987 | 76 |
4 | Model.Transitions | Women's Mountain Shorts, Long-Sleeve Logo Jersey | 0.167331 | 114 |
2 | Model | Cycling Cap | 0.108898 | 49 |
4 | Model.Transitions | Women's Mountain Shorts, Cycling Cap | 0.087649 | 60 |
2 | Model | Racing Socks | 0.063745 | 28 |
4 | Model.Transitions | Women's Mountain Shorts, Racing Socks | 0.059761 | 41 |
Table 2: Results from running the Stored Procedure
The Explanation Grouping column is for reference only. Throughout this article I’ll refer to the Explanation Groupings when discussing the specific stored procedure results. Something else worth pointing out is the word “Model” in the attributes column. In this particular context it means State, or what my current value is.
Grouping 1
I always like to start things from the beginning, so I will begin my investigation by looking at the item from Table 2 that identifies a sequence start point (Explanation Grouping 1). A start location means the start of a sequence. In the table above, there is one possible starting position for this sequence cluster, and this can be identified by looking in the Values column and finding the row or rows that contain the value [Start].
To learn more about how the rows from the stored procedure are generated for Grouping 1, we need to query against the mining model’s content:
SELECT FLATTENED NODE_UNIQUE_NAME,(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE NODE_TYPE = 13
AND [PARENT_UNIQUE_NAME] = 14 |
Query 4: Identifying nodes in a model that are sequences
This query has a nested table. Nested or hierarchical data sets are common in data mining but can be slightly confusing when you first encounter them. First, look at the outer part of the query. I am selecting from the [Sequence Clustering] model and asking for a NODE_TYPE of 13. This NODE_TYPE is the type that holds the first states of sequences. In my query, I also ask for nodes where PARENT_UNIQUE_NAME = 14. In other words, I want to get all the nodes of a particular type, in this case 13, that come from looking at Cluster 14. It is extremely likely that the UNIQUE_NAME property of the clusters themselves is the cluster number; in this case the number is 14.
Next, look at the inner part of the query, which reads from the nested table “NODE_DISTRIBUTION” that is returned as part of the [Model].CONTENT request. This is shown in Query 5. When taken in isolation like this, it is nothing remarkable, and we start to wonder what all the fuss is about; we simply take some attributes from a nested table.
(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support], [Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as t |
Query 5: Selecting attributes from the NODE_DISTRIBUTION nested table
This portion of the query asks for the Value of the first sequence states, the support/count of cases for the state, and the probability of the state. Below are the abridged results.
NODE_UNIQUE_NAME | t.Product 1 | t.Sequence Support | t.Sequence Probability |
950257 | Touring-2000 | 0 | 1E-14 |
950257 | Touring-3000 | 0 | 1E-14 |
950257 | Water Bottle | 0 | 1E-14 |
950257 | Women's Mountain Shorts | 446.51663663143 | 0.99999999999999 |
Table 3: Results of Query 5
The values for Support and Probability in this query tie in nicely with the values for Support and Frequency in Table 2, Explanation Grouping 1.
Grouping 2
The next values from Table 2 that we are going to look at are in Explanation Grouping 2. These rows show us the distribution and probabilities of reaching a particular state based on the cluster as a whole. They tell us of the possible states within a cluster, states being cycling items, what the support/number of cases is within a particular cluster. The query that follows is slightly different from our first query, because we’re interested in all possible states for the cluster—not just the first states. “[NODE_UNIQUE_NAME] = 14“ again refers to Cluster 14 in our Sequence Clustering model.
SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS [Product],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Women''s Mountain Shorts' OR ATTRIBUTE_VALUE = 'Women''s Mountain Shorts' OR ATTRIBUTE_VALUE = 'Racing Socks' OR ATTRIBUTE_VALUE = 'Cycling Cap' OR ATTRIBUTE_VALUE = 'Long-Sleeve Logo Jersey') as tFROM [Sequence Clustering].CONTENTWHERE [NODE_UNIQUE_NAME] = 14 |
Query 6: Restricted Query showing Support and Probability of reaching given states in a specific cluster
Because we know the names of the states in which we are interested (look at the return results from the stored procedure in grouping 2), we have restricted the results returned from the nested table, NODE_DISTRIBUTION, to just those particular states. If we want to see all the possible states, then we can change the query to the following.
SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS [Product],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE [NODE_UNIQUE_NAME] = 14 |
Query 7: Query 6, but it shows all the possible states in Cluster
The results from query 6 are shown below.
t.Product | t.Sequence Support | t.Sequence Probability |
Cycling Cap | 48.62465 | 0.108898 |
Long-Sleeve Logo Jersey | 75.9019 | 0.169987 |
Racing Socks | 28.46321 | 0.063745 |
Women's Mountain Shorts | 293.5269 | 0.657371 |
Table 4: Results of Query 6
Again these results confirm what is returned from the stored procedure. Adding the probabilities together, we find they sum to 100. This is correct and can be confirmed by running query 7 above. All the other possible states have probabilities that are extremely remote; not impossible but nearly. If you want a more graphical way of looking at this information, then go to the Mining Model viewer for your mining model and switch to the Cluster Profiles tab.
Grouping 3
Now let’s move on to explaining the values from Table 2 that are in Explanation Grouping 3. You will immediately notice that this single row differs from all the other rows in the table; it does not have a value in the Values column. We can now look into our model and try to understand why there is no value and what the row is trying to tell us.
We’ll begin by querying against first-state items to find items that our model says are start points for sequences in our chosen cluster. To make this easier on you, I am repeating the query we used in Query 4 to do this.
SELECT FLATTENED NODE_UNIQUE_NAME,(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE NODE_TYPE = 13AND [PARENT_UNIQUE_NAME] = 14 |
Query 8: Find first States in our model for cluster 14
The results of this query were shown abridged in Table 3, so I will not copy them here. The query returns the possible first states for Cluster 14. As we can see, there is only one real possibility for a start state in this cluster: Women’s Mountain Shorts. If you look through the result of Query 8, you will notice that the states are listed alphabetically, except for the first state, which is Missing, and this is always listed first. The location of our first-state row in the table of results returned from Query 8 is very important. To find out the state of Women’s Mountain Shorts, we need to count down the rows until we reach Women’s Mountain Shorts. Counting should start at 0; remember Row 0 is always the Missing state. In our example, Women’s Mountain Shorts is row number 37. We’ll use this value to discover the possible seconds from this first state.
Discovering the second state is a two-part process: first we’ll identify the node that holds the second state for sequence state 37 (Query 9), and then we’ll query that node for the second state values (Query 10). For Query 9, we are also going to need the NODE_UNIQUE_NAME from Table 2 as input for the PARENT_UNIQUE_NAME restriction.
SELECT NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENTWHERE NODE_DESCRIPTION = 'Transition row for sequence state 37'AND [PARENT_UNIQUE_NAME] = '950257' |
Query 9: Finding the node that holds second states of the sequence, where the first state is Women's Mountain Shorts
Query 9 identifies node 950295 as holding the second states. We use this information in Query 10 to retrieve the possible second states.
SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS Product2,[Support] AS [P2 Support],[Probability] AS [P2 Probability]FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS tFROM [Sequence Clustering].CONTENTWHERE NODE_UNIQUE_NAME = '950295' |
Query 10: Retrieving the row from our model, where Women's Mountain Shorts is the first state and “Missing” is the second
t.Product2 | t.P2 Support | t.P2 Probability |
Missing | 232.7299 | 0.521212 |
Table 5: Query 10 Results
We now have the support (the number of cases) for the second row of the “missing” states in Table 2. The probability is not what was returned from the original procedure, but I’ll explain that in a moment.
If we had more than one possible starting state for our cluster, then we would repeat this process, but we would put in the respective values from the other start items.
So what about the probability returned by query 10? The following table shows the “Missing” row from Table 2 to make it easier to refer back to the stored procedure that we originally executed.
Attributes | Values | Frequency | Support |
Model.Transitions | 0.342629 | 233 |
Table 6: ”Missing” row from Table 2
To understand why this happens, let’s take a closer look at the way the frequency values are calculated for Query 10:
(Support from Query 10 * probability from query 10) / (Support from Query 10)
(232.7299 * .521212) / (232.7299) = 0.521212
So why is our value from the stored procedure so much lower? The formula below explains the frequency returned from the stored procedure.
P(A)*P(B|A)
In order to get our probability in the results from the stored procedure, we need to take the probability for Women’s Mountain Shorts in our cluster, which is 0.657371, and then multiply that by the value for Missing from Query 10, which is 0.521212. We take Women’s Mountain Shorts because it is identified as the only start position for a sequence in this cluster.
0.657371 * 0.521212 = 0.342629
This matches exactly with what is in Table 2.
Grouping 4
Women’s Mountain Shorts is the only state in this cluster that is followed by anything other than a missing value. To find out more about transitions from this first state that do not result in Missing, I am going to use a slightly modified version of Query 10. The restrictions on the next table, NODE_DISTRIBUTION, are taken from Explanation Grouping 4 in Table 2.
SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS Product2,[Support] AS [P2 Support],[Probability] AS [P2 Probability]FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Racing Socks' OR ATTRIBUTE_VALUE = 'Cycling Cap' OR ATTRIBUTE_VALUE = 'Long-Sleeve Logo Jersey') AS tFROM [Sequence Clustering].CONTENTWHERE NODE_UNIQUE_NAME = '950295' |
Query 11: Looking at the second state of “Mountain Tire Tube,” where “ML Mountain Tire” was the first
Here are the results:
t.Product2 | t.P2 Support | t.P2 Probability |
Cycling Cap | 59.5355516 | 0.133333 |
Long-Sleeve Logo Jersey | 113.65878 | 0.254545 |
Racing Socks | 40.5924215 | 0.090909 |
Table 7: Results of query 11
The stored procedure returns a Support attribute that matches Table 2. To compare the validity of the Probability attribute, we calculate the frequency in Table 2 by applying the same logic we used in Grouping 3. We know from Table 2 that Women’s Mountain Shorts has a probability of 0.657371, which we will take to be a constant.
t.Product2 | t.P2 Support | t.P2 Probability | Constant | P(A)*P(B|A) |
Cycling Cap | 59.53555 | 0.133333 | 0.657371 | 0.087649 |
Long-Sleeve Logo Jersey | 113.6588 | 0.254545 | 0.657371 | 0.167331 |
Racing Socks | 40.59242 | 0.090909 | 0.657371 | 0.059761 |
Table 8: Comparing the validity
Conclusion
The stored procedures used by the viewers give us a lot of information, and with a little bit of digging, we can really start to understand our models. While the cluster used in this example is relatively simple to view, the information we have learned from looking at it can be applied across more complex clusters. Thank you for reading this article. Please send us your feedback to let us know how useful this was or if you have any other comments.
Author of the original article (before wiki edits): Allan Mitchell is a SQL Server MVP based in the UK. He specializes in the Microsoft SQL Server BI stack with a passion for Data Mining and SQL Server Integration Services. You can find Allan’s website at www.SQLDTS.com.
See Also