다음을 통해 공유


SSAS: Explaining the GetClusterCharacteristics Stored Procedure for Sequence Clustering Models

Original content by Allan Mitchell, SQL Server MVP and principal consultant at Konesans Limited in the UK. You can find Allan’s website at www.SQLIS.com*.

*NOTE: Alternatively, you can download a Word file of this whitepaper at TechNet Gallery. This is for several potential reasons: (1) To easily print it. (2) To see the original version. (3) To customize it for personal use. (4) To share an offline copy.   

Introduction

Beginning with SQL Server 2005 Analysis Services, Microsoft introduced a new range of modeling algorithms to be used for [[data mining]], including Sequence Clustering. Along with these algorithms, Microsoft also provides viewers that allow you to see exactly what is going on inside the models. If you run SQL Server Profiler while using the viewers, you will discover that a number of stored procedures are being called behind the scenes. In this article, I take a closer look at one of those procedures, GetClusterCharacteristics, as used against a Sequence Clustering model. This procedure provides a lot of useful information about a clustering model that I can leverage to better understand the data and my model.

In this article:


The Problem

The stored procedures that are called when using the viewers are internal to Microsoft, and documentation is sparse. The procedures provide a lot of useful information and offer insight into what is happening behind the scenes in the viewers. To help you understand this information, I’ll step through the use of the GetClusterCharacteristics stored procedure.

For this article, I am using SQL Server 2008, but the behaviors of the stored procedure, the viewer, and of SQL Server Profiler are also valid for SQL Server 2005. The model used is part of the sample Analysis Services project, which can be found here:

http://msftasprodsamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=34035

 ↑ Back to top

Sequence Clustering Models

Before we go any further, I want to give an overview of Sequence Clustering models. This will help in understanding how things fit together and what Analysis Services Data Mining is doing.

Sequence Clustering is employed when we want to analyze events that have an inherent order to them, examples being webpage page navigation and protein structures, and then we cluster/group them based on thresholds within the sequence patterns.

Sequence clustering is really a mash up of two techniques. The first thing that happens is the sequences in the data are analyzed for patterns. This is done through the use of nth order Markov chains, and you can follow the URL below for a fuller explanation.

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf

After the chains in our data have been analyzed, a clustering technique is applied to those chains in order to group them together. The two choices of clustering techniques you have in SQL Server 2008 are K Means and Expectation Maximization (the default). A further discussion of clustering can be found here:

http://gauss.nmsu.edu/~lludeman/video/ch6pr.html

Although many Data Mining software providers provide sequence clustering models, the implementation of them differs slightly. Here is the technical reference of Microsoft’s implementation:

http://msdn.microsoft.com/en-us/library/cc645866.aspx

 ↑ Back to top


The Procedure

The GetClusterCharacteristics procedure takes 3 parameters:

  • Model Name. The name of the mining model.
  • Node Name. The name of the node you want to query. For help in understanding nodes, see http://msdn.microsoft.com/en-us/library/aa178243%28SQL.80%29.aspx
  • Probability Threshold. A numeric value representing the minimum acceptable probability. Rows that do not meet this minimum are filtered out.

We can call the stored procedure from the DMX query editor in SQL Server 2008 Management Studio like this:

CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('Sequence Clustering','14',0.0005)

Query 1: Calling the stored procedure

The only parameter that will need a little explaining is the second parameter, Node Name.

The Node Name parameter can be retrieved by executing the following query in the DMX query editor in Management Studio.

SELECT NODE_CAPTION, NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENT

Query 2: Getting the unique node name

Because the results from Query 2 are potentially very large, I am going to show only a few rows, but this should give us enough information to understand what the query returns.

NODE_CAPTION NODE_UNIQUE_NAME
All 0
Cluster 1 1
Cluster 2 2
Cluster 3 3
Cluster 4 4
Cluster 5 5
Cluster 6 6
Cluster 7 7
Cluster 8 8
Cluster 9 9
Cluster 10 10
Cluster 11 11
Cluster 12 12
Cluster 13 13
Cluster 14 14
Cluster 15 15
Sequence level for cluster 16 1081327
Transition row for sequence state 0 1081328
Transition row for sequence state 1 1081329
Transition row for sequence state 2 1081330
Transition row for sequence state 3 1081331
Transition row for sequence state 4 1081332

Table 1: Results of Query 2

For this article, I am interested in looking at the characteristics of cluster 14 from the Sequence Clustering model; so I can refine my previous query or simply look through the results to find the row in which I am interested. Here is the refined query:

SELECT NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENTWHERE NODE_CAPTION = 'Cluster 14'

Query 3: Using the node name in the system stored procedure

Now that I have shown you how to retrieve the Node Name parameter, we can have a look at what happens when we run the stored procedure. Below are the results after running the stored procedure.

Explanation Grouping Attributes Values Frequency Support
1 Model.Transitions [Start] -> Women's Mountain Shorts 1 447
2 Model Women's Mountain Shorts 0.657371 294
3 Model.Transitions   0.342629 233
2 Model Long-Sleeve Logo Jersey 0.169987 76
4 Model.Transitions Women's Mountain Shorts, Long-Sleeve Logo Jersey 0.167331 114
2 Model Cycling Cap 0.108898 49
4 Model.Transitions Women's Mountain Shorts, Cycling Cap 0.087649 60
2 Model Racing Socks 0.063745 28
4 Model.Transitions Women's Mountain Shorts, Racing Socks 0.059761 41

Table 2: Results from running the Stored Procedure

The Explanation Grouping column is for reference only. Throughout this article I’ll refer to the Explanation Groupings when discussing the specific stored procedure results. Something else worth pointing out is the word “Model” in the attributes column. In this particular context it means State, or what my current value is.

 ↑ Back to top

Grouping 1

I always like to start things from the beginning, so I will begin my investigation by looking at the item from Table 2 that identifies a sequence start point (Explanation Grouping 1). A start location means the start of a sequence. In the table above, there is one possible starting position for this sequence cluster, and this can be identified by looking in the Values column and finding the row or rows that contain the value [Start].

To learn more about how the rows from the stored procedure are generated for Grouping 1, we need to query against the mining model’s content:

SELECT FLATTENED NODE_UNIQUE_NAME,(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE NODE_TYPE = 13

AND [PARENT_UNIQUE_NAME] = 14

Query 4: Identifying nodes in a model that are sequences

This query has a nested table. Nested or hierarchical data sets are common in data mining but can be slightly confusing when you first encounter them. First, look at the outer part of the query. I am selecting from the [Sequence Clustering] model and asking for a NODE_TYPE of 13. This NODE_TYPE is the type that holds the first states of sequences. In my query, I also ask for nodes where PARENT_UNIQUE_NAME = 14. In other words, I want to get all the nodes of a particular type, in this case 13, that come from looking at Cluster 14. It is extremely likely that the UNIQUE_NAME property of the clusters themselves is the cluster number; in this case the number is 14.

Next, look at the inner part of the query, which reads from the nested table “NODE_DISTRIBUTION” that is returned as part of the [Model].CONTENT request. This is shown in Query 5. When taken in isolation like this, it is nothing remarkable, and we start to wonder what all the fuss is about; we simply take some attributes from a nested table.

(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support], [Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as t

Query 5: Selecting attributes from the NODE_DISTRIBUTION nested table

This portion of the query asks for the Value of the first sequence states, the support/count of cases for the state, and the probability of the state. Below are the abridged results.

NODE_UNIQUE_NAME t.Product 1 t.Sequence Support t.Sequence Probability
950257 Touring-2000 0 1E-14
950257 Touring-3000 0 1E-14
950257 Water Bottle 0 1E-14
950257 Women's Mountain Shorts 446.51663663143 0.99999999999999

Table 3: Results of Query 5

The values for Support and Probability in this query tie in nicely with the values for Support and Frequency in Table 2, Explanation Grouping 1.

 ↑ Back to top

Grouping 2

The next values from Table 2 that we are going to look at are in Explanation Grouping 2. These rows show us the distribution and probabilities of reaching a particular state based on the cluster as a whole. They tell us of the possible states within a cluster, states being cycling items, what the support/number of cases is within a particular cluster. The query that follows is slightly different from our first query, because we’re interested in all possible states for the cluster—not just the first states. “[NODE_UNIQUE_NAME] = 14“ again refers to Cluster 14 in our Sequence Clustering model. 

SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS [Product],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION       WHERE ATTRIBUTE_VALUE = 'Women''s Mountain Shorts'       OR ATTRIBUTE_VALUE = 'Women''s Mountain Shorts'       OR ATTRIBUTE_VALUE = 'Racing Socks'       OR ATTRIBUTE_VALUE = 'Cycling Cap'       OR ATTRIBUTE_VALUE = 'Long-Sleeve Logo Jersey') as tFROM [Sequence Clustering].CONTENTWHERE [NODE_UNIQUE_NAME] = 14

Query 6: Restricted Query showing Support and Probability of reaching given states in a specific cluster

Because we know the names of the states in which we are interested (look at the return results from the stored procedure in grouping 2), we have restricted the results returned from the nested table, NODE_DISTRIBUTION, to just those particular states. If we want to see all the possible states, then we can change the query to the following.

SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS [Product],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE [NODE_UNIQUE_NAME] = 14

Query 7: Query 6, but it shows all the possible states in Cluster

The results from query 6 are shown below.

t.Product t.Sequence Support t.Sequence Probability
Cycling Cap 48.62465 0.108898
Long-Sleeve Logo Jersey 75.9019 0.169987
Racing Socks 28.46321 0.063745
Women's Mountain Shorts 293.5269 0.657371

Table 4: Results of Query 6

Again these results confirm what is returned from the stored procedure. Adding the probabilities together, we find they sum to 100. This is correct and can be confirmed by running query 7 above. All the other possible states have probabilities that are extremely remote; not impossible but nearly. If you want a more graphical way of looking at this information, then go to the Mining Model viewer for your mining model and switch to the Cluster Profiles tab.

 ↑ Back to top

Grouping 3

Now let’s move on to explaining the values from Table 2 that are in Explanation Grouping 3. You will immediately notice that this single row differs from all the other rows in the table; it does not have a value in the Values column. We can now look into our model and try to understand why there is no value and what the row is trying to tell us.

We’ll begin by querying against first-state items to find items that our model says are start points for sequences in our chosen cluster. To make this easier on you, I am repeating the query we used in Query 4 to do this.

SELECT FLATTENED NODE_UNIQUE_NAME,(SELECT ATTRIBUTE_VALUE AS [Product 1],[Support] AS [Sequence Support],[Probability] AS [Sequence Probability]FROM NODE_DISTRIBUTION) as tFROM [Sequence Clustering].CONTENTWHERE NODE_TYPE = 13AND [PARENT_UNIQUE_NAME] = 14

Query 8: Find first States in our model for cluster 14

The results of this query were shown abridged in Table 3, so I will not copy them here. The query returns the possible first states for Cluster 14. As we can see, there is only one real possibility for a start state in this cluster: Women’s Mountain Shorts. If you look through the result of Query 8, you will notice that the states are listed alphabetically, except for the first state, which is Missing, and this is always listed first. The location of our first-state row in the table of results returned from Query 8 is very important. To find out the state of Women’s Mountain Shorts, we need to count down the rows until we reach Women’s Mountain Shorts. Counting should start at 0; remember Row 0 is always the Missing state. In our example, Women’s Mountain Shorts is row number 37. We’ll use this value to discover the possible seconds from this first state.

Discovering the second state is a two-part process: first we’ll identify the node that holds the second state for sequence state 37 (Query 9), and then we’ll query that node for the second state values (Query 10). For Query 9, we are also going to need the NODE_UNIQUE_NAME from Table 2 as input for the PARENT_UNIQUE_NAME restriction.

SELECT NODE_UNIQUE_NAMEFROM [Sequence Clustering].CONTENTWHERE NODE_DESCRIPTION = 'Transition row for sequence state 37'AND [PARENT_UNIQUE_NAME] = '950257'

Query 9: Finding the node that holds second states of the sequence, where the first state is Women's Mountain Shorts

Query 9 identifies node 950295 as holding the second states. We use this information in Query 10 to retrieve the possible second states.

SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS Product2,[Support] AS [P2 Support],[Probability] AS [P2 Probability]FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS tFROM [Sequence Clustering].CONTENTWHERE NODE_UNIQUE_NAME = '950295'

Query 10: Retrieving the row from our model, where Women's Mountain Shorts is the first state and “Missing” is the second 

t.Product2 t.P2 Support t.P2 Probability
Missing 232.7299 0.521212

Table 5: Query 10 Results

We now have the support (the number of cases) for the second row of the “missing” states in Table 2. The probability is not what was returned from the original procedure, but I’ll explain that in a moment.

If we had more than one possible starting state for our cluster, then we would repeat this process, but we would put in the respective values from the other start items.

So what about the probability returned by query 10? The following table shows the “Missing” row from Table 2 to make it easier to refer back to the stored procedure that we originally executed.

Attributes Values Frequency Support
Model.Transitions   0.342629 233

Table 6: ”Missing” row from Table 2

To understand why this happens, let’s take a closer look at the way the frequency values are calculated for Query 10:

(Support from Query 10 * probability from query 10) / (Support from Query 10)

(232.7299 * .521212) / (232.7299) = 0.521212

So why is our value from the stored procedure so much lower? The formula below explains the frequency returned from the stored procedure.
P(A)*P(B|A)

In order to get our probability in the results from the stored procedure, we need to take the probability for Women’s Mountain Shorts in our cluster, which is 0.657371, and then multiply that by the value for Missing from Query 10, which is 0.521212. We take Women’s Mountain Shorts because it is identified as the only start position for a sequence in this cluster.

0.657371 * 0.521212 = 0.342629

This matches exactly with what is in Table 2.

 ↑ Back to top

Grouping 4

Women’s Mountain Shorts is the only state in this cluster that is followed by anything other than a missing value. To find out more about transitions from this first state that do not result in Missing, I am going to use a slightly modified version of Query 10. The restrictions on the next table, NODE_DISTRIBUTION, are taken from Explanation Grouping 4 in Table 2.

SELECT FLATTENED(SELECT ATTRIBUTE_VALUE AS Product2,[Support] AS [P2 Support],[Probability] AS [P2 Probability]FROM NODE_DISTRIBUTION       WHERE ATTRIBUTE_VALUE = 'Racing Socks'       OR ATTRIBUTE_VALUE = 'Cycling Cap'       OR ATTRIBUTE_VALUE = 'Long-Sleeve Logo Jersey') AS tFROM [Sequence Clustering].CONTENTWHERE NODE_UNIQUE_NAME = '950295'

Query 11: Looking at the second state of “Mountain Tire Tube,” where “ML Mountain Tire” was the first

Here are the results:

t.Product2 t.P2 Support t.P2 Probability
Cycling Cap 59.5355516 0.133333
Long-Sleeve Logo Jersey 113.65878 0.254545
Racing Socks 40.5924215 0.090909

Table 7: Results of query 11

The stored procedure returns a Support attribute that matches Table 2. To compare the validity of the Probability attribute, we calculate the frequency in Table 2 by applying the same logic we used in Grouping 3. We know from Table 2 that Women’s Mountain Shorts has a probability of 0.657371, which we will take to be a constant.

t.Product2 t.P2 Support t.P2 Probability Constant P(A)*P(B|A)
Cycling Cap 59.53555 0.133333 0.657371 0.087649
Long-Sleeve Logo Jersey 113.6588 0.254545 0.657371 0.167331
Racing Socks 40.59242 0.090909 0.657371 0.059761

Table 8: Comparing the validity

 ↑ Back to top


Conclusion

The stored procedures used by the viewers give us a lot of information, and with a little bit of digging, we can really start to understand our models. While the cluster used in this example is relatively simple to view, the information we have learned from looking at it can be applied across more complex clusters. Thank you for reading this article. Please send us your feedback to let us know how useful this was or if you have any other comments.

Author of the original article (before wiki edits): Allan Mitchell is a SQL Server MVP based in the UK. He specializes in the Microsoft SQL Server BI stack with a passion for Data Mining and SQL Server Integration Services. You can find Allan’s website at www.SQLDTS.com.

 ↑ Back to top


See Also