Data Mining not dead, but possibly forgotten
With the launch of SQL Server 2008 R2 and SQL Server Denali ctp3 there has been ever more BI functionality built in but nothing has been mentioned about the strange world of data mining. Data mining has been in SQL Server since analysis services was launched but I have rarely seen it being used - it doesn’t seem to have entered mainstream consciousness in the way that some of the cool reporting front end tools have possibly because it smacks a little of black magic, or is seen as unnecessary or hard.
So here’s a quick post to get you thinking and hopefully encourage you to take it for a spin.
What exactly is Data Mining?
The simple answer is the ability to derive insights from your data that you didn’t know were there. Classic examples of use are customers who bought this also bought that (what is called basket analysis in the retail world) and trying to understand what factors influence a purchase.
How does it work?
There are a number of approaches, possibly the most famous being neural networking. In fact there are six different techniques built into analysis services each designed to do a particular job:
- Association is used for that basket analysis I mention and recommending additional items customers may want to purchase.
- Clustering groups such things as customers by detecting what is not obvious form casual observation. You may have heard a story a few years ago of the banks segregating customers into cherries (good customers) through to lemons (bad customers) and this might been done better using clustering rather than a trivial exercise around what revenue they bring in.
- Decision Trees try and predict the outcome of an attribute by understanding what factors determine the value of that attribute. The way the prediction works depends on what is being predicted and whether this is discrete (values such as yes or no, red green or blue and so on) or continuous (e.g. a number like value or price). So this could be used to work out what factors make a up a decision to purchase e.g. age gender, income, homeowner and so on.
- Linear Regression. You may remember this one form matches at school, the process of trying to draw a line through a series of dots to best represent the trend of those dots on a graph. This is used by the decision trees algorithm where the predicted column is continuous.
- Naive Bayes is the method for determine which customers would most likely respond to a marketing campaign, the naive term means that no relationship is assumed between any of the factors affecting the outcome.
- Logistic Regression. Every week you’ll hear something like “women drivers over 40 are 50% less likely to have a car accident” and this would be the algorithm that would be used to discover that relationship. It can also be used to discover which are the most important factors in determining an outcome.
- Sequence Clustering is similar to clustering but the data contains a sequence of events and this enables you to explore such things as a users navigation of a web site (known as click stream analysis).
- Time Series is used for forecasting a continuous value such as sales or profit.
- Neural Network is the non of the above option for really complicated input data. The mechanism for determining the outcome mimics the way neurons work in the brain and doesn’t actually show you how its results are derived at. Text mining is one example that uses this technique.
In all of these techniques you identify a sample set of data which has the values of the object you are trying to predict, from which you create a model. You then apply this model to live data to do your prediction and analysis.
How do I do this?
There are three ways of creating models:
1. You can create mining models in BI development studio as part of an analysis services project in SQL Server 2008 R2 or in the beta of the new version Project Denali and there are examples mining models in the sample adventure works analysis service project from Codeplex:
2. You can code a model directly in Data Mining Expressions (DMX) which might be appropriate for embedding models into an application such as an ecommerce site.
3. There’s an Excel add-in for data mining :
This works by taking the data in a spreadsheet and then posting it to an instance of analysis services for processing and then returning the results back to Excel. Originally designed for Excel 2007 it does work in Excel 2010 but only for 32bit installations, and while it can’t use data in the PowerPivot window in PowerPivot for Excel it can consume data in Excel pivot tables derived from PowerPivot data. Note that when I mention analysis services here, I mean classic analysis services in olap mode not the new shiny tabular based analysis services that also exists in SQL Server Denali.
Why Bother?
First of all you are probably licensed for all of this now, you’ll have SQL Server standard or higher around somewhere you can use this as the mining engine and a copy of Excel for the front end (the excel add-in for data mining isn’t a separate license). However this might all seem a bit abstract and you might have a hard time convincing yourself never mind your users that this is worth the effort. One idea I have is what if you took the data out of some of the data you have access to see how this stuff works, for example data from System Center, or whatever data centre management tools you have (dare I mention VSphere?) your helpdesk system or asset management tools to get insight on what is really going on. This might be a useful project in it’s own right but as the data is not sensitive you can share it as an example of what’s possible and possibly impress your manager as well as the business owners.
Comments
Anonymous
August 07, 2011
The data mining add ins for office do not function last time I checked in Office 2010. Hoping something comes out for PowerPivotAnonymous
August 07, 2011
Ron That screenshot in the post above is Data mning and PowerPivot in Excel 2010, albeit 32bit Excel, so not sure what porblems you are having. However data mining built in to PowerPivot or BISM would be nice so post your thoughts on Connect. AndrewAnonymous
August 08, 2011
I would love to be able to use this feature. I've got no luck with 64 bit Excel 2010. Is there a connect item for this? The lack of support me that Microsoft might be forgetting about this feature too. The other options open to me are learning DMX (Is it as hard as MDX? Because that's daunting) or creating a model in BIDS. How hard is that to learn do you think?Anonymous
August 10, 2011
Michael I guess you would log this under Denali. I completely agree that the lack of and x64 version needs to be addressed. Although PowerPivot and Data Mining sit side by side in Excel they don't really work together. My other wish is for a decoupled data mining tool that doesn't sit in BI dev studio, much as we have with Report Builder 3. However your voice carries more weigth than mine as you are a customer so Connect away please and get your colleagues to vote as wellAnonymous
September 16, 2011
Hopefully I'm not hijacking someone's thread... I tried getting the existing Data Mining add-ins to work in Excel 2010 (worked fine in 2007), and it causes Excel to crash every time. I uninstalled 2010 and went back to 2007, and still the add-ins crash. Do I need to clean the registry or something... ??Anonymous
September 18, 2011
Pieter Not at all sure what the problem is but it will work (as you can see from my screenshot), and we do support it. However I am not a support guy so you'll need either check technet forums or rais a call with us to resolve your issue. AndrewAnonymous
November 18, 2011
The comment has been removedAnonymous
October 16, 2012
Not only is DMX dead (no changes since SQL 2005), but it looks like MDX is being abandoned too - the core devs have left MS and the focus has shifted to DAX - no new MDX features in SQL 2012.Anonymous
August 07, 2014
Adventure Works!Anonymous
August 07, 2014
Adventure Works!