Data Mining and Predictive Analytics
This page is in progress. Please hold off on making edits till I can add more content and clean it up a bit! |
Overview
Microsoft provides many products in support of business intelligence and analytics. Among these, SQL Server Analysis Services (SSAS) supports both the OLAP solutions that are central to business intelligence, and a rich set of [[data mining]] features that allow users to build advanced statistical models of their data and use those models to categorize data and make predictions based on that data.
If SQL Server data mining does not meet your needs, there are other modeling tools available to you, including extensions to Excel, third-party add-ins for Excel, T-SQL libraries, and of course code samples that you can customize. These may require some programming.
This page provides an answer to some of the frequently asked questions about SQL Server Data Mining, as well as links to other resources. If you are looking for the official content, please see the MSDN library.
This page is intended to provide a place where people who are enthusiastic (or even lukewarm!) about predictive analytics can share information, tools, samples, and other resources. So please feel welcome to contribute.
Note that there is a separate Wiki Community page for SQL Server Analysis Services and one for PowerPivot. And don't miss the DAX Resource Center if you are looking for formula tips!
For information on visualizations, check out the Wiki Community page for Power View.
Examples of Articles for this DM Wiki
Please feel welcome to add links to the following sections, create new articles, or or recommend articles that can be referenced from this Wiki.
Articles included here should touch on at least one of these subjects:
- Applications of data mining or predictive analytics, using any of the Microsoft tools (including Transact SQL!)
- Queries and applications using DMX and XMLA
- Predictive analytics using tabular or OLAP models
- Visualization as a way of modeling or interpreting statistical models
- Performance tuning and monitoring of data mining solutions
- Business cases and real-life scenarios for predictive analytics
- General discussion of how to model data for use in prediction
- Classical statistical modeling vs. data mining: are there differences?
- Use of AMO and interfaces such as OLEDB to programmatically work with data mining objects.
Predictive Modeling 101
Read this section if you are new to data mining or are looking for community resources.
I am new to data mining; what should I read to get started? (books)
Consider this an annotated bibliography of my favorites. There are probably other great books so add yours! These aren't technical books, but are targeted more at someone just getting interested in predictive analytics who wants a good grounding in the techniques and pitfalls.
[1] Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management
Michael Berry and Gordon Linoff.
Wiley, 2004.
Even if you don't work in any of the subtitled fields, you will find this book readable, applicable, and even enjoyable. They explain many of the common algorithms, but also explain the groundwork that must be laid before you do any kind of analysis, and walk you through understanding results and how yo can apply them in real-world scenarios.
[2] The Practice of Business Statistics: Using Data for Decisions
David Moore, George McCabe et all
W.H. Freeman, 2003.
Forget everything you learned in that college statistics class? Keep this book on your bookshelf (or better still, download their free library of samples) and you'll never be caught flat without an explanation of lurking variables or covariance. The explanations are clear, the graphs are excellent, and they painstakingly explain concepts and misconceptions.
[3] Data Preparation for Data Mining
Dorian Pyle.
Morgan Kaufman, 1999.
This book is an essential for all aspects of the data preparation process. It covers working with different types of dates and times, discrete vs. continuous data (what to bin and why and when), all kinds of aggregations and conversions, how to apply them and when, and how data preprocessing affects the results (=quality) of models. The author provides clear explanations of techniques such as differencing and normalization and missing data inference, and discusses the pros and cons of different approaches in the context of modeling.
[4] Data Mining with SQL Server 2008
Jamie MacLennan, Bogdan Crivat, ZhaoHui Tang
Wiley, 2009.
If you are using SQL Server data mining, you'll want this book, period. Jamie is a great writer and makes the concepts behind all the algorithms easy to understand. Bogdan has provided a huge amount of technical detail that will help you in creating custom DMX or maybe your own algorithms. Providers, data sources, AMO, XMLA, walkthroughs of each algorithms shipped with SQL Server... it's all here.
Where can I find videos or demos of data mining?
[1] DM Add-ins videos
http://msdn.microsoft.com/en-us/library/dd776389(SQL.100).aspx
This page lists a series of short videos by UE writers Michele and Mary that highlight the wizards in the Data Mining Add-ins for Excel. The videos were created for SQL Server 2008, but the examples are equally applicable to SQL Server 2008 R2 and 2012.
This videos demonstrates the Key Influencers tool and explains how it compares to Decision Trees.
http://www.analyticsinaction.com/analyze-key-influencers-tool-demonstration/
[2] Building or viewing models using BI Development Studio.
- Sequence clustering for fraud detection: http://technet.microsoft.com/en-us/sqlserver/jj149087.aspx
- Mining cube data: http://technet.microsoft.com/en-us/sqlserver/gg678592.aspx
This is a nice demo that shows how you might start in the Data Mining Add-ins for Excel and move into BI Development Studio:
http://www.analyticsinaction.com/creating-data-mining-structures-predictive-models-using-the-excel-add-in-for-sql-server-2008/
This video describes how to create and use lifts charts and cross-validation. These are important tools for testing models and for validating the representativeness of your data set. You can use these tools either in BI Development Studio (SQL Server Data Tools) or in the Data Mining Client for Excel.
http://www.analyticsinaction.com/using-lift-accuracygains-charts-to-evaluate-data-mining-models-in-sql-server-2008/
[3] Using data mining with PowerPivot
Data Mining MVP Mark Tabladillo created a series of videos on data mining with PowerPivot. He shows three different model types to illustrate how data mining complements and reinforces traditional BI.
- Category analysis (Part 1): http://www.youtube.com/watch?v=aiwrgEKwkHs&feature=results_video
- Promotions analysis (Part 2): http://www.youtube.com/watch?v=yWXF2nPNMXg&feature=relmfu
- Marriage scoring (Part 3): http://www.youtube.com/watch?v=6Vz8MX5ksvQ&feature=relmfu
Not a video, but a detailed walkthrough, with sample data, from Analysis Services PM Kasper deJonge. It demonstrates how to use results from a data mining model in PowerPivot and Power View:
http://www.powerpivotblog.nl/tutorial-optimize-a-sample-powerpivot-model-for-power-view-reports
[4] Data Mining Webcasts
Here are a few of the webcasts available on the Microsoft site.
BASIC
Intro to the add-ins: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032525594&culture=en-us
Your first DM Project: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032355531&culture=en-us
Building Adaptive applications with DM: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032356232&culture=en-us
Understanding the algorithms: part of a series by MVP Lynn Langit: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032516178&culture=en-us
Testing and Validating your data mining models: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032359364&culture=en-us
ADVANCED
Developer's Perspective on getting started with DM: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032297087&culture=en-us
Using Data mining in data quality workflow: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032359360&culture=en-us
Application Integration with the Excel DM ribbon: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032518358&culture=en-us
Want more?
This list of webcasts is older, but most are still available and still useful.
http://www.sqlserverdatamining.com/ssdm/Home/Webcasts/tabid/62/Default.aspx
Community Resources and Web pages
The official TechNet product page for SQL Server Data Mining is here. We try to put links to downloads and new videos here as soon as possible.
http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx
This site created by the original SQL Server Data Mining team has many resources. Be sure to look through the tips and samples!
http://www.sqlserverdatamining.com/
The data mining forum on MSDN is very active with lots of helpful folks answering questions.
http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/threads
Want to see some case studies or examples of data mining in action
First, check out the Microsoft Case Study site. It's searchable.
http://www.microsoft.com/casestudies/
EB Games uses SQL Server Data Mining, and although a formal case study has yet to be published, have discussed their use of data mining in the press.
http://www.computerworld.com.au/article/436433/tech_ed_2012_eb_games_reaches_high_scores_data_mining/
http://www.gizmodo.com.au/2012/09/eb-games-explains-big-data-strategies/
INRIX Traffic Services leverages sophisticated statistical analysis techniques developed by Microsoft Research to aggregate and enhance traffic-related information from hundreds of public and private sources, including traditional road sensors and the company’s unique network of over 800,000 GPS-enabled vehicles and cellular probes.
http://inrix.com/
FAQ
This section provides answers to some of the questions asked most often.
Why is data mining included as part of Microsoft Analysis Services? Do you need a cube to do data mining?
The data mining features in SQL Server Analysis Services are tightly integrated with both relational and OLAP data sources, to take advantage of the superior speed and functionality of both the database engine and multidimensional engine for retrieving data and creating and storing calculations. Tight integration with other SQL Server products also makes it easy to build predictive analytics solutions in Visual Studio, and to integrate predictive modeling with your processes for data acquisition, data cleansing, reporting, and BI analytics.
But you *definitely* don't need a cube to get started.
- The Data Mining Add-ins use data that is in Excel -- it has to be stored as an Excel table, but that's easy.
- When you create models using BI Development Studio (or SQL Server Data Tools), the Data Mining wizard first asks where your data is coming from -- a cube or relational source -- and provides screens that guide you through choosing the data and applying the model. It's much easier with relational data.
That said, an OLAP cube is a great source of data, since it is already cleaned, formatted, and aggregated in interesting ways. Also, if you have created a mining model based on OLAP data, you can save the results as a data mining dimension. A data mining dimension is an OLAP table that stores the patterns and results in a format that you can use to explore correlations, drill through to details, and get new insight. You can also use the DM dimension as a source, like any dimension table.
This summary compares the two methods of building a data mining model, relational and from a cube: http://msdn.microsoft.com/en-us/library/ms175645.aspx
We recommend you start with relational models.
Typically you will create some kind of view for data mining, to make sure the data is at the right granularity, and to ensure that the data has attributes that are useful for modeling.
For example, if your database captures transactions time-stamped in seconds, you might want to roll up the data into days, weeks, or weeks months, to make it easier to capture patterns over longer periods. If you are analyzing items that occur together and you don't have enough items, you might try collapsing them into categories.
Remember, the point of modeling is to arrive at an abstraction that is just good enough to be useful.
To see an example of how you might prepare relational data for modeling, take a look at these views in the AdventureWorksDW sample database:
- vDMPrep
- vTargetMail
- vTimeSeries
- vAssocSeqOrders
- vAssocSeqItems
Looking at these views, and how they aggregate and simplify data, will help you understand how you can create variables that are useful for clustering, forecasting, or market basket analysis using data in a traditional data warehouse.
What can I use data mining for? How can I apply it in my business?
The introduction to the data mining section of Books Online provides an overview of the data mining process, and how data mining techniques can be applied to common business scenarios.
http://msdn.microsoft.com/en-us/library/ms174949.aspx
For each business problem or opportunity that you identify as a candidate for data mining, you need to find appropriate data, and identify an analytical method, or algorithm, that can be applied to the data to generate patterns and insights. The algorithm that you use -- and you might employ several -- depends on the type of question you are asking and the result you need.
Are you exploring relationships, do you have a specific problem to analyze, or do you want to generate recommendations based on past data?
All these are common uses of data mining.
See this section of Books Online for an introduction of the algorithms provided in SQL Server Data Mining, and the types of problems each is best suited for.
http://msdn.microsoft.com/en-us/library/ms175595.aspx
Which algorithms are used in the Data Mining Add-ins for Excel?
Almost all of the algorithms provided by Analysis Services are available in the add-ins, except for linear regression and sequence clustering.
Table Analysis Tools
Analyze Key Influencers - Naive Bayes
Detect Categories - Clustering
Fill From Example -
Forecast - Time series
Highlight Exceptions - Clustering
Goal Seek - Logistic regression
What If - Logistic regression
Prediction Calculator - Logistic regression
Shopping Basket Analysis -- Association
Data Mining Client for Excel
Classify - Decision trees
Estimate - Decision trees
Cluster - Clustering
Associate - Association rules
Forecast - Time series
And here's one little-known feature in the add-ins!
A set of VBA functions is provided to let you use any models you create directly from Excel.
For example, if you have already created a customer segmentation model, and you add a new table of potential customers into Excel, you could use each row of customer data as input to the model, and get back a prediction as to the best segment for each customer.
See the Help file installed with the add-ins for more information on the functions.
How is data mining different from predictive analytics?
There's no real difference.
Predictive analytics is a newer term, which has become more popular and seems to be more closely aligned with the type of exploration and analysis undertaken for business intelligence.
Data mining over the years has become confused with "data scraping" (which is simply data acquisition) and also seems to have become conflated with data capturing and surveillance.
However, there are some kinds of predictive analytics that are not data mining. These might include:
- simulation, including Monte Carlo simulation
- agent-based modeling
- RFM analysis, historical burn analysis, or other methods for applying past business data to generate indices, baselines, and future values.
- optimization techniques, including genetic algorithms, market response models, territory, routing, or schedule design, etc.
- generation of tag clouds, Kohonen maps, or other aids for visual exploration or data interaction
- most types of regression analysis
Use other tools for these types of analytics.
Getting Started with Data Mining
This section provides simple explanation about some of the concepts that can be confusing when you first get started.
- [[Analysis Services Data Source|Data Source]]
[[Data Source View]] - Algorithms -- explanations, tips, and common questions
Tutorials
Over the years many people have done samples, demos, and tutorials.
Here are some examples to get you started:
Additional Resources
This section includes resources to take you from the "I've seen the demo" stage to putting models into production.
[link TBD]
See Also
Other Languages