Predictive Analytics (Data Mining): Advanced Topics
This page is in progress. Please hold off on making edits till I can add more content and clean it up a bit! |
Overview
This page is part 2.
This page provides an answer to some of the frequently asked "intermediate to advanced" questions about SQL Server Data Mining.
An effort has been made to provide links to other resources related to analytics, visualization, modeling, and prediction.
- If you are looking for the official content, please see the MSDN library.
- Note that there is a separate Wiki Community page for SQL Server Analysis Services
- There is also a Wiki page for for PowerPivot. And don't miss the DAX Resource Center if you are looking for formula tips!
- For information on visualizations, check out the Wiki Community page for Power View.
Examples of Articles for this DM Wiki
Please feel welcome to add links to the following sections, create new articles, or or recommend articles that can be referenced from this Wiki.
Articles included here should touch on at least one of these subjects:
- Applications of data mining or predictive analytics, using any of the Microsoft tools (including Transact SQL!)
- Queries and applications using DMX and XMLA
- Predictive analytics using tabular or OLAP models
- Visualization as a way of modeling or interpreting statistical models
- Performance tuning and monitoring of data mining solutions
- Use of AMO and interfaces such as OLEDB to programmatically work with data mining objects.
Other Modeling Tools and Methods
If SQL Server Data Mining does not meet your needs, there are other modeling tools available to you, including extensions to Excel, third-party add-ins for Excel, T-SQL libraries, and of course code samples that you can customize. These may require some programming.
Microsoft Solver Foundation
This free download is a .Net runtime for mathematical programming, modeling, and optimization. Solver Foundation provides powerful libraries for .NET developers seeking to use optimization in their solutions, making it possible for non-experts to define and solve models in their applications.
Where to Get It
Download it here:
Release note: http://blogs.msdn.com/b/solverfoundation/archive/2011/09/12/microsoft-solver-foundation-v3-1-released.aspx
Benefits and Uses
The Solver Foundation has been around a while and there are many samples. Third-party vendors have stepped up to support it.
You can use a plug-in with Excel.
Also provides interface with SharePoint
You must use IronPython (the .NET version of Python)
Visual Studio
F# runtime desired but not required
Drawbacks and Requirements
The current 3.1 release of MSF will be the last release as a standalone install. Microsoft Solver Foundation will be integrated into a larger analytics framework that will help users build both prescriptive and predictive analytics. This new product will provide a migration path for current Solver Foundation users and partners.
Help and Samples
The help has been published to MSDN: http://msdn.microsoft.com/en-us/library/ff524509(VS.93).aspx
The Solver Foundation blog (now quiescent) contains some tips and tricks: http://blogs.msdn.com/b/solverfoundation/
TimeSeries Foundation
The Time Series Foundation (TSF) is an open, .NET platform for exploring and prototyping new algorithms in time series analysis and forecasting. The Time Series foundation is based on the Sho framework but provides advanced functionality for analyzing time series.
Where to Get It
You can download it from the MSR projects site: http://research.microsoft.com/en-us/downloads/d599b5d2-5a99-4ee5-8c04-c180eb9044d9/default.aspx
Benefits and Uses
The Time Series Foundation is based on state space model methodology and includes all types of exponential smoothing, some autoregressive algorithms, and innovative algorithms for event detection and calendar event impact prediction.
The authors provided the time series algorithms implemented in SQL Server Data Mining, but here provide many more options, algorithms, and ways to tweak and optimize a time series.
TSF relies on Excel charting and presentation APIs by implementing an Excel interop layer. Numerical and graphical results of time series analysis and forecasting can be put in programmatically generated workbooks with the help of the Excel layer.
TSF also offers an Excel add-in that exposes a large subset of the platform's functionality through the Excel ribbon UI.
Drawbacks and Requirements
Requires development
Under MSR "no commercial use" License
Help and Samples
The developers have a blog in which they provide tips and guidance: http://blogs.msdn.com/b/alexeib/
Sho
Sho is another DevLabs project that is popular in Microsoft Research.
Sho provides an interactive environment for data analysis and scientific computing that might be useful to people working on technical analysis.
Where to Get It
Installers, documentations, and demos of Sho are available from its MSR project site, here: http://research.microsoft.com/en-us/projects/sho/
Benefits and Uses
The Sho framework lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development. The seamless nature of our environment allows for some unique capabilities:
- Performing fast linear algebra and statistical computations from either C# or Python
- Adding visualizations (such as plots and bar charts) to your C# or Python applications
- Creating rich applications by lashing together C# libraries using Python
- Debugging mixed python/C# code using Visual Studio 2010
Drawbacks and Requirements
You have to be a developer.
Requires Visual Studio 2010, C#, F#, and Iron Python.
Under MSR license which precludes commercial use.
Help and Samples
To get you started, Sho comes with packages for large-scale parallel computing (via HPC and the Azure), statistics, and optimization, as well as an extensible package mechanism that makes it easy for you to create and share your own packages.
Sho blog: http://blogs.msdn.com/b/the_blog_of_sho/
The documentation might give you a better idea of what is included: http://research.microsoft.com/en-us/um/redmond/projects/sho/documentation/sho2.0/PackageReference/
The Book of Sho includes tutorials: http://research.microsoft.com/en-us/um/redmond/projects/sho/documentation/sho2.0/TheBookofSho.html
These videos illustrate some uses: http://research.microsoft.com/en-us/projects/sho/videos.aspx
Resources and Web pages
The data mining forum on MSDN should be your first stop when getting help. The forum is active.
http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/threads
This site created by the original SQL Server Data Mining team has many resources. In particular, the tips ad tricks section includes code samples, many DMX snippets, and explanations of how to interpret the algorithms.
http://www.sqlserverdatamining.com/
Samples
The R-Squared Sample
In this download, developer Yimin Wu provides sample code for calculating R-squared for a model, using T-SQL. Could be adapted for calculating one-way anlysis of variance (ANOVA)
http://www.microsoft.com/casestudies/
Monte Carlo simulation
This walkthrough by Hilmar Buchta demonstrates how you can use Monte Carlo simulation in T-SQL to verify that the results of your data mining model are valid and represent a statistically signficant improvement over random results. The paper includes all code samples and excellent explanations of the math.
http://msdn.microsoft.com/en-us/library/gg557481.aspx
Thin Client Viewing with Visio 2007
This sample provides a thin client that emulates the data mining temp[lates in Visio.
http://www.sqlserverdatamining.com/ssdm/VisioThinClient
You can extend these to vidualize other model types.
Sample includes:
http://www.sqlserverdatamining.com/ssdm/VisioThinClienthttp://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=78http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=78
XMLA Thin Miner
The XMLA client sample demonstrates how to build DMX statements ont he fly and send them to a server.
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=80
The Art Of Clustering
This sample provides source code for a custom viewer that uses in-memory clustering models.
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=79
This sample includes source code, and many custom controls.
Data Validation Sample
This sample demonstrates how data mining can be embedded into an application, to performan data validation.
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83
Writing a Custom Algorithm
The following webcast for 2005 describes how to write a custom algorithm, but there have not been signficant changes.
You will need to update the header files.
From an object model programming perspective, there are some changes in AMO/ADOMD properties and related.
These changes are self-explanatory as long as you are aware of the new functionality. For example, you nee to account for properties such as HoldoutActualSize, and HoldoutMaxCases in your model.
Here are some additional links to resources that might help you design a custom algorithm:
The complete tutorial is here (login required): http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=94&Id=165
A downloadable (PDF) white paper provides an overview of the architecture: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=96&Id=339
Writing a Custom Provider
SQL Server Analysis Services allows other algorithm vendors to develop their own OLE DB for DM provider and plug it into Analysis Services platform.
A white paper discussing the implementation of aggregated data mining providers is available here: http://www.msnusers.com/AnalysisServicesDataMining/Documents/Files/DMAggregator%20white%20paper.zip
The sample code includes the complete implementation of an aggregated provider as well as the following:
- All required OLE DB objects, such as session, command, and rowset
- The OLE DB for Data Mining syntax parser
- Tokenization of input data
- Query processing engine
- A sample Naive Bayes algorithm
- Model archiving in XML and binary formats
The sample provider source code is available to download at http://download.microsoft.com/download/2/4/1/241c5e74-2275-470b-97c8-d3f4a210b9d3/DMSampleProvider.zip.
Data Mining Web Controls Library
There are various possible "front-end" solutions for workign with data mining models in a browser. Some of the possible options are:
- Using SQL 2008 Data Mining add-ins for Excel
- Using WinForms SQL 2008 Data Mining controls
- Reporting Services
- Rhe Data Mining Web Controls Library, available from CodePlex: [LINK]
The Data Mining Web Controls Library retains some of the flexibility of WinForms control, while giving Web access to the mining models.
Caveats:
- They do require some customization.
- The library has not been updated lately.They were primarily intended as a sample and have not been retested with SQL Server 2008. However, they should work with 2008 (and R2).
Advanced Topics FAQ
This section lists some questions that get asked a lot but are harder to answer (and don't get answered at all!)
PMML support
Q. Can I export my trained model into PMML v4? Can I use PMML to move models? Are all algorithms supported in PMML format?
A. SQL 2000 DM produced a form of PMML that had proprietary extensions to support unique features provided by our algorithms that were not available in the original PMML spec.
In SQL 2005, we moved to a newer version of the PMML standard (2.1) that now includes standard support for these features, though we continue to support the original version from SQL 2000.
No, PMML 4.0 is not supported.
The only algorithms that have support (for PMML 2.1) are Decision Trees and Clustering.
If the algorithm supports PMML, you can obtain the PMML representation by using one of these methods:
- You can obtain the PMML representation of a mining model content from the DMSCHEMA_MINING_MODEL_XML schema rowset.
- You can obtain the same result by executing "SELECT * FROM model.PMML".
This forum thread has some comments on PMML: http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/11a8d222-ac2b-4843-99ac-b4a7ff65aadf/
Additional Resources
If you would like to see more general information about predictive analytics, or "getting started" materials, see this page: http://social.technet.microsoft.com/wiki/contents/articles/13672.data-mining-and-predictive-analytics.aspx