Data Mining with SQL Server 2008
Last week Bogdan and I went through the final steps in completing the SQL (sic) to the wildly popular "Data Mining with SQL Server 2005", cleverly titled "Data Mining with SQL Server 2008". We went through the final PDF's and signed off that the companion web content is good to go. I am definitely pleased with how this book turned out. I listened to the feedback on the first book and pressed the authors and the editors to ensure that this book has much higher "initial code quality" as you might call it.
Writing a book isn't really all that much different than writing software, particularly when the book is about software, I suppose. The authors are the developers and the editors are the testers, and I have to say that the editors took us to task ensuring that every phrase turned correctly and no word was used wastefully. It was great help that, unlike the first edition, we were able to have the same editors throughout, and even though at times it seemed they were being a bit picky :), they were able to catch similar phrases used differently across multiple chapters and made a huge impact on the final text. Also, Shuvro, from the DM team here, manually verified each and every query and line of code that is included on the companion site. (I suppose that it helps that the book was written after the product was completed, so that API's and language constructs didn't change in flight)
So, readers of the first book will probably want to know what's new and different about this one. Overall, the book has generally the same structure as the first - so it's evolutionary rather than revolutionary. That being said, I would guess that at over half the content is new or heavily modified. For example, I rewrote most of the introduction, there are two new chapters covering the Data Mining Addins for Office, and the "OLE DB for Data Mining" chapter was completely reimagined and rewritten as "Data Mining Concepts and DMX." Each of the algorithm chapters was updated with respect to new features and comments from forums and newsgroups. Even where there were no significant changes to the algorithm between 2005 and 2008, the chapters were reorganized to focus on the practical application first and leave the technical implementation details to the latter portion of the chapter. Also, Bogdan, borrowing from his 50-page epic whitepaper SQL Server Data Mining Programmability, took over and greatly enhanced the Architecture and API chapters.
According to Amazon, the book will be released on November 10th and you can pre-order now. Since the "read inside" feature isn't available yet (likely since the book is being printed as I write this), I've taken the liberty and pasted the table of contents to make this probably the longest blog post I've ever had the pleasure of pasting. Anyway, I hope this gives you a good idea of what's coming in the book so you can preorder with confidence! :)
Foreword xxix
Introduction xxxi
Chapter 1 Introduction to Data Mining in SQL Server 2008 1
Business Problems for Data Mining 4
Data Mining Tasks 6
Classification 6
Clustering 6
Association 7
Regression 8
Forecasting 8
Sequence Analysis 9
Deviation Analysis 9
Data Mining Project Cycle 9
Business Problem Formation 10
Data Collection 10
Data Cleaning and Transformation 10
Model Building 12
Model Assessment 12
Reporting and Prediction 12
Application Integration 13
Model Management 13
Summary 13
Chapter 2 Applied Data Mining Using Microsoft Excel 2007 15
Setting Up the Table Analysis Tools 16
Configuring Analysis Services with Administrative Privileges 17
Configuring Analysis Services without Administrative
Privileges 18
What the Add-Ins Expect 19
What to Do If You Need Help 22
The Analyze Key Influencers Tool 22
The Main Influencers Report 24
The Discrimination Report 26
Summary of the Analyze Key Influencers Task 28
The Detect Categories Tool 28
Launching the Tool 29
The Categories Report 30
Categories and the Number of Rows in Each 30
Characteristics of Each Category 31
The Category Profiles Chart 32
Summary of the Detect Categories Tool 34
The Fill From Example Tool 35
Running the Tool and Interpreting the Results 36
Refining the Results 38
Summary of the Fill From Example Tool 39
The Forecasting Tool 39
Launching the Tool and Specifying Options 40
Interpreting the Results 42
Summary of the Forecast Tool 44
The Highlight Exceptions Tool 44
Using the Tool 45
More Complex Interactions 48
Limitations and Troubleshooting 50
Summary of the Highlight Exceptions Tool 51
The Scenario Analysis Tool 51
The Goal Seek Tool 53
Using Goal Seek for a Numeric Goal 56
Using Goal Seek for the Whole Table 57
TheWhat-If Tool 58
UsingWhat-If for the Whole Table 61
Summary of the Scenario Analysis Tool 62
The Prediction Calculator Tool 62
Running the Tool 64
The Prediction Calculator Spreadsheet 65
The Printable Calculator Spreadsheet 67
Refining the Results 68
Using the Results 73
Summary of the Prediction Calculator Tool 73
The Shopping Basket Analysis Tool 74
Using the Tool 75
The Bundled Item Report 76
The Recommendations Report 77
Tweaking the Tool 79
Summary of the Shopping Basket Analysis Tool 81
Technical Overview of the Table Analysis Tools 81
Summary 82
Chapter 3 Data Mining Concepts and DMX 83
History of DMX 83
Why DMX? 84
The Data Mining Process 85
Key Concepts 86
Attribute 86
State 87
Case 88
Keys 89
Inputs and Outputs 91
DMX Objects 93
Mining Structure 93
Mining Model 94
DMX Query Syntax 95
Creating Mining Structures 96
Discretized Columns 97
Nested Tables 98
Partitioning into Testing and Training Sets 99
Creating Mining Models 100
Nested Tables 101
Complex Nesting Scenarios 104
Filters 107
Populating Mining Structures 108
Populating Nested Tables 110
Querying Structure Data 112
Querying Model Data 112
Prediction 115
Prediction Join 116
Prediction Query Syntax 116
Nested Source Data 117
Real-Time Prediction 118
Degenerate Predictions 119
Prediction Functions 120
PredictNodeID 122
External and User-Defined Functions 123
Predictions on Nested Tables 123
Predicting Nested Value Columns 124
Summary 125
Chapter 4 Using SQL Server Data Mining 127
Introducing the Business Intelligence Development Studio 128
Understanding the User Interface 128
Offline Mode and Immediate Mode 130
Immediate Mode 131
Getting Started in Immediate Mode 131
Offline Mode 132
Getting Started in Offline Mode 133
Switching Project Modes 135
Creating Data Mining Objects 135
Setting Up Your Data Sources 135
Understanding Data Sources 136
Creating the MovieClick Data Source 137
Using the Data Source View 137
Creating the MovieClick Data Source View 138
Working with Named Calculations 140
Creating a Named Calculation on the Customers Table 142
Working with Named Queries 142
Creating a Named Query Based on the Customers Table 143
Organizing the DSV 144
Exploring Data 145
Creating and Editing Models 148
Structures and Models 148
Using the Data Mining Wizard 148
Creating the MovieClick Mining Structure and Model 155
Using Data Mining Designer 157
Working with the Mining Structure Editor 157
Adding the Genre Column to the Movies Nested Table 159
Working with the Mining Models Editor 160
Creating and Modifying Additional Models 163
Processing 164
Processing the MovieClick Mining Structure 165
Using Your Models 166
Understanding the Model Viewers 166
Using the Mining Accuracy Chart 167
Selecting Test Data 168
Understanding the Accuracy Charts 169
Using the Profit Chart 172
Multiple Target Accuracy Charts 172
Using the Classification Matrix 173
Scatter Accuracy Charts 173
Creating a Lift Chart on MovieClick 174
Using CrossValidation 174
Using the Mining Model Prediction Builder 178
Executing a Query on the MovieClick Model 179
Creating Data Mining Reports 180
Using SQL Server Management Studio 181
Understanding the Management Studio User Interface 182
Using Server Explorer 182
Using Object Explorer 183
Using the Query Editor 184
Summary 185
Chapter 5 Implementing a Data Mining Process Using Office 2007 187
Importing Data using the Data Mining Client 189
Data Exploration and Preparation 190
Discretizing Data with the Explore Data Tool 191
Chopping Off the Long Tail 191
Consolidating Meaning 192
Eliminating Spurious Values 194
Rebalancing Data 195
Modeling 196
Task-Based Modeling 196
Introduction 198
Select Data 198
Select Columns and Options 198
Split Data 200
Finishing the Task 200
Advanced Modeling in the Data Mining Client 200
Accuracy and Validation 203
Model Usage 204
Browsing Models 204
Viewing Models with Visio 205
Querying Models 208
QueryWizard 208
Data Mining Cell Functions 211
DMPREDICT 211
DMPREDICTTABLEROW 212
DMCONTENTQUERY 212
Model Management 213
Trace 213
Summary 213
Chapter 6 Microsoft Naıve Bayes 215
Introducing the Naıve Bayes Algorithm 216
Using the Naıve Bayes Algorithm 216
Creating a Predictive Model 217
Data Exploration 219
Analysis of Key Influencers 219
Document Classification 220
DMX 222
Drill-Through 222
Understanding Naıve Bayes Content 223
Exploring a Naıve Bayes Model 225
Dependency Network 225
Attribute Profiles 226
Attribute Characteristics 227
Attribute Discrimination 228
Understanding Naıve Bayes Principles 229
Limitations of the Naıve Bayes Algorithm 231
Naıve Bayes Parameters 233
MAXIMUM INPUT ATTRIBUTES 233
MAXIMUM OUTPUT ATTRIBUTES 233
MAXIMUM STATES 233
MINIMUM DEPENDENCY PROBABILITY 234
Summary 234
Chapter 7 Microsoft Decision Trees Algorithm 235
Introducing Decision Trees 236
Using Decision Trees 237
Creating a Decision Tree Model 237
DMX Queries 237
Classification Model 237
Regression Model 239
Association 241
Model Content 244
Interpreting the Model 244
Decision Tree Principles 248
Basic Concepts of Tree Growth 248
Working with Many States in an Attribute 251
Avoiding Overtraining 252
Incorporating Prior Knowledge 252
Feature Selection 253
Using Continuous Inputs 253
Regression 254
Association Analysis with Microsoft Decision Trees 255
Parameters 256
COMPLEXITY PENALTY 257
MINIMUM SUPPORT 257
SCORE METHOD 257
SPLIT METHOD 258
MAXIMUM INPUT ATTRIBUTES 258
MAXIMUM OUTPUT ATTRIBUTES 258
FORCE REGRESSOR 258
Stored Procedures 259
Summary 260
Chapter 8 Microsoft Time Series Algorithm 263
Overview 264
Usage 265
Time Series Scenarios 267
Performing a Simple Forecast 267
Predicting Interdependent Series 268
Understanding Your Time Series 268
What-If Scenarios 269
Predicting New Series 269
DMX 270
Model Creation 270
Model Processing 272
Forecasting 274
Returning Supplemental Statistics 275
Changing the Future —Executing a What-If Forecast 276
Forecasting with Little Data —Applying Models to New
Data 277
Drill-Through 280
Principles of Time Series 280
Autoregression 281
Periodicity 281
Autoregression Trees 282
Prediction 284
Parameters 285
MISSING VALUE SUBSTITUTION 285
PERIODICITY HINT 286
AUTO DETECT PERIODICITY 286
MINIMUM and MAXIMUM SERIES VALUE 286
FORECAST METHOD 286
PREDICTION SMOOTHING 287
INSTABILITY SENSITIVITY 287
HISTORIC MODEL COUNT and HISTORIC MODEL GAP 287
COMPLEXITY PENALTY and MINIMUM SUPPORT 288
Model Content 289
Summary 289
Chapter 9 Microsoft Clustering 291
Overview 292
Usage of Clustering 294
Performing a Clustering 295
Clustering as an Analytical Step 297
Anomaly Detection Using Clustering 297
DMX 299
Model Creation 300
Drill-Through 301
Cluster 301
ClusterProbability 301
PredictHistogram 302
PredictCaseLikelihood 302
Model Content 303
Understanding Your Cluster Models 304
Get a High-Level Overview 305
Pick a Cluster and Determine How It Is Different from the
General Population 307
Determine How the Cluster Is Different from Nearby
Clusters 308
Verify that Your Assertions Are True 309
Label the Cluster 309
Principles of Clustering 309
Hard Clustering versus Soft Clustering 311
Discrete Clustering 312
Scalable Clustering 313
Clustering Prediction 314
Parameters 314
CLUSTERING METHOD 314
CLUSTER COUNT 315
MINIMUM CLUSTER CASES 315
MODELLING CARDINALITY 316
STOPPING TOLERANCE 316
SAMPLE SIZE 316
CLUSTER SEED 317
MAXIMUM INPUT ATTRIBUTES 317
MAXIMUM STATES 318
Summary 318
Chapter 10 Microsoft Sequence Clustering 319
Introducing the Microsoft Sequence Clustering Algorithm 320
Using the Microsoft Sequence Clustering Algorithm 320
Creating a Sequence Clustering Model 321
DMX Queries 322
Executing Cluster Predictions 323
Executing Sequence Predictions 323
Extracting the Probability for the Sequence Predictions 325
Using the Histogram of the Sequence Predictions 326
Detecting Unusual Sequence Patterns 329
Interpreting the Model 329
Cluster Diagram 330
Cluster Profiles 331
Cluster Characteristics 331
Cluster Discrimination 333
State Transitions 333
Microsoft Sequence Clustering Algorithm Principles 334
Understanding a Markov Chain 334
Order of a Markov Chain 335
State Transition Matrix 336
Clustering with a Markov Chain 337
Cluster Decomposition 339
Model Content 339
Algorithm Parameters 340
CLUSTER COUNT 340
MINIMUM SUPPORT 340
MAXIMUM STATES 341
MAXIMUM SEQUENCE STATES 341
Summary 341
Chapter 11 Microsoft Association Rules 343
Introducing Microsoft Association Rules 344
Using the Association Rules Algorithm 344
Data Exploration Models 345
A Simple Recommendation Engine 346
Advanced Cross-Sales Analysis 349
DMX 351
Model Content 355
Interpreting the Model 357
Association Algorithm Principles 359
Understanding Basic Association Algorithm Terms and
Concepts 359
Itemset 360
Support 360
Probability (Confidence) 361
Importance 361
Finding Frequent Itemsets 363
Generating Association Rules 366
Prediction 367
Algorithm Parameters 368
MINIMUM SUPPORT 368
MAXIMUM SUPPORT 368
MINIMUM PROBABILITY 368
MINIMUM IMPORTANCE 368
MAXIMUM ITEMSET SIZE 369
MINIMUM ITEMSET SIZE 369
MAXIMUM ITEMSET COUNT 369
OPTIMIZED PREDICTION COUNT 369
AUTODETECT MINIMUM SUPPORT 369
Summary 370
Chapter 12 Microsoft Neural Network and Logistic Regression 371
Same Principle, Two Algorithms 372
Using the Microsoft Neural Network 373
Text Classification Models 373
Utility Models 378
DMX Queries 378
Model Content 381
Interpreting the Model 382
Principles of the Microsoft Neural Network Algorithm 384
What Is a Neural Network? 385
Combination and Activation 387
Backpropagation, Error Function, and Conjugate Gradient 389
A Simple Example of Processing a Neural Network 390
Normalization and Mapping 392
Topology of the Network 393
Training the Ending Condition 394
Nonlinearly Separable Classes 395
Algorithm Parameters 396
MAXIMUM INPUT ATTRIBUTES 396
MAXIMUM OUTPUT ATTRIBUTES 396
MAXIMUM STATES 396
HOLDOUT PERCENTAGE 397
HOLDOUT SEED 397
HIDDEN NODE RATIO 397
SAMPLE SIZE 397
Summary 397
Chapter 13 Mining OLAP Cubes 399
Introducing OLAP 400
Understanding Star and Snowflake Schemas 401
Understanding Dimension and Hierarchy 402
Understanding Measures and Measure Groups 404
Understanding Cube Processing and Storage 404
Using Proactive Caching 405
Querying a Cube 406
Performing Calculations 407
Browsing a Cube 408
Understanding Unified Dimension Modeling 408
Understanding the Relationship between OLAP and Data
Mining 413
Mining Aggregated Data 414
OLAP Pattern Discovery Needs 415
OLAP Mining versus Relational Mining 415
Building OLAP Mining Models Using Wizards and Editors 417
Using the Data Mining Wizard 417
Building the Customer Segmentation Model 417
Creating a Market Basket Model 420
Creating a Sales Forecast Model 424
Using the Data Mining Designer 428
Understanding Data Mining Dimensions 429
Using MDX within DMX Queries 432
Using Analysis Management Objects for the OLAP Mining
Model 434
Summary 438
Chapter 14 Data Mining with SQL Server Integration Services 439
An Overview of SSIS 440
Understanding SSIS Packages 442
Task Flow 442
Standard Tasks in SSIS 442
Containers 443
Debugging 444
Exploring a Control Flow Example 444
Data Flow 444
Transformations 445
Viewers 446
Exploring a Data Flow Example 447
Working with SSIS in Data Mining 447
Data Mining Tasks 448
Data Mining Query Task 449
Analysis Services Processing Task 452
Analysis Services Execute DDL Task 453
Data Mining Transformations 455
Data Mining Model Training Destination 455
Data Mining Query Transformation 458
Example Data Flows 462
Using Non-Predictive Data Mining Queries in an
Integration Services Pipeline 463
Text Mining Transformations 464
Term Extraction Transformation 465
Term Lookup Transformation 467
More Details on the Text Mining Process 470
Summary 472
Chapter 15 SQL Server Data Mining Architecture 475
Introducing Analysis Services Architecture 476
XML for Analysis 476
XMLA APIs 477
Discover 478
Execute 479
XMLA and Analysis Services 480
Processing Architecture 482
Predictions 486
Data Mining Administration 487
Server Configuration 488
Data Mining Security 489
Security Requirements for Creating and Training Mining
Objects 491
Security for Various Deployment Scenarios 491
Local Database and Analysis Services 492
Local Analysis Services and a Remote Database 493
Intranet Analysis Services and Databases on the Same
Server 493
Analysis Services and Databases behind an HTTP
Endpoint in an Internet Deployment 494
Configuring Analysis Services for Use with Data Mining
Excel Add-Ins over HTTP 495
Summary 496
Chapter 16 Programming Sql Server Data Mining 497
Data Mining APIs 498
ADO 498
ADO.NET 500
ADOMD.NET 501
Server ADOMD.NET 501
AMO 501
Using Analysis Services APIs 502
Using Microsoft.AnalysisServices to Create and Manage
Mining Models 502
AMO Basics 503
AMO Applications and Security 505
Object Creation 506
Creating Data Access Objects 507
Creating the Mining Structure 510
Creating the Mining Models 512
Processing Mining Models 513
Deploying Mining Models 515
Setting Mining Permissions 516
Browsing and Querying Mining Models 517
Predicting with ADOMD.NET 517
More on Table-Valued Parameters in ADOMD.NET 522
Browsing Models 525
Stored Procedures 527
Writing Stored Procedures 529
Stored Procedures and Prepare Invocations 530
A Stored Procedure Example 530
Executing Queries inside Stored Procedures 533
Returning Data Sets from Stored Procedures 534
Deploying and Debugging Stored Procedure Assemblies 537
Summary 538
Chapter 17 Extending SQL Server Data Mining 541
Plug-in Algorithms 542
Plug-in Algorithm Framework 543
Lifetime of a Plug-in Algorithm Instance 543
Conceptual Overview 545
Model Creation and Processing 547
Prediction 553
Content Navigation 554
Custom Functions 555
PMML 557
Managed vs. Native Plug-ins 557
Installing Plug-in Algorithms 558
Where to Find Out More about Plug-in Algorithms 558
Data Mining Viewers 558
Interfaces to Be Implemented 559
Rendering the Information 559
Retrieving Information from Analysis Services 560
Registering the Viewer 561
Where to Find Out Mode about Plug-in Viewers 561
Summary 562
Chapter 18 Implementing a Web Cross-Selling Application 563
Source Data Description 564
Building Your Model 564
Identifying the Data Mining Task 564
Using Decision Trees for Association 565
Using the Association Rules Algorithm 567
Comparing the Two Models 568
Making Predictions 570
Making Batch Prediction Queries 570
Using Singleton Prediction Queries 572
Integrating Predictions with Web Applications 573
UnderstandingWeb Application Architecture 573
Setting the Permissions 574
Examining Sample Code for the Web Recommendation
Application 575
Summary 578
Chapter 19 Conclusion and Additional Resources 581
Recapping the Highlights of SQL Server 2008 Data Mining 581
State-of-the-Art Algorithms 582
Easy-to-Use Tools 583
Simple-Yet-Powerful API 584
Integration with Sibling BI Technologies 584
Exploring New Data Mining Frontiers and Opportunities 585
Further Reference 586
Microsoft Data Mining 586
General Data Mining 586
Appendix A Data Sets 589
Appendix B Supported Functions 595
Index 607
Comments
Anonymous
October 02, 2008
PingBack from http://www.easycoded.com/data-mining-with-sql-server-2008/Anonymous
November 05, 2008
I'm often asked "what references should I have for ... <Insert Technology here>. " ForAnonymous
March 10, 2012
Great , Thought this might be useful, I found this free tool that integrates credit checking and credit scoring by directly credit checking Microsoft Dynamics CRM, credit checking in Microsoft AX 2012, Business Contact Manager and credit scoring directing into Existing spreadsheets for Microsoft Excel. Best part is its completely free. Here’s the Install link and a YouTube link underneath that: pinpoint.microsoft.com/.../application... http://youtu.be/abwI7cc3Yio Hope it’s useful for you guys.