Machine Learning Templates with SQL Server 2016 R Services
By Xinwei Xue, Senior Data Scientist Manager
Microsoft recently launched SQL Server 2016, which, in addition to many other great features, offers in-database advanced analytics with R Services, allowing users to combine the power of SQL Server and Microsoft R Server (or Open Source R), without data leaving the database.
With SQL Server R Services, users can develop analytic models in a local R IDE (e.g., R Tools for Visual Studio or RStudio), while data resides in SQL Server, and computation happens on SQL Server (by setting the compute context to SQL Server).
Once the model is ready for production, it can be operationalized via SQL stored procedures (where R code is encapsulated inside), which can be run within SQL Server Management Studio or called by outside applications to make predictions.
To jump-start users on building advanced analytics applications with SQL Server R Services, Microsoft provides a few data science templates that address real-world scenarios, including: online fraud detection, predictive maintenance, and customer churn prediction . These templates are sample advanced analytics solutions that demonstrate best practices and provide building blocks to help users implement a solution quickly. Each template is designed to solve a specific problem, and includes sample data, R code (which uses the highly scalable Microsoft R Server ScaleR APIs) and SQL stored procedure code that extends from data preparation and feature engineering to model training and scoring. The code runs in R IDE (with computation done in SQL Server) or SQL Server client (SQL Server Management Studio) respectively. A Windows PowerShell script is provided to run SQL stored procedures end-to-end. The collection of templates can be found here and more information on each is provided below.
Online Fraud Detection Template (SQL Server R Services)
One of the important tasks of online businesses is to detect fraudulent transactions and identify transactions made by stolen payment instruments or credentials in order to reduce charge back losses. When fraudulent transactions are discovered, online store businesses typically take measures to block related accounts as soon as possible, to prevent further losses. In this scenario, you'll learn how to use data from online purchase transactions to identify likely fraud. Fraud detection is solved as a binary classification problem. The methodology used in this template can easily be applied to fraud detection in other domains. The template shows you how to perform data processing and feature engineering in both R and SQL, while training, evaluating and scoring in the R IDE (development) and SQL stored procedures that call Microsoft R Server code (operationalization).
Customer Churn Prediction Template (SQL Server R Services)
Analyzing and predicting customer churn is important in any industry where the loss of customers to competitors must be managed and prevented – banking, telecommunications, and retail to name a few. The goal of churn analysis is to identify which customers are likely to churn, and then take appropriate actions to retain such customers and keep their business.
This template get you started with churn prevention by formulating the churn problem as a binary classification problem. It uses sample retail data from two sources, customer demographics and customer transactions, to classify customers as likely or unlikely to churn. This template provides SQL stored procedures showing you how to perform data processing and feature engineering in SQL, while training, evaluating and scoring in SQL stored procedures that call Microsoft R Server and Open Source R code.
Predictive Maintenance Template (SQL Server R Services)
The goal of "data-driven" predictive maintenance is to increase the efficiency of maintenance tasks by capturing past failures and using that information to predict when or where a device might fail. The ability to forecast device obsolescence is particularly important for applications that rely on distributed data or sensors, as exemplified by the Internet of Things (IoT).
This template focuses on answering the question of “When will an in-service machine fail?” The input data represents simulated sensor measurements for aircraft engines. Data obtained from monitoring the engine’s current operation conditions, such as the current working cycle, settings, sensor measurements and so forth, are used to create three types of predictive models:
- Regression models, to predict how much longer an engine will last before it fails. The sample model predicts the metric Remaining Useful Life (RUL), also called Time to Failure (TTF).
- Classification models, to predict whether an engine is likely to fail. The binary classification model predicts if an engine will fail within a certain timeframe (number of days). The multi-class classification model predicts whether a particular engine will fail, and if it will fail, provides a probable time window of failure. For example, at a given time, you can predict whether any device is likely to fail within 15 days, or within 30 days.
These templates with SQL Server R Services are equivalent to the ones we published earlier with Azure ML Studio. Our team plans to continue publishing more such templates, so be sure to check back often or subscribe to this blog and stay in the loop.
Xinwei