Data Science Walkthrough with SQL Server 2017 and Microsoft Machine Learning Services
This post is authored by Xibin Gao, Wei Guo, and Debraj GuhaThakurta at Microsoft.
Microsoft Machine Learning Services were a key highlight of our SQL Server 2017 CTP 2.0 release in April this year.It allows Python scripts to run within SQL Server or be embedded in SQL scripts and be deployed as stored procedures. This feature essentially brings Python visualization and predictive analytics capabilities close to the data stored within SQL Server. Data scientists can combine the powers of SQL and Python and build end-to-end machine learning solutions with much greater ease.
To accompany the release of CTP2.0, we have developed a walkthrough that uses Microsoft Machine Learning Services and provides an end-to-end data science solution for predictive modeling. The walkthrough is based on a popular public data set, namely the New York City taxi dataset. It uses a combination of Python code, SQL Server data, custom SQL functions and stored procedures to build a classification model that predicts the probability that a driver will get a tip on a given cab ride. In the end, it deploys the Python model to SQL Server and uses server data to generate scores based on the model. We previously released an R Services version of this walkthrough.
The walkthrough has six steps. These steps guide you through the data science process, right from data preprocessing, data visualization, feature engineering and model building to operationalization:
Step 1: Download the sample data and the sample SQL scripts to a Windows machine.
Step 3: Explore and visualize the data by calling Python functions from Transact-SQL stored procedures. The plot objects are persisted in the database. A client Python code can retrieve and save those plots. The figure below, of tip amount vs. fare amount for taxi trips, shows an example of the plot.
Step 4: Create data features using custom T-SQL functions.
Step 5: Build and persist machine learning models using stored procedures.
Step 6: Retrieve and de-serialize the model, and then make predictions in batch mode or single mode.
In steps 5 and 6, we have shown how to use modeling and prediction functions from the popular scikit-learn package, as well as the newly released revoscalepy library from Microsoft which supports distributed computing, remote compute contexts and high-performance algorithms for Python.
This data science walkthrough is self-contained, and you can run it step by step with minimal changes. Please try it out and share your feedback with us.
We hope you enjoy using the new Machine Learning Services in SQL Server 2017 for building out your predictive solutions and intelligent apps.
Xibin, Wei & Debraj