Using #PolyBase in #SQLServer2016

It’s been a few weeks since the numerous Build and Ignite announcements ushered in the latest and greatest, SQL Server 2016. After having some time to soak it up (aka I’ve been too busy to blog) we will dive into some of features and capabilities I find most interesting.

Poly-what?

While there are many new features and long overdue updates one of the most interesting is the introduction of PolyBase to the SQL Server platform. For those who have not heard or worked with PolyBase, it is not a totally new feature as it has existed on the PDW…err APS appliance for sometime now. So what is it? In short, PolyBase allows you to use your vast array of T-SQL ninja skills to query, join and manipulate unstructured or semi-structure data stored either in HDFS or Azure Blob Storage directly from the SQL Server relational environment all without learning or understanding the first thing about Java, Map/Reduce, Hive or any of the other big data technologies you’ve heard buzzing around. So let’s see how this works.

Hands-On

A a pre-requisite this post assumes you have an Azure account. If you don’t have one….shame on you…just kidding, you can sign-up for an account and even receive a free $200 trial credit. If you have an MSDN membership you may be eligible for a MSDN Azure account that provides $150 a month in Azure credits. With that out of the way let’s build out a SQL Server 2016 sandbox. Begin by signing into the Azure management portal. From the portal home page, click the ‘+ New’ button, then select Compute, Virtual Machine and From Gallery.

image

The virtual machine gallery consist of all sorts of pre-built VM images that you can select from to simplify the set-up/install process. In our case, we are interested in the SQL Server 2016 CPT2 image found under the SQL Server group.

image

During the set-up I recommend using a minimum VM size of A4 (8 cores, 16Gb memory) although this can vary depending on the size of the sample data you will be playing. After making you configuration selections, fight back the anticipation as your VM will be spun up and ready to go within a few minutes. Once your VM shows as ready on the Virtual Machine dashboard, you can select it and navigate to the dashboard. From the dashboard, you can use the Connect button to initiate a remote desktop session using the credential you specified during set-up.

image

Because you use started with a pre-defined image, SQL Server 2016 is already set-up and ready to go sans PolyBase. To add the PolyBase functionality we much run through the installer, but before we can do that, you much update the Java JRE to a version of 7u55 or higher (https://www.oracle.com/technetwork/java/javase/downloads/index.html). After that is complete, launch the SQL Server 2016 installer which is found at: C:\SQLServer_13.0_Full. Click the Next prompts through the process to Add a Feature to an Existing Installation and then simply select PolyBase. The install may take several minutes and then you are really, really ready to go.

image

Open SQL Server Management Studio (SSMS) and connect to the local, SQL 2016 instance. By default, PolyBase integration is turned off. To enable it use the following command and then restart the SQL Server instance.

 EXEC sp_configure 'hadoop connectivity', 4; 
GO 
RECONFIGURE; 
GO

For this demo, we use the 4 option which indicates that we will be using HDInsight or Azure Blob Storage for PolyBase. There are other options for both Cloudera and Hortonworks Hadoop distributes available. After ensuring you have restarted the instance, you are ready to use PolyBase.

Data Source, File Formats and External Tables

These are three parts of PolyBase that you will need to familiarize yourself with and the MSDN documentation is a great reference as you move beyond the simplistic example given in this post. First we will create a data source. Since, we are using an Azure Blob Storage account, we must first create a credential that will be used to pass the access key to our storage account. The script below will create the database credential with the secret key required to access the blob storage account.

 DBCC TRACEON(4631,-1);

CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123';

CREATE CREDENTIAL wasb_bluewaterdata 
ON DATABASE WITH
 IDENTITY = 'bluewaterdata', 
SECRET='<MY SECRET KEY />';

Once the credential is in place, we can define a data source which references a specific container on a blob storage account using the above credential like so:

 CREATE EXTERNAL DATA SOURCE taxidata
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<CONTAINER />@<ACCOUNT />.blob.core.windows.net/',
    CREDENTIAL = wasb_bluewaterdata
);

Next, we need to create the file format to tell PolyBase how to read the contents of the blobs. In this example, we are simply reading comma-delimited CSV files so our file format will be as follows.

 CREATE EXTERNAL FILE FORMAT CSV 
WITH ( 
    FORMAT_TYPE = DELIMITEDTEXT, 
    FORMAT_OPTIONS ( 
        FIELD_TERMINATOR = ','
    ) 
);

With the two foundational pieces out of the way, we are ready to create a table over top of our “big” data. The example I am using for this post is the NYC Taxi data I used in a previous post. This data as mention before is NYC Taxi trips and fare data in CSV format and includes a header row. I can define an external table that matches the file format as seen below, noting that I’ve referenced my data source, file format and the file folder that contains the data. I also want to point out reject type and value, which in this case accommodates the header row found in the file.

 CREATE EXTERNAL TABLE trips 
( 
    medallion VARCHAR(255),
    hack_license VARCHAR(255),
    vendor_id VARCHAR(255),
    rate_code VARCHAR(255),
    store_and_fwd_flag VARCHAR(255),
    pickup_datetime VARCHAR(255),
    dropoff_datetime VARCHAR(255),
    passenger_count INT,
    trip_time_in_secs INT,
    trip_distance FLOAT,
    pickup_longitude VARCHAR(255),
    pickup_latitude VARCHAR(255),
    dropoff_longitude VARCHAR(255),
    dropoff_latitude VARCHAR(255)
) 
WITH 
( 
    LOCATION = '/trips_small', 
    DATA_SOURCE = taxidata, 
    FILE_FORMAT = CSV,
    REJECT_TYPE = value, 
    REJECT_VALUE=1 
)

image

With my data source, file format and external table created, I can find all three new objects in the SSMS Object Explorer as seen in the screenshot above. I can also query, join or manipulate the data referenced by the new external table like any other SQL Server table all without the data ever residing directly in the SQL Server database. I included a few sample queries that were run directly against the objects created above to round out the demo.

image
image

Wrap-Up

PolyBase is one more tool to help you bridge the gap between your relational and non-relational data stores through the use of your existing SQL skill set. I hope this post served as a general introduction to this technologies capabilities and why it is such an important piece to the SQL Server 2016 release. I also want to point out that as this entire post was written for SQL Server 2016 CTP 2, it is all subject to change before go live. Till next time!

Chris

Comments

  • Anonymous
    July 22, 2015
    If I have a text file instead of the CSV, and I want to get the record count of each word in the text file. Then how can I configure it here, query from SSMS and get the results in 2 columns (word, recordCount)?

  • Anonymous
    October 06, 2015
    So if we are querying "unstructured data" like a csv file in blob storage we still need to figure out all the datatypes before hand? Any tips on speeding up the process of getting column names and data types? Thanks in advance.

  • Anonymous
    November 23, 2015
    Hi, This example is connecting to wasb, using a wasb address in location. In this case polybase can't do a pushdown to hdinsight. How to configure the datasource to enable pushdown ? I followed the polybase documentation and had no success yet. Thank you !