SQL Server 2019 : Import Data from CSV in Azure Blob Storage using T-SQL and Polybase
Introduction
PolyBase enables us to write Transact-SQL queries in SQL Server that would read data from external data sources. SQL Server 2016 and higher can access external data in Hadoop and Azure Blob Storage. In this article we will see how to Setup Polybase for SQL server 2019 and use it to query a CSV file in Azure Blob Storage
Configure Polybase in SQL Server 2019
Before we start with Poybase installation, ensure that you have Java runtime installed from here, else you may get the below error “Unable to load DLL jvm.dll”
Download the latest JRE depending on the system bit.
Lets configure polybase in SQL Server 2019 by using the installation file.
Select the ‘New SQL Server stand-along or add features to an existing installation’ to add the polybase feature.
Select the Polybase checkbox and click on Next to proceed with the installation of the feature.
Also ensure that the Polybase engine services are running from services.msc
Now that we have added the feature, Finally, enable Polybase, use the following configuration script:
USE master;
GO
EXEC sp_configure 'show advanced option', '1';
RECONFIGURE;
GO
EXEC sp_configure 'hadoop connectivity', 4;
GO
RECONFIGURE;
GO
Show advanced option 1 is used to enable advanced option and hadoop connectivity 4, is used to connect to Hadoop in Windows Servers.
Create a Blob Storage Container
Now lets head to the Azure Portal and create a Blob Storage container in one of the existing Storage account.
Click on Containers option to create a container
We will use this storage account and container for external table creation.
Click on Upload button to upload the csv file to the container.
We have uploaded the employeedetails.csv to the container.
Execute T-SQL Commands
Let us enable Polybase feature and check for its activation using the below script
exec sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;
SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
Now let get started with the series of T-SQL commands to work with the import of csv data to SQL Server.
First step is to create a Master Key, which is a symmetric key used to protect the private keys. Head over to SSMS and run the below script.
CREATE MASTER KEY ENCRYPTION BY PASSWORD='PolyBase@2020';
The next step is to create a database credential, it can be any arbitrary name lets name it ‘DatabaseCredential’.Specify a name for the Identity as well, which can also be an arbitrary name which in our case is ‘Polybase’. Secret is the key to access to the Azure Storage Account, which you can get from the Azure Storage Account overview page.
CREATE DATABASE SCOPED CREDENTIAL DatabaseCredential
WITH IDENTITY = 'Polybase', Secret = '/CzBQDMc25Nl/XMr6uZ7IVqhkFnEO5PEMHW1jvDuGT3jHUDs0RJL1p+IJ4f7PxEVH4IUrvSU74rFdyNHxKoNQw=='
Now lets create the datasource which will connect to the Azure Storage account
CREATE EXTERNAL DATA SOURCE EmployeeInformation
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://polybasecsv@helpdesksentimentanalyse.blob.core.windows.net',
CREDENTIAL = DatabaseCredential
);
We will be able to see the external data source created in Database -> External Resources -> External Data Sources:
We have now created the access connection to external data stored in EmployeeInformation.csv, We also need to specify the format of the external file which we will do with the below script:
The external file format will be created and is visible under Database -> External Data sources ->External File Formats
We will now create an external table so to query the external file, Employeeinforation.csv like a SQL table:
CREATE EXTERNAL TABLE EmployeeInformation
(
EmployeeName varchar(25),
EmployeeNumber varchar(6),
Department varchar(10),
Technology varchar(10)
)
WITH
(
LOCATION = '/',
DATA_SOURCE = EmployeeInformation,
FILE_FORMAT = csvformat
)
Query the External Table
Thus we have created the external table out of the csv file in the Azure Blob Storage, now lets try to query the table just like querying a normal SQL Server table
Select * From EmployeeInformation
Summary
Thus we saw how to setup Polybase for SQL Server and import external data from a CSV file stored in Azure Blob Storage using the T-SQL commands from SQL Server