Use Apache Sqoop to import and export data between Apache Hadoop on HDInsight and Azure SQL Database
Learn how to use Apache Sqoop to import and export between an Apache Hadoop cluster in Azure HDInsight and Azure SQL Database or Microsoft SQL Server. The steps in this document use the sqoop
command directly from the headnode of the Hadoop cluster. You use SSH to connect to the head node and run the commands in this document. This article is a continuation of Use Apache Sqoop with Hadoop in HDInsight.
Prerequisites
Completion of Set up test environment from Use Apache Sqoop with Hadoop in HDInsight.
An SSH client. For more information, see Connect to HDInsight (Apache Hadoop) using SSH.
Familiarity with Sqoop. For more information, see Sqoop User Guide.
Set up
Use ssh command to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
For ease of use, set variables. Replace
PASSWORD
,MYSQLSERVER
, andMYDATABASE
with the relevant values, and then enter the commands below:export PASSWORD='PASSWORD' export SQL_SERVER="MYSQLSERVER" export DATABASE="MYDATABASE" export SERVER_CONNECT="jdbc:sqlserver://$SQL_SERVER.database.windows.net:1433;user=sqluser;password=$PASSWORD" export SERVER_DB_CONNECT="jdbc:sqlserver://$SQL_SERVER.database.windows.net:1433;user=sqluser;password=$PASSWORD;database=$DATABASE"
Sqoop export
From Hive to SQL.
To verify that Sqoop can see your database, enter the command below in your open SSH connection. This command returns a list of databases.
sqoop list-databases --connect $SERVER_CONNECT
Enter the following command to see a list of tables for the specified database:
sqoop list-tables --connect $SERVER_DB_CONNECT
To export data from the Hive
hivesampletable
table to themobiledata
table in your database, enter the command below in your open SSH connection:sqoop export --connect $SERVER_DB_CONNECT \ -table mobiledata \ --hcatalog-table hivesampletable
To verify that data was exported, use the following queries from your SSH connection to view the exported data:
sqoop eval --connect $SERVER_DB_CONNECT \ --query "SELECT COUNT(*) from dbo.mobiledata WITH (NOLOCK)" sqoop eval --connect $SERVER_DB_CONNECT \ --query "SELECT TOP(10) * from dbo.mobiledata WITH (NOLOCK)"
Sqoop import
From SQL to Azure storage.
Enter the command below in your open SSH connection to import data from the
mobiledata
table in SQL, to thewasbs:///tutorials/usesqoop/importeddata
directory on HDInsight. The fields in the data are separated by a tab character, and the lines are terminated by a new-line character.sqoop import --connect $SERVER_DB_CONNECT \ --table mobiledata \ --target-dir 'wasb:///tutorials/usesqoop/importeddata' \ --fields-terminated-by '\t' \ --lines-terminated-by '\n' -m 1
Alternatively, you can also specify a Hive table:
sqoop import --connect $SERVER_DB_CONNECT \ --table mobiledata \ --target-dir 'wasb:///tutorials/usesqoop/importeddata2' \ --fields-terminated-by '\t' \ --lines-terminated-by '\n' \ --create-hive-table \ --hive-table mobiledata_imported2 \ --hive-import -m 1
Once the import has completed, enter the following command in your open SSH connection to list out the data in the new directory:
hadoop fs -tail /tutorials/usesqoop/importeddata/part-m-00000
Use beeline to verify that the table has been created in Hive.
Connect
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'
Execute each query below one at a time and review the output:
show tables; describe mobiledata_imported2; SELECT COUNT(*) FROM mobiledata_imported2; SELECT * FROM mobiledata_imported2 LIMIT 10;
Exit beeline with
!exit
.
Limitations
Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to SQL doesn't support bulk inserts.
Batching - With Linux-based HDInsight, When using the
-batch
switch when performing inserts, Sqoop makes multiple inserts instead of batching the insert operations.
Important considerations
Both HDInsight and SQL Server must be on the same Azure Virtual Network.
For an example, see How to connect HDInsight to your on-premises network document.
For more information on using HDInsight with an Azure Virtual Network, see how to extend HDInsight with Azure Virtual Network document. For more information on Azure Virtual Network, see the Virtual Network Overview document.
SQL Server must be configured to allow SQL authentication. For more information, see the Choose an Authentication Mode document.
You may have to configure SQL Server to accept remote connections.
Next steps
Now you've learned how to use Sqoop. To learn more, see:
- Use Apache Oozie with HDInsight: Use Sqoop action in an Oozie workflow.
- Analyze flight delay data using HDInsight: Use Interactive Query to analyze flight delay data, and then use Sqoop to export data to a database in Azure.
- Upload data to HDInsight: Find other methods for uploading data to HDInsight/Azure Blob storage.