导出HDInsight数据到MySQL(Azure)
在大数据计算中,经常会有一个场景:在Hadoop分布式存储DFS和关系型数据库(MySQL、Oracle、SQL Server)间导入(导出)数据。由于Azure HDInsight是基于公有云的托管式Hadoop,用户在执行数据转移时,与云上Hadoop相连的关系型数据库往往不是本地的数据库引擎,取而代之的是公有云提供的关系型数据库服务,如SQL Database和MySQL on Azure。针对如何使用HDInsight中的工具Sqoop导入(导出)数据到SQL Database,Azure官方文档给出了参考教程,本文具体实践如何使用Sqoop(HDInsight)导出数据到MySQL on Azure的场景。
关于HDInsight:https://azure.microsoft.com/en-us/documentation/articles/hdinsight-high-availability/,本实践中将会提交任务到HDInsight集群,集群中的Head Node分配任务到Worker Node,Worker Node具体实施数据的读取、处理和转移。
关于MySQL on Azure:https://www.windowsazure.cn/home/features/mysql/
关于Sqoop: https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html
关于HDInsight和Azure Storage:https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/,HDInsight中的数据以冗余的形式存储在Azure Storage中,HDInsight节点可以通过分布式存储路径(WASB)来读写目标数据。同时,用户也可以直接访问Azure Storage对应容器中的HDInsight数据文件。
注:本实践通过Azure PowerShell来实现,主要包括:
- 创建MySQL数据表,用于接收Sqoop导出的数据。
- 上传数据样本文件到HDInsight的存储(Azure Storage)
- 提交数据导出任务(Sqoop Job)到HDInsight集群
- 跟踪Sqoop任务执行情况并检查输出结果。
第一步:准备MySQL数据表
#MySQL variables = "mysqlservices-sha.chinacloudapp.cn"
$MySQLLogin = "login1@jianwmysql" #注意格式: 登录用户@MySQL服务器名
$MySQLPassword = "password"
$MySQLExportedDBName = "testdb01"
$tableName = "jtable3"
$confirmation= Read-Host "To create a new MySQL table, input 'y'; use existing MySQL table, input anything else ..."
if ($confirmation -eq 'y') {
$MySQLAdminUserName =$MySQLLogin;
$MySQLAdminPassword =$MySQLPassword;
$MySQLDatabase =$MySQLExportedDBName;
$MySQLHost =$MySQLServerName;
$ConnectionString ="server=" +$MySQLHost +";port=3306;uid=" + $MySQLAdminUserName+ ";pwd="+ $MySQLAdminPassword + ";database="+$MySQLDatabase
$Query = "CREATE TABLE " + $tableName + "(id VARCHAR(40),name VARCHAR(40))"
Try {
[void][System.Reflection.Assembly]::LoadWithPartialName("MySql.Data")
$Connection = New-Object MySql.Data.MySqlClient.MySqlConnection
$Connection.ConnectionString = $ConnectionString
$Connection.Open()
$Command = New-Object MySql.Data.MySqlClient.MySqlCommand($Query, $Connection)
$DataAdapter = New-Object MySql.Data.MySqlClient.MySqlDataAdapter($Command)
$DataSet = New-Object System.Data.DataSet
$RecordCount = $dataAdapter.Fill($dataSet, "data")
$DataSet.Tables[0]
$Query = "select count(*) from " + $tableName
$Command = New-Object MySql.Data.MySqlClient.MySqlCommand($Query, $Connection)
$DataAdapter = New-Object MySql.Data.MySqlClient.MySqlDataAdapter($Command)
$DataSet = New-Object System.Data.DataSet
$RecordCount = $dataAdapter.Fill($dataSet, "data")
$DataSet.Tables[0].Rows[0].ItemArray[0].ToString();
[System.Console]::WriteLine("MySQL table was created successfully.");
}
Catch {
Write-Host "ERROR : Unable to run query : $query `n$Error[0]"
}
Finally {
$Connection.Close()
}
}
第二步:上传样本数据
Add-AzureAccount -Environment azurechinacloud
# Define the cluter variables
$clusterName = "hadoopjianw3"
$storageAccountName = "jianwstorage"
$containerName = "hadoopjianw3"
#Upload a file into Azure storage (DFS of Hadoop)
$StorageAccountKey = Get-AzureStorageKey -StorageAccountName $storageAccountName
$Ctx = New-AzureStorageContext
$StorageAccountName -StorageAccountKey $StorageAccountKey.Primary
Get-ChildItem -Path d:\log2copy.txt | Set-AzureStorageBlobContent -Container $containerName -Blob tutorials2/log2copy.txt -Context $Ctx
此处用到的样本文件log2copy.txt,含有以下样本数据(分隔符为tab跳格),该样本文件可见附件。
mytest 123
world 456
第三步:提交Sqoop任务到HDInsight
[System.Console]::WriteLine("Start Sqoop-MySQL test ...");
#Add-AzureAccount
Add-AzureAccount -Environment azurechinacloud
# Define the cluter variables
$clusterName = "hadoopjianw3"
$storageAccountName = "jianwstorage"
$containerName = "hadoopjianw3"
# Connection string for Azure MySQL Database.
$conStr = "jdbc:mysql://$MySQLServerName/$MySQLExportedDBName"
#Hive file location
$exportDir = "wasb://$containerName@$storageAccountName.blob.core.chinacloudapi.cn/tutorials2"
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command "export --connect $conStr --username $MySQLLogin --password $MySQLPassword --table $tableName --export-dir $exportDir --fields-terminated-by \t -m 1"
$sqoopJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $sqoopDef -Debug -Verbose
Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600 -Job $sqoopJob
Write-Host "Standard Error" -BackgroundColor Green
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardError
Write-Host "Standard Output" -BackgroundColor Green
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $sqoopJob.JobId -StandardOutput
第四步:跟踪Sqoop任务并检查输出
查看HDInsight的控制台主页
查看Sqoop Job的执行结果:
查看MySQL中的结果数据
调试与问题解决:
如果建立的HDInsight集群采用的是较低版本的Hadoop,Sqoop任务执行过程中可能抛出以下错误,该错误的原因是在HDInsight集群中找不到对应的MySQL Driver,此类问题的手动解决方法是:用户手动copy相应的MySQL Driver(mysql-connector-java-5.1.36-bin.jar)到HDInsight的C:\apps\dist\sqoop-<version>\lib目录下。MySQL Driver可以在https://www.mysql.com/products/connector/下载。
15/09/21 03:35:12 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5.2.2.7.1-0004
15/09/21 03:35:12 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/21 03:35:13 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/21 03:35:13 INFO tool.CodeGenTool: Beginning code generation
15/09/21 03:35:13 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:848)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:736)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:759)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:269)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:240)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:226)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:295)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1773)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1578)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
[DBG]: PS C:\WINDOWS\system32>>
此问题也可以通过HDInsight中的action script来解决,使用脚本的方法安装MySQL Driver到目标目录。
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-script-actions/
本实践的完整PowerShell脚本如附件,Azure用户修改其中的(MySQL、Storage和HDInsight)参数既可应用。
参考文档:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-sqoop/
https://azure.microsoft.com/en-us/documentation/articles/storage-powershell-guide-full/
https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/