Running Pig and Hive jobs with Windows PowerShell

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

You can submit Pig and Hive jobs in a Windows PowerShell script by using the New-AzureHDInsightPigJobDefinition and New-AzureHDInsightHiveJobDefinition cmdlets.

After defining the job, you can initiate it with the Start-AzureHDInsightJob cmdlet, wait for it to complete with the Wait-AzureHDInsightJob cmdlet, and retrieve the completion status with the Get-AzureHDInsightJobOutput cmdlet.

The following code example shows a PowerShell script that executes a Hive job based on HiveQL code that is hard-coded in the script. The Query parameter specifies the HiveQL code to be executed, and in this example part of that code (the table location) is generated dynamically from a PowerShell variable.

$clusterName = "cluster-name"
$tableFolder = "/data/mytable"

$hiveQL = "CREATE TABLE mytable"
$hiveQL += " (id INT, val STRING)"
$hiveQL += " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
$hiveQL += " STORED AS TEXTFILE LOCATION '$tableFolder';"

$jobDef = New-AzureHDInsightHiveJobDefinition -Query $hiveQL

$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

Write-Host "HiveQL job submitted..."

Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
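When a query grows beyond a few lines, repeated string concatenation can become hard to read. The following minimal sketch builds the same statement with an expandable PowerShell here-string. It uses the same $tableFolder value as the example above and only constructs and displays the string; it assumes that Hive treats the line breaks as ordinary whitespace, and job submission would be unchanged.

```powershell
# Same table definition as in the example above, built with an expandable
# here-string instead of repeated += concatenation.
$tableFolder = "/data/mytable"

# Variables such as $tableFolder are expanded inside @" ... "@.
# The closing "@ must start at the beginning of its own line.
$hiveQL = @"
CREATE TABLE mytable
 (id INT, val STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE LOCATION '$tableFolder';
"@

# Display the generated HiveQL before passing it to the -Query parameter.
Write-Host $hiveQL
```

The resulting $hiveQL value can be passed to New-AzureHDInsightHiveJobDefinition -Query in the same way as the concatenated string.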

As an alternative to hard-coding HiveQL or Pig Latin code in a PowerShell script, you can use the File parameter to reference a file in Azure storage that contains the Pig Latin or HiveQL code to be executed. In the following code example a PowerShell script uploads a Pig Latin code file that is stored locally in the same folder as the PowerShell script, and then uses it to execute a Pig job.

$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"

# Find the folder where this PowerShell script is saved
$localfolder = Split-Path -parent $MyInvocation.MyCommand.Definition

$destfolder = "data/scripts"
$scriptFile = "ProcessData.pig"

# Upload Pig Latin script to Azure Storage 
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$scriptFile"
$filename = "$localfolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
Write-Host "$scriptFile uploaded to $containerName!"

# Run the Pig Latin script
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasb:///$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."

Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
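The scripts above display the standard error output of a job but do not test whether the job succeeded. One possible approach, sketched below, is to inspect the object returned by Wait-AzureHDInsightJob. The sketch assumes that this object exposes State and ExitCode properties. Test-HDInsightJobSucceeded is a hypothetical helper name, not a cmdlet in the Azure module, and a stand-in object is used so that the logic can be run without a cluster.

```powershell
# Hypothetical helper: returns $true when a job object reports success.
# Assumes the object has State and ExitCode properties, as the job object
# returned by Wait-AzureHDInsightJob is expected to have.
function Test-HDInsightJobSucceeded {
    param(
        [Parameter(Mandatory = $true)]
        $Job
    )
    $succeeded = ($Job.State -eq "Completed") -and ($Job.ExitCode -eq 0)
    return $succeeded
}

# Stand-in object for illustration only. In a real script this would be:
#   $result = Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
$result = [PSCustomObject]@{ State = "Completed"; ExitCode = 0 }

if (Test-HDInsightJobSucceeded -Job $result) {
    Write-Host "Job completed successfully."
} else {
    Write-Host "Job failed (state: $($result.State), exit code: $($result.ExitCode))."
}
```

Checking both values is a deliberate choice: a job can reach a final state without having succeeded, so the state alone may not be enough to decide whether later steps in the script should run.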

In addition to the New-AzureHDInsightHiveJobDefinition cmdlet, you can execute HiveQL commands by using the Invoke-AzureHDInsightHiveJob cmdlet (which can be abbreviated to Invoke-Hive). Generally, when the purpose of the script is simply to retrieve and display the results of a Hive SELECT query, the Invoke-Hive cmdlet is the preferred option because it requires significantly less code. For more details about using Invoke-Hive, see Querying Hive tables with Windows PowerShell.
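As a brief illustration of that shorter form, the following sketch selects the target cluster with Use-AzureHDInsightCluster and then runs a query with Invoke-Hive. The table name mytable and the LIMIT value are placeholders chosen to match the earlier example, and the sketch assumes that an Azure subscription has already been selected in the session. The call is guarded so that the script does nothing when the Azure PowerShell module that provides these cmdlets is not loaded.

```powershell
$clusterName = "cluster-name"

# Placeholder query; mytable matches the table created in the earlier example.
$hiveQL = "SELECT id, val FROM mytable LIMIT 10;"

# Run only if the Azure module that provides these cmdlets is available.
if (Get-Command Invoke-AzureHDInsightHiveJob -ErrorAction SilentlyContinue) {
    # Select the cluster that subsequent Invoke-Hive calls will target.
    Use-AzureHDInsightCluster $clusterName

    # Submit the query, wait for it to finish, and write out the results.
    Invoke-Hive -Query $hiveQL
} else {
    Write-Host "Azure HDInsight cmdlets not found; query not submitted."
}
```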
