Managing Your HDInsight Cluster with PowerShell

An updated version of this post can be found here.

This blog post shows how to manage an HDInsight cluster from a local management console using Windows PowerShell. The goal is to outline how to configure the local management console, create a simple cluster, submit jobs using MRRunner, and finally provide a mechanism for managing an elastic service.

The ultimate goal will be to present a script that demonstrates how you can provide a DLL and have the script bring a cluster online, run your Job, and then remove the cluster, whilst allowing you to specify the cluster name and the number of hosts needed to run the Job.

All the scripts mentioned in this post can be downloaded from here.

Before provisioning a cluster one needs to ensure the Azure subscription has been correctly configured and a local management console has been set up. So let's start there.

Prepare Azure Subscription Storage

Under the subscription in which the HDInsight cluster is going to run, one needs a storage account defined and a management certificate uploaded.

Create a Storage Account

image

Next, under the storage account management create a new default container:

image

At this point the Azure subscription is basically ready. It is assumed that the subscription has the HDInsight preview enabled, see “Cluster Provisioning” below.

Setting Up Your Local Environment

Once the Azure subscription has been configured, the next step is to configure your local management environment.

The first step is to install and configure Windows Azure PowerShell. Windows Azure PowerShell is the scripting environment that allows you to control and automate the deployment and management of your workloads in Windows Azure. This includes provisioning virtual machines, setting up virtual networks and cross-premises networks, and managing cloud services in Windows Azure.

To use the cmdlets in Windows Azure PowerShell you will need to download and import the modules, as well as import and configure information that provides connectivity to Windows Azure through your subscription. For instructions, see Get Started with Windows Azure Cmdlets.

Prepare the Work Environment

The basic process for running the cmdlets by using the Windows Azure PowerShell command shell is as follows:

  • Upload a Management Certificate. For more information about how to create and upload a management certificate, see How to: Manage Management Certificates in Windows Azure and the "Create a Certificate" section below.
  • Install the Windows Azure module for Windows PowerShell. The install can be found on the Windows Azure Downloads page, under Windows.
  • Finally set the Windows PowerShell execution policy. To achieve this, run Windows Azure PowerShell as an Administrator and execute the following command:
    Set-ExecutionPolicy RemoteSigned

The Windows PowerShell execution policy determines the conditions under which configuration files and scripts are run. The Windows Azure PowerShell cmdlets need the execution policy set to a value of RemoteSigned, Unrestricted, or Bypass.
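For example, to check the current policy before changing it:

# Display the current execution policy
Get-ExecutionPolicy
# Allow locally written scripts and signed remote scripts to run
Set-ExecutionPolicy RemoteSigned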

Create a Certificate

From a Visual Studio Command Prompt run the following:

makecert -sky exchange -r -n "CN=<CertificateName>" -pe -a sha1 -len 2048 -ss My "<CertificateName>.cer"

Where <CertificateName> is the name that you want to use for the certificate. It must have a .cer extension. The command loads the private key into your user store, as indicated by the "-ss My" switch. In certmgr.msc it appears under Certificates - Current User\Personal\Certificates.
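Once created, the certificate's thumbprint (needed later when configuring connectivity manually) can be read from the user store, assuming <CertificateName> is the name used above:

# Show the new certificate and its thumbprint from the current user's Personal store
Get-ChildItem cert:\CurrentUser\My | Where-Object { $_.Subject -eq "CN=<CertificateName>" } | Format-List Subject, Thumbprint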

Upload the new certificate to Azure:

image

If one needs to connect from several machines, the same certificate should be installed into the user store on each of those machines as well.
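One way to do this is to export the certificate, with its private key, to a password-protected .pfx file and import that file on the other machine; a minimal sketch using the .Net certificate APIs (the thumbprint, password, and path are placeholders):

# Locate the certificate in the current user store
$certThumbprint = "<Thumbprint>"
$cert = Get-Item cert:\CurrentUser\My\$certThumbprint
# Export it, including the private key, to a password-protected .pfx file
$bytes = $cert.Export([System.Security.Cryptography.X509Certificates.X509ContentType]::Pfx, "<PfxPassword>")
[System.IO.File]::WriteAllBytes("C:\Secure\managementcert.pfx", $bytes)
# On the second machine, import the .pfx into the Personal store (for example via certmgr.msc)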

Configure Connectivity

There are two ways one can configure connectivity between your local workstation and Windows Azure: either manually, by configuring the management certificate and subscription details with the Set-AzureSubscription and Select-AzureSubscription cmdlets, or automatically, by downloading and importing the PublishSettings file from Windows Azure. The settings for Windows Azure PowerShell are stored in: <user>\AppData\Roaming\Windows Azure PowerShell.

The PublishSettings file method works well when you are responsible for a limited number of subscriptions, but note that it adds a certificate to every subscription for which you are an administrator or co-administrator. To use this method:

  • Execute the following command:
    Get-AzurePublishSettingsFile
  • Save the .publishSettings file locally.
  • Import the downloaded .publishSettings file by running the PowerShell command:
    Import-AzurePublishSettingsFile <publishSettings-file>

Security note: This file contains an encoded management certificate that will serve as your credentials to administer all aspects of your subscriptions and related services. Store this file in a secure location, or delete it after you follow these steps.
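Putting these steps together, with an example file location, the import and subsequent cleanup might look like:

# Opens a browser window to download the .publishSettings file
Get-AzurePublishSettingsFile

# Import the downloaded file (example path) and then remove it for security
Import-AzurePublishSettingsFile "C:\Secure\MySubscription.publishsettings"
Remove-Item "C:\Secure\MySubscription.publishsettings"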

In complex or shared development environments, it might be desirable to manually configure the publish settings and subscription information, including any management certificates, for your Windows Azure subscriptions. This is achieved by running the following script (for the management certificate):

$mySubID = "<subscriptionID>"
$certThumbprint = "<Thumbprint>"
$myCert = Get-Item cert:\CurrentUser\My\$certThumbprint
$mySubName = "<SubscriptionName>"
$myStoreAcct = "mydefaultstorageaccount"

Set-AzureSubscription -SubscriptionName $mySubName -Certificate $myCert -SubscriptionID $mySubID

Set-AzureSubscription -DefaultSubscription $mySubName
Set-AzureSubscription -SubscriptionName $mySubName -CurrentStorageAccount $myStoreAcct
Select-AzureSubscription -SubscriptionName $mySubName

To view the details of the Azure Subscription run the command:

Get-AzureSubscription

This script uses the Set-AzureSubscription cmdlet. This cmdlet can be used to configure a default Windows Azure subscription, set a default storage account, or a custom Windows Azure service endpoint.

When managing your cluster, if you have multiple subscriptions it is advisable to set the default subscription.

Install HDInsight cmdlets

Ensure the HDInsight binaries are accessible and loaded. They can be downloaded from:

https://hadoopsdk.codeplex.com/downloads/get/671067

Once downloaded, unzipped into an accessible directory, and unblocked (open the properties of the archive file and click Unblock), they are ready to be loaded into the Azure PowerShell environment:

Import-Module "C:\ local_folder\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"

Cluster Provisioning

There are various ways one can provision a new cluster. If using the Management Portal there is a simple UX process one can follow:

image

However, for an automated process it is recommended that PowerShell be utilized. When running the scripts presented below, one has to remember to load the HDInsight management binaries, as outlined in the "Install HDInsight cmdlets" step.

The following sections present a series of scripts for automating cluster management. Further examples can also be found at Microsoft .Net SDK for Hadoop - PowerShell Cmdlets for Cluster Management.

Provision Base Cluster

Firstly, one needs access to the subscription Id and certificate for the current Azure subscription, as registered in the "Setting Up Your Local Environment" step.

When running these scripts the placeholder variables will first need to be defined.

  1. In this version you need to explicitly specify subscription information for the cmdlets. Once the cmdlets are integrated with the Azure PowerShell tools this step will no longer be necessary.
    $subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
    $cert = Get-AzureSubscription -Current | %{ $_.Certificate }
  2. Get the storage account key so you can use it later.
    $key1 = Get-AzureStorageKey $myStoreAcct | %{ $_.Primary }
  3. Finally, create the new cluster; the cmdlet will return a Cluster object in about 10-15 minutes, once the cluster has finished provisioning:
    Write-Host "Creating '$numbernodes' Node Cluster named: $clustername" -f yellow
    New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -Name $clustername -Location "East US" -DefaultStorageAccountName $blobstorage -DefaultStorageAccountKey $key1 -DefaultStorageContainerName $containerdefault -UserName $username -Password $password -ClusterSizeInNodes $numbernodes

As an example a set of variable definitions would be:

$mySubName = "Windows Azure Subscription Name"
$myStoreAcct = "mystorageaccount"
$blobstorage = "mystorageaccount.blob.core.windows.net"
$containerdefault = "hadooproot"
$clustername = "myhdinsighttest"
$location = "East US"
$numbernodes = 4
$username = "Admin"
$password = "mypassword"

Once the cluster has been created you will see the results displayed on the screen:

Name : myhdinsighttest
ConnectionUrl : https://myhdinsighttest.azurehdinsight.net
State : Running
CreateDate : 07/05/2013 14:16:37
UserName : Admin
Location : East US
ClusterSizeInNodes : 4

You will also be able to see the cluster in the management portal:

image

If the subscription one is managing is not the configured default, before creating the cluster, one should execute the following command:

Set-AzureSubscription -DefaultSubscription $mySubName

To create a script that enables one to pass in parameters, such as the number of required hosts, add the following as the first few lines of the script:

Param($Hosts = 4, $Cluster = "myhdinsighttest")
$numbernodes = $Hosts
$clustername = $Cluster

The script can then be executed as follows (assuming saved as ClusterCreateSimple.ps1):

. "C:\\Scripts\ClusterCreateSimple.ps1" -Hosts 4 –Cluster "myhdinsighttest"

Of course other variables can also be passed in.
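For completeness, a minimal sketch of what ClusterCreateSimple.ps1 could look like, assembled from the snippets above (the storage account, container, and credential values are example placeholders, and the default subscription is assumed to be selected):

Param($Hosts = 4, $Cluster = "myhdinsighttest")
$numbernodes = $Hosts
$clustername = $Cluster

# Example placeholder values; adjust to your own storage account and credentials
$myStoreAcct = "mystorageaccount"
$blobstorage = "mystorageaccount.blob.core.windows.net"
$containerdefault = "hadooproot"
$username = "Admin"
$password = "mypassword"

# Load the HDInsight cmdlets (path as per the "Install HDInsight cmdlets" step)
Import-Module "C:\local_folder\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"

# Subscription and storage key for the current (default) subscription
$subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$cert = Get-AzureSubscription -Current | %{ $_.Certificate }
$key1 = Get-AzureStorageKey $myStoreAcct | %{ $_.Primary }

# Provision the cluster; the cmdlet returns once provisioning has completed
Write-Host "Creating '$numbernodes' Node Cluster named: $clustername" -f yellow
New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -Name $clustername -Location "East US" -DefaultStorageAccountName $blobstorage -DefaultStorageAccountKey $key1 -DefaultStorageContainerName $containerdefault -UserName $username -Password $password -ClusterSizeInNodes $numbernodes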

Currently the regions supporting the creation of an HDInsight cluster are:

  • East US
  • North Europe

More will be added over time.

Manage Your Cluster

One can view the current clusters within the current subscription using the following command:

Get-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert

To view the details of a specific cluster use the following command:

Get-AzureHDInsightCluster $clustername -SubscriptionId $subid -Certificate $cert

To delete the cluster, needed for elastic services, run the command:

Remove-AzureHDInsightCluster $clustername -SubscriptionId $subid -Certificate $cert

A few words need to be said about versioning and cluster names.

When a cluster is provisioned you will always get the latest build; if there are breaking changes between the build you previously ran against and the latest build, currently executing code could break. Also, when you delete a cluster there is no guarantee that the name will still be available when you recreate it. To avoid depending on a specific name being free, cluster names should be treated as parameters rather than hard-coded into your scripts.
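One simple safeguard is to check, before provisioning, whether a cluster with the desired name already exists in the subscription (this does not guarantee the name is globally available, but it catches the common case):

# List the subscription's clusters and look for the desired name
$existing = Get-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert | Where-Object { $_.Name -eq $clustername }
if ($existing) { Write-Host "Cluster '$clustername' already exists" -f yellow }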

Job Submission

In addition to managing the HDInsight cluster the Microsoft .Net SDK for Hadoop also allows one to write MapReduce jobs utilizing the .Net Framework. Under the covers this is employing the Hadoop Streaming interface. The documentation for creating such jobs can be found at:

https://hadoopsdk.codeplex.com/wikipage?title=Map%2fReduce

MRRunner Support

To submit such jobs there is a command-line utility called MRRunner. To use the MRRunner utility you should have an assembly (a .Net DLL or EXE) that defines at least one implementation of HadoopJob<>.

If the Dll contains only one implementation of HadoopJob<>, you can run the job with:

MRRunner -dll MyDll

If the Dll contains multiple implementations of HadoopJob<>, then you need to indicate the one you wish to run:

MRRunner -dll MyDll -class MyClass

To supply options to your job, pass them as trailing arguments on the command-line, after a double-hyphen:

MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2

These additional arguments are provided to your job via a context object that is available to all methods on HadoopJob<>.

Local PowerShell Submission

Of course one can submit jobs directly from a .Net executable and thus from PowerShell scripts.

MapReduceJob is a property on the Hadoop connection object. It provides access to an implementation of the IStreamingJobExecutor interface, which handles the creation and execution of Hadoop Streaming jobs. Depending on how the connection object was created, it can execute jobs locally using the Hadoop command line, or against a remote cluster using the WebHCat client library.

Under the covers the MRRunner utility invokes the LocalStreamingJobExecutor implementation on your behalf, as discussed above.

Alternatively one can invoke the executor directly and request it execute a Hadoop Job:

var hadoop = Hadoop.Connect();
hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);

Utilizing this approach one can thus integrate Job Submission into cluster management scripts:

# Define Base Paths
$BasePath = "C:\Users\Me\Projects"
$dllPath = $BasePath + "\WordCountSampleApplication\bin"
$submitterDll = $dllPath + "\WordCount.dll"
$hadoopcmd = $env:HADOOP_HOME + "/bin/hadoop.cmd"

# Add submission file references
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.mapreduce.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")

# Define the Type for the Job
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($submitterDll)
[Type] $jobType = $submitterAssembly.GetType("WordCountSampleApplication.WordCount", 1)

# Connect and Run the Job
$hadoop = [Microsoft.Hadoop.MapReduce.Hadoop]::Connect()
$job = $hadoop.MapReduceJob
$result = $job.ExecuteJob($jobType)

Write-Host "Job Run Information"
Write-Host "Job Id: " $result.Id
Write-Host "Exit Code: " $result.Info.ExitCode

The only challenge becomes managing the generic objects within PowerShell.

In this case I have used a version of the WordCountSample application that comes with the SDK. I have removed the Driver entry point and compiled the assembly into a DLL rather than an EXE.

Remote PowerShell Submission

Of course the ultimate goal is to submit the job from your management console. The process for this is very similar to the Local submission, with the inclusion of the connection information for the Azure cluster.

Hadoop Job Defined

In this sample the management objects are used, as in the provisioning case, to obtain the subscription information. It is assumed that the Job definition is contained within the code:

# Import the management module
Import-Module "C:\Users\Me\Cmdlets\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"

# Define Local Base Paths
$BasePath = "C:\Users\Me\Projects"
$dllPath = $BasePath + "\WordCountSampleApplication\bin"
$submitterDll = $dllPath + "\WordCount.dll"

# Add submission file references
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.mapreduce.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")

# Define the Type for the Job
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($submitterDll)
[Type] $jobType = $submitterAssembly.GetType("WordCountSampleApplication.WordCount", 1)

# Define the connection properties
[Uri] $cluster = New-Object System.Uri "https://myclustername.azurehdinsight.net:563"
$mySubName = "Windows Azure Subscription Name"
$clusterUsername = "Admin"
$clusterPassword = "mypassword"
$hadoopUsername = "Hadoop"
$storage = "mystorageaccount.blob.core.windows.net"
$containerdefault = "hadooproot"
[Boolean] $createContinerIfNotExist = $True

# Get the storage key
Set-AzureSubscription -DefaultSubscription $mySubName
$key = Get-AzureStorageKey $myStoreAcct | %{ $_.Primary }

# Connect and Run the Job
$hadoop = [Microsoft.Hadoop.MapReduce.Hadoop]::Connect($cluster, $clusterUsername, $hadoopUsername, $clusterPassword, $storage, $key, $containerdefault, $createContinerIfNotExist)
$job = $hadoop.MapReduceJob
$result = $job.ExecuteJob($jobType)

To execute the job one needs to define the type for the HadoopJob, which provides the necessary configuration details.

After running this job you should see, within the management portal, the metrics for the number of mappers and reducers executed being updated:

image

Unfortunately at this point in time you will not see your job within the job history when managing the cluster. For this you will have to use the standard Hadoop UI interfaces; accessible by connecting to the head node of the cluster.

PowerShell Configuration

If one wants to provide the configuration details for the job within the PowerShell environment then the code is slightly different:

Param($Cluster = "myhdinsighttest", [string] $InputPath = $(throw "Input Path Required."), $OutputFolder = $(throw "Output Path Required."))

# Import the management module
Import-Module "C:\Users\Me\Cmdlets\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"

# Define Local Base Paths
$dllPath = "C:\Users\Me\Projects\WordCountSampleApplication\bin"
$submitterDll = $dllPath + "\WordCount.dll"

# Add submission file references
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.mapreduce.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")

# Define the Types for the Job (Mapper, Reducer, Combiner)
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($submitterDll)
[Type] $mapperType = $submitterAssembly.GetType("WordCountSampleApplication.WordCountMapper", 1)
[Type] $reducerType = $submitterAssembly.GetType("WordCountSampleApplication.WordCountReducer", 1)
[Type] $combinerType = $submitterAssembly.GetType("WordCountSampleApplication.WordCountReducer", 1)

# Define the configuration properties
$config = New-Object Microsoft.Hadoop.MapReduce.HadoopJobConfiguration
$config.Verbose = $True
$config.InputPath = $InputPath
$config.OutputFolder = $OutputFolder
$config.AdditionalGenericArguments.Add("-D""mapred.map.tasks=3""")

# Define the connection properties
$clustername = "https://$Cluster.azurehdinsight.net:563"
[Uri] $clusterUri = New-Object System.Uri $clustername
$mySubName = "Windows Azure Subscription Name"
$clusterUsername = "Admin"
$clusterPassword = "mypassword"
$hadoopUsername = "Hadoop"
$storage = "mystorageaccount.blob.core.windows.net"
$containerdefault = "hadooproot"
[Boolean] $createContinerIfNotExist = $True

# Get the storage key
Set-AzureSubscription -DefaultSubscription $mySubName
$key = Get-AzureStorageKey $myStoreAcct | %{ $_.Primary }

# Connect and Run the Job
$hadoop = [Microsoft.Hadoop.MapReduce.Hadoop]::Connect($clusterUri, $clusterUsername, $hadoopUsername, $clusterPassword, $storage, $key, $containerdefault, $createContinerIfNotExist)
$job = $hadoop.MapReduceJob
$result = $job.Execute($mapperType, $reducerType, $combinerType, $config)

In this instance a HadoopJobConfiguration is created, which allows for the definition of input and output paths along with other Job parameters. To execute the Job one has to provide the types for the Mapper, Reducer, and Combiner. If a Reducer or Combiner is not needed then the parameter value of $Null can be used, as shown below.
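For example, to run the same job without a combiner, or as a map-only job:

# Mapper and Reducer only; no Combiner
$result = $job.Execute($mapperType, $reducerType, $Null, $config)
# Map-only job; no Reducer and no Combiner
$result = $job.Execute($mapperType, $Null, $Null, $config)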

One has to remember that, as we are submitting the job to an Azure cluster, the default file system will be ASV (Azure Blob Storage). As such, to run this sample I uploaded some text files to the appropriate Azure storage location:

image

To run this script one just needs to execute the following command (assuming saved as ClusterJobConfigSubmission.ps1):

. "C: \Scripts\ClusterJobConfigSubmission.ps1" –Cluster “clustername” -InputPath "/wordcount/input" -OutputFolder "/wordcount/output"

The output will then show up under the wordcount/output path in the default container.
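If your Azure PowerShell installation includes the storage cmdlets, the output files can also be listed from the management console; a minimal sketch, using the storage account name, key, and default container variables defined earlier:

# Build a storage context for the cluster's default storage account
$ctx = New-AzureStorageContext -StorageAccountName $myStoreAcct -StorageAccountKey $key
# List the blobs written under the job output path
Get-AzureStorageBlob -Container $containerdefault -Context $ctx -Prefix "wordcount/output"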

Providing an Elastic Service

The idea behind an Elastic Service is that the cluster can be brought up, with the necessary number of hosts, only when Job execution is required. Using the scripts already presented, this can be achieved as follows:

Param($Hosts = 4, $Cluster = "clustername", [string] $InputPath = $(throw "Input Path Required."), [string] $OutputFolder = $(throw "Output Path Required."))

# Create the Cluster
. ".\ClusterCreateSimple.ps1" -Hosts $Hosts -Cluster $Cluster

# Execute the Job
. ".\ClusterRemoteJobConfigSubmission.ps1" -Cluster $Cluster -InputPath $InputPath -OutputFolder $OutputFolder

# Delete the Cluster
. ".\ClusterDelete.ps1" -Cluster $Cluster

To execute the script one just has to execute (assuming saved as ClusterElasticJobSubmission.ps1):

. "C:\Scripts\ClusterElasticJobSubmission.ps1" -Hosts 4 -Cluster "clustername" -InputPath "/wordcount/input" -OutputFolder "/wordcount/output"

This will create the cluster, run the required job, and then delete the cluster. During the creation the cluster name and number of hosts is specified, and during the job submission the input and output paths are specified. One could of course customize these scripts to include additional parameters such as number of mappers, additional job arguments, etc.
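The ClusterDelete.ps1 script referenced above is not listed in this post; a minimal sketch, following the same pattern as the other scripts, could be:

Param($Cluster = "myhdinsighttest")

# Load the HDInsight cmdlets and pick up the current subscription details
Import-Module "C:\local_folder\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"
$subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$cert = Get-AzureSubscription -Current | %{ $_.Certificate }

# Delete the named cluster
Write-Host "Deleting Cluster named: $Cluster" -f yellow
Remove-AzureHDInsightCluster $Cluster -SubscriptionId $subid -Certificate $cert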

This process is possible because the storage used for input and output is Azure Blob Storage. As such the cluster is only needed for compute operations and not storage.

Further customization of the script allows for the actual Mapper and Reducers to be specified during the submission. If you have previously configured your connectivity, as outlined in the Configure Connectivity section, with a default subscription:

Set-AzureSubscription -DefaultSubscription $mySubName

Then, when submitting Jobs, you can derive most of the necessary subscription information and have a very general Job submission utility that uses parameters for all job-specific configurations:

Param($Hosts = 4, $Cluster = "myclustername", [string] $StorageContainer = $(throw "Storage Container Required."), [string] $InputPath = $(throw "Input Path Required."), [string] $OutputFolder = $(throw "Output Path Required."), [string] $Dll = $(throw "MapReduce Dll Path Required."), [string] $Mapper = $(throw "Mapper Type Required."), [string] $Reducer = $Null, [string] $Combiner = $Null)

# TODO Set to location of cmdlets
$modulePath = "C:\Users\Me\Cmdlets"

# Define all local variable names for script execution
$submitterDll = $Dll
$dllPath = Split-Path $submitterDll -parent
$dllFile = Split-Path $submitterDll -leaf

# Import the management module and set file references
Import-Module "$modulePath\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll"
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.mapreduce.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")

# Get the subscription information and set variables
$subscriptionInfo = Get-AzureSubscription -Current
$subName = $subscriptionInfo | %{ $_.SubscriptionName }
$subId = $subscriptionInfo | %{ $_.SubscriptionId }
$cert = $subscriptionInfo | %{ $_.Certificate }
$storeAccount = $subscriptionInfo | %{ $_.CurrentStorageAccount }
$key = Get-AzureStorageKey $storeAccount | %{ $_.Primary }
$storageAccountInfo = Get-AzureStorageAccount $storeAccount
$location = $storageAccountInfo | %{ $_.Location }

$clusterUsername = "Admin"
# Ensure System.Web is loaded so the membership password generator is available
Add-Type -AssemblyName System.Web
$clusterPassword = [System.Web.Security.Membership]::GeneratePassword(20, 5)
$hadoopUsername = "Hadoop"
$clusterName = $Cluster
$containerDefault = $StorageContainer
$numberNodes = $Hosts
$clusterHttp = "https://$clusterName.azurehdinsight.net:563"
$blobStorage = "$storeAccount.blob.core.windows.net"

# Create the cluster
Write-Host "Creating '$numbernodes' Node Cluster named: $clusterName" -f yellow
Write-Host "Storage Account '$storeAccount' and Container '$containerDefault'" -f yellow
Write-Host "User '$clusterUsername' Password '$clusterPassword'" -f green
New-AzureHDInsightCluster -SubscriptionId $subId -Certificate $cert -Name $clusterName -Location $location -DefaultStorageAccountName $blobStorage -DefaultStorageAccountKey $key -DefaultStorageContainerName $containerDefault -UserName $clusterUsername -Password $clusterPassword -ClusterSizeInNodes $numberNodes
Write-Host "Created '$numbernodes' Node Cluster: $clusterName" -f yellow

# Define the Types for the Job (Mapper, Reducer, Combiner)
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($submitterDll)
[Type] $mapperType = $submitterAssembly.GetType($Mapper, 1)
[Type] $reducerType = $Null
if ($Reducer) { $reducerType = $submitterAssembly.GetType($Reducer, 1) }
[Type] $combinerType = $Null
if ($Combiner) { $combinerType = $submitterAssembly.GetType($Combiner, 1) }

# Define the configuration properties
$config = New-Object Microsoft.Hadoop.MapReduce.HadoopJobConfiguration
$config.Verbose = $True
$config.InputPath = $InputPath
$config.OutputFolder = $OutputFolder
$config.AdditionalGenericArguments.Add("-D""mapred.map.tasks=3""")

# Define the connection properties
[Boolean] $createContinerIfNotExist = $True
[Uri] $clusterUri = New-Object System.Uri $clusterHttp

# Connect and Run the Job
Write-Host "Executing Job Dll '$dllFile' on Cluster $clusterName" -f yellow

$hadoop = [Microsoft.Hadoop.MapReduce.Hadoop]::Connect($clusterUri, $clusterUsername, $hadoopUsername, $clusterPassword, $blobStorage, $key, $containerDefault, $createContinerIfNotExist)
$job = $hadoop.MapReduceJob
$result = $job.Execute($mapperType, $reducerType, $combinerType, $config)

Write-Host "Job Run Information" -f yellow
Write-Host "Job Id: " $result.Id
Write-Host "Exit Code: " $result.Info.ExitCode

# Finally delete the cluster
Write-Host "Deleting Cluster named: $clusterName" -f yellow
Remove-AzureHDInsightCluster $clusterName -SubscriptionId $subId -Certificate $cert
Write-Host "Cluster $clusterName Deleted" -f yellow

In this instance to run the job one would use the following command (assuming saved as ClusterElasticJobConfigSubmissionAuto.ps1):

. "C:\Users\Me\Scripts\ClusterElasticJobConfigSubmissionAuto.ps1" -Hosts 4 -Cluster "myclustername" -StorageContainer "hadooproot" -InputPath "/wordcount/input" -OutputFolder "/wordcount/output" -Dll "C:\Users\Me\bin\WordCount.dll" -Mapper "WordCountSampleApplication.WordCountMapper" -Reducer "WordCountSampleApplication.WordCountReducer" -Combiner "WordCountSampleApplication.WordCountReducer"

In this instance the only additional piece of information needed, giving you a completely reusable script, is the default storage container for the storage account. The script also generates a password, which is displayed in case the cluster is not successfully deleted after the job execution.

Of course other parameters could be included in this model to make the PowerShell script more applicable to more Job types.

Specifying ASV Paths

As mentioned, the default scheme when submitting MapReduce Jobs is Azure Blob Storage (ASV). This means the default container specified when creating the cluster is used as the root.

In the example above the path “/wordcount/input” actually equates to (assuming a storage account of “dataplatformrdpcarl”):

asv://hadooproot@dataplatformrdpcarl.blob.core.windows.net/wordcount/input

Thus, if one wanted to operate out of a different container, such as "data", fully-qualified paths could be specified:

-InputPath "asv://data@dataplatformrdpcarl.blob.core.windows.net/books"
-OutputFolder "asv://data@dataplatformrdpcarl.blob.core.windows.net/wordcount/output"

This would equate to the “data” container:

image

The containers being referenced just need to be under the same storage account.
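If the additional container does not already exist it can be created up front; for example, assuming the storage cmdlets are available and using the account name and key variables from earlier:

# Create the "data" container under the same storage account
$ctx = New-AzureStorageContext -StorageAccountName $myStoreAcct -StorageAccountKey $key
New-AzureStorageContainer -Name "data" -Context $ctx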

References

Windows Azure PowerShell.

Get Started with Windows Azure Cmdlets.

Microsoft .Net SDK for Hadoop - PowerShell Cmdlets for Cluster Management.

How to: Manage Management Certificates in Windows Azure.

Windows Azure Downloads.

HDInsight cmdlets.

Windows Azure SQL Database Management with PowerShell.

Microsoft .Net SDK for Hadoop - Map/Reduce.

Managing HDInsight with PowerShell Download