Getting started with the HDInsight PowerShell tools and SDK

Hi, my name is Azim and I work on the Big Data Support Team at Microsoft. If you have read an earlier post by Dharshana, you may have seen how to submit a Hive query using the HDInsight PowerShell tools. In this post, we will cover some basics of the HDInsight PowerShell tools and SDK (aka the HDInsight SDK). Hopefully this will clarify a few things about the HDInsight SDK and get you up and running with it!

Why HDInsight SDK?

If all of us were happy to manage and access Hadoop components by logging on to a cluster node and running jobs manually, we probably wouldn't need an SDK. But that's not the reality! We all love to manage and interact with our services and clusters remotely from our workstations, and we would also like to run jobs or applications programmatically. With the HDInsight SDK, we can access and interact with an HDInsight cluster remotely from our workstations, using tools and technologies we are all familiar with: the .Net Framework and Windows PowerShell. With the SDK, we can provision or manage a cluster or run a job (MapReduce, Hive, Pig etc.) programmatically, which lets us make it part of a rich workflow or other scheduled jobs. Without the SDK, each of us would have to find our own way of coding and scripting against the native REST APIs that Hadoop or HDInsight exposes to get remote, programmatic access. The SDK hides some of those underlying details from end users and makes this easier. Because the HDInsight SDK is based on PowerShell and .Net, we can easily integrate the script or code with existing .Net applications.

What is the HDInsight SDK?

Since we have used a few different names for it, I have seen some confusion around the naming, for example 'Microsoft .Net SDK for Hadoop' vs. 'HDInsight SDK'. Let me make an attempt to clarify it :)

It all started as the hadoopsdk CodePlex project, known as the Microsoft .Net SDK for Hadoop. The Microsoft .Net SDK for Hadoop is open source and has both Incubator and Released components, as described in the roadmap. With the release of Windows Azure HDInsight, we have marked the following SDK components as 'Released':

  • Windows Azure HDInsight PowerShell
  • HDInsight .Net SDK
  • Cross Platform CLI tools (or Node.js CLI tools)

Together, we can think of the above Released components as the 'HDInsight SDK', which aligns with the overall HDInsight umbrella. You may also hear the terms 'HDInsight PowerShell tools and .Net SDK' or 'HDInsight Tools and SDK'. The HDInsight SDK components are now integrated with the Windows Azure tools and SDK. For example, the HDInsight PowerShell tools are integrated with the Windows Azure PowerShell tools, and the HDInsight .Net SDK code lives under the Microsoft.WindowsAzure.Management.HDInsight namespace. The above SDK components are also fully tested, production-ready and supported by Microsoft CSS. Here is a summarized view of the SDK:

 

Where do I get the HDInsight SDK?

HDInsight PowerShell Tools:

Our HDInsight documentation here has detailed steps for installing and configuring the HDInsight PowerShell tools. Here is a quick rundown of the setup and configuration steps for the HDInsight PowerShell tools.

I will cover the install part in this section and the configure part in the next section. Follow these steps to install:

  1. Install the Windows Azure PowerShell tools from here
  2. Install the HDInsight PowerShell tools from here. You may need to restart the machine.

After you install the HDInsight PowerShell tools, open the Windows Azure PowerShell console on the workstation and run the following cmdlet:

Get-Command -Module *HDInsight* | Format-Table -Property Name

The output should be a list of the HDInsight cmdlet names. This is one way to verify that the HDInsight PowerShell tools were installed successfully.

But we are not ready to use the HDInsight PowerShell tools yet! The next step is to prepare your workstation; please review the section 'Preparing your workstation to use the HDInsight SDK'.

 

HDInsight .Net SDK:

The HDInsight .Net SDK uses the NuGet distribution model, which means you need to install the HDInsight .Net SDK NuGet packages in every Visual Studio project where you intend to use the .Net SDK during development. When you are ready to deploy the code to production, you distribute the .Net SDK DLLs with the application binaries.

Here is a quick view of the setup and configuration steps for the HDInsight .Net SDK.

To install the HDInsight .Net SDK, follow these steps:

  1. Create a new Visual Studio project (2013, 2012 or 2010, any edition) or open an existing project.

  2. Go to Tools -> Library Package Manager and select 'Package Manager Console'.

  3. In Visual Studio 2012, the Package Manager Console appears at the bottom of the window. Run the following command there to install the package:

    Install-Package Microsoft.WindowsAzure.Management.HDInsight

     

  4. After the package is installed successfully, you will see a file called "packages.config" added to your Visual Studio project, with an entry for each package added. The project References will be updated as well with the related DLLs.

 

  5. The next step is to prepare your workstation; please review the section 'Preparing your workstation to use the HDInsight SDK'.

 

Cross Platform CLI tools:

Our HDInsight documentation here does a great job of describing the steps for installing and configuring the tools. Please check it out if you need to access HDInsight from a non-Windows platform such as Linux or Mac.

 

Preparing your workstation to use the HDInsight SDK

The HDInsight SDK components (PowerShell tools, .Net SDK or Node.js tools) require your Windows Azure subscription information so that they can manage your services. The HDInsight SDK uses an Azure management certificate to authenticate when accessing subscription resources. There are a few ways to obtain an Azure management certificate:

  1. Create a self-signed certificate following the steps described in the Windows Azure documentation here
  2. Via an Azure PublishSettings file

This blog explains nicely what an Azure PublishSettings file is and how it works, but here are some takeaways:

What is a Windows Azure Management Certificate?

A Windows Azure management certificate is an X.509 v3 certificate used to authenticate an agent, such as Visual Studio Tools for Windows Azure or a client application that uses the Service Management API, acting on behalf of the subscription owner to manage subscription resources. Windows Azure management certificates are uploaded to Windows Azure and stored at the subscription level. The management certificate store can hold up to 100 certificates per subscription. These certificates are used to authenticate your Windows Azure deployment.

What is a PublishSettings file and how does it work?

A publish settings file is an XML file which contains information about your subscription. It contains information about all subscriptions associated with a user's Live Id (i.e. all subscriptions for which a user is either an administrator or a co-administrator). It also contains a management certificate which can be used to authenticate Windows Azure Service Management API requests.

So when we request a publish settings file from Windows Azure, Windows Azure creates a new management certificate and attaches that certificate to all of your subscriptions. The publish settings file contains the raw data of that certificate and all your subscriptions. Any tool that supports this functionality simply parses the XML file, reads the certificate data and installs the certificate in your local certificate store (usually Current User/Personal (or My)). Since the Windows Azure Service Management API uses certificate-based authentication, and the same certificate is present both in the Windows Azure Management Certificates section for your subscription and in your local computer's certificate store, authentication works seamlessly.
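To make the "parse this XML file" step concrete, here is a minimal PowerShell sketch that lists the subscriptions in a publish settings document. The element and attribute names (PublishData, PublishProfile, Subscription, Id, Name, ManagementCertificate) are assumptions based on the commonly seen layout of these files, and the sample content is made up. A real file carries a long base64 certificate blob, and older files may store the certificate on the PublishProfile element instead.

```powershell
# Hypothetical sample content; a real .publishsettings file is downloaded from the portal.
$sample = @'
<PublishData>
  <PublishProfile SchemaVersion="2.0" PublishMethod="AzureServiceManagementAPI">
    <Subscription ServiceManagementUrl="https://management.core.windows.net"
                  Id="00000000-0000-0000-0000-000000000001"
                  Name="Example Subscription A"
                  ManagementCertificate="PLACEHOLDER-BASE64" />
    <Subscription ServiceManagementUrl="https://management.core.windows.net"
                  Id="00000000-0000-0000-0000-000000000002"
                  Name="Example Subscription B"
                  ManagementCertificate="PLACEHOLDER-BASE64" />
  </PublishProfile>
</PublishData>
'@

# Load the XML and collect one object per Subscription element.
$doc = [xml]$sample
$subscriptions = @(
    foreach ($node in $doc.SelectNodes('//Subscription')) {
        [pscustomobject]@{
            Name           = $node.GetAttribute('Name')
            Id             = $node.GetAttribute('Id')
            # Only report whether certificate data is present; never print it.
            HasCertificate = ($node.GetAttribute('ManagementCertificate') -ne '')
        }
    }
)

$subscriptions | Format-Table -AutoSize
```

To try this against a real file, replace the here-string with something like `$doc = [xml](Get-Content 'path\to\your.publishsettings')`.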

 

Getting an Azure management certificate via a PublishSettings file:

On each workstation where you plan to use the HDInsight SDK (PowerShell or .Net SDK), you can use the following steps to obtain an Azure management certificate:

  1. Sign in to the Windows Azure Management Portal using the credentials for your Windows Azure account.

  2. Once you are signed in and the Portal is open, run the following Windows Azure PowerShell cmdlet to get the settings file:

Get-AzurePublishSettingsFile

     The Get-AzurePublishSettingsFile cmdlet opens a web page on the Windows Azure Management Portal from which you can download the subscription information. The information is contained in a .publishsettings file.

  3. Import the Azure settings file so that the Windows Azure cmdlets can use it, by running the cmdlet:

     Import-AzurePublishSettingsFile '<Folder>\YourSubscriptionName-DownloadDate-credentials.publishsettings'

     Here, '<Folder>\YourSubscriptionName-DownloadDate-credentials.publishsettings' is the file you saved in step 2 on your workstation.

     The Import-AzurePublishSettingsFile cmdlet does two things:

     a. It parses the publish settings XML file, reads the certificate data and installs the certificate in your local certificate store (usually Current User/Personal). The certificate has 'Windows Azure Tools' as both 'Issued to' and 'Issued by'.

     b. It creates a file called 'WindowsAzureProfile.xml' under the folder 'C:\Users\userName\AppData\Roaming\Windows Azure PowerShell'. The file contains the subscription name, subscription ID, Azure certificate thumbprint etc.

  4. You are now ready to connect to your subscription and use the HDInsight PowerShell tools and .Net SDK. To view your Windows Azure subscription info, run the following Windows Azure cmdlet:

Get-AzureSubscription
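If you want to double-check the two side effects of Import-AzurePublishSettingsFile described in step 3, something along these lines may help. This is only a sketch: it assumes the certificate subject contains 'Windows Azure Tools' and that the profile file lives under your roaming AppData folder, as described above. Both may differ between tool versions.

```powershell
# Look for the management certificate in the current user's Personal (My) store.
Get-ChildItem Cert:\CurrentUser\My |
    Where-Object { $_.Subject -like '*Windows Azure Tools*' } |
    Format-List Subject, Issuer, Thumbprint, NotAfter

# Check whether the Windows Azure PowerShell profile file exists.
# $env:APPDATA normally resolves to C:\Users\<userName>\AppData\Roaming.
$profilePath = Join-Path $env:APPDATA 'Windows Azure PowerShell\WindowsAzureProfile.xml'
Test-Path $profilePath
```

The thumbprint printed by the first command should match the certificate thumbprint recorded for your subscription.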

 

Running PowerShell script:

You can either run the HDInsight cmdlets directly in the Windows Azure PowerShell console, or save the script as a file with the .ps1 extension and run the script file from the Windows Azure PowerShell console. Before you can run a script, you must set the execution policy to RemoteSigned by running the following command from an elevated (Run as administrator) PowerShell console:

Set-ExecutionPolicy RemoteSigned
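As a small sketch of how this looks in practice, you can inspect the current policy before changing it. If you cannot open an elevated console, setting the policy for the current user only is one alternative. The script name Run-WordCount.ps1 below is hypothetical.

```powershell
# Show the effective execution policy at every scope.
Get-ExecutionPolicy -List

# Alternative to the machine-wide setting: apply RemoteSigned to the current user only.
# This scope does not require an elevated console.
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

# Run a saved script from the current folder (hypothetical file name).
.\Run-WordCount.ps1
```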

 

What can I do with the HDInsight SDK?

Now that you have installed the HDInsight SDK and prepared your workstation to use it, what can you do with it?

The current release of the HDInsight SDK provides a number of important capabilities for working with a Windows Azure HDInsight cluster, including provisioning and managing clusters and submitting jobs.

Our HDInsight documentation here has some great examples of how you can use HDInsight PowerShell or the .Net SDK to provision clusters. More examples of using PowerShell to manage clusters can be found here. For submitting jobs programmatically, you can review the examples here or the samples here.

Here is a simple example of running a word count MapReduce job via the HDInsight PowerShell cmdlets:

# define subscription ID and cluster name
$subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$clusterName = "HDInsightClusterName"  

# define the word count MapReduce Job
$wordCountJobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "/example/jars/hadoop-examples.jar" -ClassName "wordcount"
$wordCountJobDef.Arguments.Add("/example/data/gutenberg/davinci.txt")
$wordCountJobDef.Arguments.Add("/example/output/WordCount")  

# Submit the MapReduce job
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -Subscription $subid -JobDefinition $wordCountJobDef  

# Wait for the job to complete
Wait-AzureHDInsightJob -Subscription $subid -Job $wordCountJob -WaitTimeoutInSeconds 3600  

# Get the job standard error output
Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subid -JobId $wordCountJob.JobId -StandardError  
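A Hive job can be submitted in much the same way. The following is a minimal sketch rather than a tested recipe: it reuses the cmdlets and parameters from the MapReduce example above, and it assumes that New-AzureHDInsightHiveJobDefinition accepts the query through -Query and that Get-AzureHDInsightJobOutput supports a -StandardOutput switch in your version of the tools. The cluster name is a placeholder.

```powershell
# define subscription ID and cluster name (placeholder values, as in the example above)
$subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$clusterName = "HDInsightClusterName"

# define a simple Hive job that lists the tables in the default database
$hiveJobDef = New-AzureHDInsightHiveJobDefinition -Query "SHOW TABLES;"

# Submit the Hive job
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -Subscription $subid -JobDefinition $hiveJobDef

# Wait for the job to complete
Wait-AzureHDInsightJob -Subscription $subid -Job $hiveJob -WaitTimeoutInSeconds 3600

# Get the query results from the job's standard output
Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subid -JobId $hiveJob.JobId -StandardOutput
```

For Hive, the query results are typically written to standard output, while progress and errors go to standard error, which is why this sketch reads -StandardOutput rather than -StandardError.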

 

How do I get help on the HDInsight SDK?

The HDInsight PowerShell tools implement the Get-Help framework of Windows Azure PowerShell, which is a handy way to see which parameters a cmdlet requires or supports.

For example, if you wanted to know the usage of the cmdlet "New-AzureHDInsightCluster", you would type the following in a Windows Azure PowerShell console:

help New-AzureHDInsightCluster    

Sample output –

I would then typically use get-help <cmdlet> -full to see the required parameters.
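A few other Get-Help variations can be useful here. These are standard Get-Help options; how much detail they return depends on the help content shipped with your version of the HDInsight tools.

```powershell
# Full help, including which parameters are required
Get-Help New-AzureHDInsightCluster -Full

# Only the usage examples
Get-Help New-AzureHDInsightCluster -Examples

# Details for every parameter (or pass a single parameter name instead of *)
Get-Help New-AzureHDInsightCluster -Parameter *

# List help topics for all cmdlets with HDInsight in the name
Get-Help *HDInsight*
```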

In addition to our Azure HDInsight documentation and SDK samples (some of the links are mentioned in the previous section), please feel free to review the reference documentation for PowerShell and the .Net SDK, or contact us in CSS.

 

That's it for today. I hope you have enjoyed this post on the HDInsight SDK, and I look forward to your comments or feedback :)

@Azim (MSFT)