Customizing HDInsight Cluster provisioning

In my last blog, I discussed how to specify Hadoop configurations for a job on an HDInsight cluster. At the end of that blog, I also discussed an alternative approach: you may want to change certain Hadoop configurations from their default values and preserve those changes throughout the lifetime of the cluster, perhaps because the configurations worked well for your workload during testing and apply to most of your jobs. You can do this via cluster customization while creating the HDInsight cluster. This approach also fits well with the 'elastic Hadoop in the cloud' scenario, where you create a customized HDInsight cluster with specific configurations, run your workload, and then remove the cluster. While creating my own customized cluster, I realized that our existing documentation did not make it obvious which customization options are available, or how to use them, without digging through the reference documentation. In this blog, I want to share a few examples (a PowerShell script and a .Net SDK example) with various customization options that can be used during HDInsight cluster provisioning.

Can we do it using Azure Portal?

The short answer is yes, but with limitations. As shown in our HDInsight documentation, we can create a customized HDInsight cluster via the Azure Portal, Windows Azure PowerShell, or the HDInsight .Net SDK. While I personally like the Azure Portal most for its simplicity and ease of use, not all customization options are available via the portal as of today. For example, customizing Hadoop configuration files or adding additional libraries or JARs during cluster provisioning, as shown in this CodePlex example, is not possible from the portal. The portal UI also limits the number of additional storage accounts we can specify. Windows Azure PowerShell and the HDInsight .Net SDK don't have such limitations, and with these tools you can use all the available customization options. Another benefit is that you can reuse the PowerShell script or .Net SDK code and make it part of your workflow.

The chart below summarizes a few important customizations that are available via the portal, PowerShell, and the .Net SDK:

Example using Windows Azure PowerShell:

Here is a sample PowerShell script with examples of almost all the possible customization options during provisioning of a cluster. You can omit the customizations that you don't need.

 

Example using HDInsight .Net SDK:

Here is an equivalent cluster customization sample with HDInsight .Net SDK. Like before, omit the customizations you don't need.

 

Can we customize a cluster after provisioning?

We can, but as explained in Dan's blog, any manual modification of the Hadoop configuration files (or any other file) made outside of cluster customization at install time won't be preserved when the Azure VM nodes get updated, so this is neither recommended nor supported. The good news is that you can always customize or configure a job. Here are some of the possible options (the list is not exhaustive):

1. You can specify Hadoop configuration values for a job, as shown in this blog

2. You can use additional Azure Storage accounts (that are not associated with this HDInsight cluster) for a job, as shown in this TechNet article

3. You can upload a custom JAR to Windows Azure Blob Storage and refer to that JAR from a job via the MapReduce -libjars option, the Hive ADD JAR command, or the Pig REGISTER statement.
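
As a rough illustration of options 1 and 3 for a MapReduce job, here is a minimal Python sketch that assembles a `hadoop jar` command line using the generic Hadoop options `-D` (a per-job configuration value) and `-libjars` (additional JARs). The helper name `build_hadoop_jar_command`, the JAR and class names, the storage account and container, and the property value are all hypothetical placeholders and are not taken from the samples in this blog. Note that generic options are only honored when the job's main class parses them, for example via ToolRunner.

```python
from typing import List, Mapping, Optional, Sequence


def build_hadoop_jar_command(
    jar: str,
    main_class: str,
    job_args: Sequence[str],
    conf: Optional[Mapping[str, str]] = None,
    lib_jars: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the argument list for a 'hadoop jar' invocation.

    Generic options (-D, -libjars) are placed after the main class and
    before the job's own arguments.
    """
    command = ["hadoop", "jar", jar, main_class]
    # Each per-job configuration value becomes its own "-D key=value" pair.
    for key, value in (conf or {}).items():
        command += ["-D", f"{key}={value}"]
    # -libjars takes a single comma-separated list of JAR locations.
    if lib_jars:
        command += ["-libjars", ",".join(lib_jars)]
    command += list(job_args)
    return command


# Hypothetical values, for illustration only.
example = build_hadoop_jar_command(
    jar="wordcount.jar",
    main_class="WordCount",
    job_args=["/example/input", "/example/output"],
    conf={"mapreduce.job.reduces": "4"},
    lib_jars=[
        "wasb://mycontainer@myaccount.blob.core.windows.net/jars/custom-lib.jar"
    ],
)
print(" ".join(example))
```

The function returns a list rather than a single string so that it can be passed directly to a process launcher without worrying about quoting. For Hive and Pig jobs, the equivalent step is an ADD JAR or REGISTER line at the top of the script that points at the same blob storage path.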

That's all for today. I hope you find the blog helpful!

@Azim (MSFT)

Comments

  • Anonymous
    April 27, 2014
    Hi! I have the metastore DB not in Azure SQL but in a VM with SQL Server installed. I can't provision a cluster passing parameters to that instance. I need to RDP and manually change the hive-site.xml file. Isn't this supported?
  • Anonymous
    May 02, 2014
    @Tor, for the Hive/Oozie Metastore, Azure SQL DB is the only supported database that we can specify during provisioning. As you mentioned, you can manually change hive-site.xml and point it to a SQL Server on an Azure VM (IaaS). This may work, but it may not be supported, as mentioned in the blog. If you would like to see SQL Server on an Azure VM as a supported database type in HDInsight, please feel free to provide your feedback on feedback.azure.com/.../217335-hdinsight
  • Anonymous
    June 01, 2014
    Hi. While provisioning a cluster using the .NET SDK, can I use HDInsightAccessTokenCredential instead of a certificate? Can you please provide samples where cluster provisioning is done using an Active Directory access token (which I think is the one used in AccessTokenCredential)?
  • Anonymous
    June 02, 2014
    @Nishant, while the HDInsight .Net SDK supports the use of Azure AD for cluster management and provisioning, I am not aware of any available HDInsight .Net SDK sample using Azure AD. We are going to ask our content team to publish a sample.
  • Anonymous
    August 07, 2014
    Hi. Do we have a sample or document about how to use "HDInsightAccessTokenCredential" with Azure AD now?
  • Anonymous
    May 28, 2015
    How do we change the block size through RDP if we have already created a cluster with a few nodes?
  • Anonymous
    May 28, 2015
    How do we change the block size through RDP if we have already created a cluster of a few nodes?