Pushing Hadoop Cluster Configuration Changes using PowerShell

In my previous post I talked about Implementing and Deploying Rack Awareness using PowerShell. PowerShell, however, is a great tool not only for managing things like Rack Awareness but also for installing and managing the Hadoop cluster as a whole, and especially for managing configuration changes, which is the focus of this post.

All the files relating to this post can be found here, along with the script for rack awareness.

So what would be involved in pushing out a set of configuration changes to all nodes in the cluster?

If you consider each node in the cluster you would need:

  • A script that modifies a property name/value pair for a given configuration file on the cluster node itself
  • A script, along with the corresponding configuration specifications, that makes all the required configuration changes on the cluster node

So let's tackle these first.

As a recap, within each configuration file (source) the property names and values have the following format:

<property>
  <name>topology.script.file.name</name>
  <value>hadoop-rack-configuration.cmd</value>
</property>

Thus, the starting point for managing configuration changes would be a set of configuration changes that one would like to make to the cluster nodes. In this implementation the configuration file would look like the following:

 configure-cluster-properties.txt
 # YARN Configuration
yarn-site.xml, yarn.nodemanager.resource.memory-mb, 8192
yarn-site.xml, yarn.scheduler.minimum-allocation-mb, 2048
 
# MR Configurations
mapred-site.xml, mapreduce.map.memory.mb, 2048
mapred-site.xml, mapreduce.reduce.memory.mb, 4096
mapred-site.xml, mapreduce.map.java.opts, -Xmx1536m
mapred-site.xml, mapreduce.reduce.java.opts, -Xmx3072m

This is basically a list of the configuration source, property name, and property value, comma-separated, one line for each change to be made on each node. Blank lines and lines starting with # are ignored.
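
As a quick illustration of how a file in this format can be read, here is a minimal sketch that turns a few sample lines into objects. The variable names and the property names Source, Name and Value are my own choices for the illustration, and the sketch assumes that values never contain a comma themselves.

```powershell
# Sample lines in the same format as configure-cluster-properties.txt
$lines = @(
    '# YARN Configuration',
    'yarn-site.xml, yarn.nodemanager.resource.memory-mb, 8192',
    '',
    'mapred-site.xml, mapreduce.map.java.opts, -Xmx1536m'
)

$changes = foreach ($line in $lines) {
    $line = $line.Trim()
    # skip blank lines and comment lines
    if (-not $line -or $line.StartsWith('#')) { continue }

    # split on commas and trim the whitespace around each field
    $fields = @($line -split ',' | ForEach-Object { $_.Trim() })
    if ($fields.Count -eq 3) {
        [pscustomobject]@{ Source = $fields[0]; Name = $fields[1]; Value = $fields[2] }
    }
}

$changes | Format-Table -AutoSize
```

Lines that do not split into exactly three fields are silently dropped here; a stricter version might want to report them instead.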

Taking these three values as inputs, the PowerShell script that makes a single configuration change would be as follows:

 configure-cluster-property.ps1
 param([string] $config_file_name, [string] $config_property_name, [string] $config_property_value)
 
if(Test-Path -Path $config_file_name -PathType Leaf) {
    $conf_doc = [System.Xml.XmlDocument](Get-Content $config_file_name)
 
    $property_node = ($conf_doc.DocumentElement.property | Where-Object {$_.name -eq $config_property_name})
    If ($property_node) {
        # Element found so ensure the property is correctly set
        write-host "$config_property_name Element Found, so updating value to $config_property_value..."
        $property_node.Value = $config_property_value
    } else {
        # No Element found so add a new one to the document
        write-host "$config_property_name Element Not Present, adding new element..."
        # AppendChild returns the appended node, so cast to [void] to keep it out of the output
        $property_element = $conf_doc.CreateElement("property")
 
        $property_element_name = $conf_doc.CreateElement("name")
        [void] $property_element_name.AppendChild($conf_doc.CreateTextNode($config_property_name))
        [void] $property_element.AppendChild($property_element_name)
 
        $property_element_value = $conf_doc.CreateElement("value")
        [void] $property_element_value.AppendChild($conf_doc.CreateTextNode($config_property_value))
        [void] $property_element.AppendChild($property_element_value)
 
        [void] $conf_doc.DocumentElement.AppendChild($property_element)
    }
 
    $conf_doc.Save($config_file_name)
} else {
    Write-Error "Configuration File $config_file_name cannot be found"
}

This script opens the configuration document as an XmlDocument and then locates the property Element, where the name Element matches the one requiring modification. If the corresponding property Element is found then its value is changed. If the Element is not found then the property Element is added to the document.
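
To try the update-or-add logic without touching a real configuration file, the same idea can be sketched against an in-memory document. This is only an illustration: the function name Set-ConfigProperty and the sample XML are made up for the example, and it uses an XPath lookup through SelectSingleNode rather than the Where-Object filter used in the script above. It also assumes the property name contains no single quote and that an existing property always has a value element.

```powershell
# In-memory stand-in for a Hadoop *-site.xml file (illustrative content only)
[xml] $doc = @'
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
</configuration>
'@

function Set-ConfigProperty([xml] $Document, [string] $Name, [string] $Value) {
    # find the <property> whose <name> child matches the requested name
    $node = $Document.DocumentElement.SelectSingleNode("property[name='$Name']")
    if ($node) {
        # property exists, so just overwrite its value
        $node.SelectSingleNode('value').InnerText = $Value
    } else {
        # property is missing, so build <property><name/><value/></property>
        $property = $Document.CreateElement('property')
        foreach ($pair in @(@('name', $Name), @('value', $Value))) {
            $child = $Document.CreateElement($pair[0])
            $child.InnerText = $pair[1]
            [void] $property.AppendChild($child)
        }
        [void] $Document.DocumentElement.AppendChild($property)
    }
}

Set-ConfigProperty $doc 'mapreduce.map.memory.mb' '2048'      # existing property: updated
Set-ConfigProperty $doc 'mapreduce.reduce.memory.mb' '4096'   # missing property: appended
$doc.OuterXml
```

Because the document never leaves memory, nothing is saved; the real script finishes by calling Save on the document to write the change back to the configuration file.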

To use this script, the input file of configuration changes just needs to be parsed, and for each configuration line a call is made to the property modification script:

 configure-cluster-properties.ps1
 param([string] $working_path, [string] $configuration_file)
 
$hadoop_config_dir = $env:HADOOP_CONF_DIR
$configurationFile = "$working_path\$configuration_file";
 
function Get-PropertiesTable {
 
    $properties = @(); 
 
    if (Test-Path -Path $configurationFile -PathType Leaf) {    
        $propertylines = Get-Content $configurationFile
        foreach ($propertyline in $propertylines)
        {
            $propertyline = $propertyline.Trim()
            if (($propertyline) -and (-not $propertyline.StartsWith("#")))
            {
                $propertyline_values = ,@($propertyline -Split ",|\t") | % {$_.Trim()}
                if ($propertyline_values.length -eq 3) {
                    $hdp_file = $propertyline_values[0]
                    $properties += New-Object PSObject -Property @{ConfigFile="$hadoop_config_dir\$hdp_file"; PropertyName=$propertyline_values[1]; PropertyValue=$propertyline_values[2]}
                }
            }
        }            
    } else {
        Write-Error "Configuration File $configurationFile cannot be found..."
    }
 
    return $properties;
}
 
Get-PropertiesTable | % {. $working_path\configure-cluster-property.ps1 -config_file_name $_.ConfigFile -config_property_name $_.PropertyName -config_property_value $_.PropertyValue }

Calling this script on a local node is then easily achieved using a command file:

 set working_path=%~dp0
if %working_path:~-1%==\ set working_path=%working_path:~0,-1%
 
set script_path=%working_path%\configure-cluster-properties.ps1
set configuration_file=configure-cluster-properties.txt
 
PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& '%script_path%' -working_path '%working_path%' -configuration_file '%configuration_file%'"
 
pause

However, having to do this by hand on every machine is not feasible, especially if your cluster consists of hundreds of nodes.

Once again PowerShell provides an easy solution. A script can push these configuration files to all the cluster nodes and then execute the configuration change script remotely on each of them:

 push_configure_cluster_properties.ps1
 param( [string] $source_path, [string] $target_path, [string] $files_list, [string] $nodes_list, [int] $batch_size, [string] $cmd_word ) # followed by $cmd_args
 
$cmd_args = $args  # capture this right away so it can be used in functions
 
function ListSplitUnique($values)
{
    return ,@((($values -Split ",") | % {$_.Trim()} | where-object {$_ -ne ""} ) | Sort-Object -Unique)
}
 
function GetRequiredNodes($nodelistpath)
{
    $required_nodes = @()
    write-host "$nodelistpath"
    if ($nodelistpath -notlike "skip") {
        if (Test-Path -Path $nodelistpath -PathType Leaf){
            # // File exists
            (get-content $nodelistpath) | foreach-object {$required_nodes += ListSplitUnique($_)}
        } else {
            # // File does not exist
            $required_nodes = ListSplitUnique($nodelistpath)
        }
    }
 
    return $required_nodes;
}
 
function GetIPAddress($hostname)
{
    try
    {
        [System.Net.Dns]::GetHostAddresses($hostname) | ForEach-Object { if ($_.AddressFamily -eq "InterNetwork") { $_.IPAddressToString } }
    }
    catch
    {
        throw "Error resolving IPAddress for host '$hostname'"
    }
}
 
function IsSameHost(
    [string] [parameter( Position=0, Mandatory=$true )] $host1,
    [array] [parameter( Position=1, Mandatory=$false )] $host2ips = ((GetIPAddress $env:COMPUTERNAME) -as [array]))
{
    $host1ips = ((GetIPAddress $host1) -as [array])
    $heq = Compare-Object $host1ips $host2ips -ExcludeDifferent -IncludeEqual
    return ($null -ne $heq)
}
 
#convert $files_list to an array
$all_files = @()
if ($files_list -ne $null) {
    $all_files = ListSplitUnique($files_list)
}
 
#convert $nodes_list to an array
$required_nodes = GetRequiredNodes($nodes_list);
 
# define some global values
$all_nodes = @()
$nl = [Environment]::NewLine
$current_host = gc env:computername
$ips = ((GetIPAddress $env:COMPUTERNAME) -as [array])
 
 
function Check-Files 
{    
    # check for missing arguments
    if (($all_files.Count -lt 2) -or ([string]::IsNullOrEmpty($cmd_word))) {
        write-error "Usage: push_configure_cluster_properties.ps1 -source_path <source_path> -target_path <target_path> -files_list <files_list> -nodes_list <nodes_list> -batch_size <batch_size> -cmd_word <cmd_word> <cmd_args...>
Files_list is a string containing a comma-delimited list of simple
file names, to be found in source_path and copied to target_path on
each node.  Cmd_word is the PowerShell script file to run on each node
and cmd_args are the arguments passed to it."
        Exit -1;
    }
    
    # validate that $target_path is an absolute path including volume name
    $path_pieces = $target_path.Split(":")
    if ($path_pieces.Length -lt 2) {
        write-error "$target_path path does not include volume name with colon.  Exiting."
        Exit -1;
    }
    if ($path_pieces.Length -gt 2) {
        write-error "$target_path path has multiple colons.  Exiting."
        Exit -1;
    }
    
    # validate existence and accessibility of files in the source_path
    if (! (Test-Path $source_path)) {
        write-error "$source_path is not accessible from install master server.  Exiting."
        Exit -1;
    }
    cd "$source_path"
    write-output "$($nl)Push install files: "
    ls $all_files
    if (! $?) {
        write-error "Some requested files are missing from $source_path.  Exiting."
        Exit -1
    }
    write-output "$nl" 
}
 
function Summarize-Results($results_set)
{
    $remoteErrors = @()
    $remoteSuccessCount = 0
    $remoteFailureCount = 0
    $remoteOutputResult = 0
    
    foreach ( $result in $results_set ) {
        if ( $result -is [System.Management.Automation.ErrorRecord] ) {
            # keep the error record so the failures can be inspected later
            $remoteErrors += $result
            $remoteFailureCount++
        }
        ElseIf ($result -eq "Done.") {
            $remoteSuccessCount++ 
        } Else {
            $remoteOutputResult++
        }
    }
    
    return New-Object psobject -Property @{
        remoteErrors = $remoteErrors;
        remoteSuccessCount = $remoteSuccessCount;
        remoteFailureCount = $remoteFailureCount;
        remoteOutputResult = $remoteOutputResult
    }
}
 
function Report-Results($summary_results) {
    # Report summarized results to user
    # $summary_results must be the output of Summarize-Results()    
    write-output "Summary:"
    write-output ("" + $summary_results.remoteSuccessCount + " nodes successfully completed")
    
    if ($summary_results.remoteFailureCount -gt 0) {
        write-output ("" + $summary_results.remoteFailureCount + " failure messages.")
    }
    
    if ($summary_results.remoteOutputResult -gt 0) {
        write-output ("" + $summary_results.remoteOutputResult + " Output Lines.")
    }
}
 
function CopyInstall-Files($node)
{
    if (-not ((($node -ieq $current_host) -or (IsSameHost $node $ips)) -and ($source_path -ieq $target_path)))  {
        # convert $target_path into the corresponding admin share path, so we can push the files to the node
        $tgtDir = '\\' + ($node) + '\' +  $target_path.ToString().Replace(':', '$')
 
        # attempt to create the target install directory on the remote node if it doesn't already exist
        if(! (Test-Path "$tgtDir")) { $r = mkdir "$tgtDir" }
 
        # validate that the $tgtDir admin share exists and is accessible on the remote node
        if (! (Test-Path "$tgtDir") ) {
            write-error "$target_path on $node is not accessible by admin share.  Skipping."
            return $false
        }
 
        # push the files to each node.  Skip node if any errors.
        cd "$source_path"
        try {
            # -ErrorAction Stop turns a failed copy into a terminating error handled by the catch block
            cp $all_files "$tgtDir" -Force -ErrorAction Stop
        } catch {
            write-error "Some files could not be pushed to $node.  Skipping."
            return $false
        }
        
        return $true
    }
    else {
        return $true
    }
}
 
function PushInstall-Files($nodes)
{
    $copied_nodes = @()
    
    foreach ($node in $nodes) {
        # tag output with node name:
        write-output ("Copying files for node $node $nl")    
        
        # Copy the install files and build the list of nodes that need an install
        if (CopyInstall-Files($node)) {
            if (($node -ieq $current_host) -or (IsSameHost $node $ips)){
                write-output "Skipping $node because it is current node $nl"
            }        
            else {
                write-output ("Copied files to node $node $nl")    
                $copied_nodes += $node
            }
        }
    }
    
    if ($copied_nodes.length -gt 0) {
        # invoke the install, and wait for it to complete
        write-output ("Pushing to nodes: " + ($copied_nodes -join " ") + "$nl")
        write-output "With command : $cmd_word $cmd_args"
 
        $arg_list = @("-file", "$cmd_word") + $cmd_args
        Invoke-Command -ComputerName $copied_nodes -ScriptBlock {
            # everything in this scriptblock runs local to the node    
            $node = $env:COMPUTERNAME
            
            # launch the ps1 file
            $proc_record = Start-Process powershell.exe -ArgumentList $using:arg_list -PassThru -Wait -Verb "RunAs"
            if (! $?) {
                write-error "Start-Process call failed for $node.  Error $($error[0]).  Skipping."
                return
            }
            if ($proc_record.ExitCode -ne 0) {
                write-error "$using:cmd_word failed for $node.  Error code $($proc_record.ExitCode).  Skipping."
                write-error "For error code meanings, see https://msdn.microsoft.com/en-us/library/windows/desktop/aa390890(v=vs.85).aspx"
                return
            }
            
            write-output "Installed files to node $node $nl"
            write-output "Done."  #getting to this line indicates success
 
            } 2>&1
    }    
    write-output "$nl"
}
 
function Install-HDPDependencies( )
{
    # validate the input files
    Check-Files
    
    # Do the install.
    $all_results = @()
    write-output ("Nodes to install (parallel): " + ($required_nodes -join " ") + "$nl")
    
    # bucket the nodes for parallel installation
    $required_buckets = @()
    if ($batch_size -le 1) { $batch_size = 1 }
    if ($required_nodes.length -le $batch_size) {
        $required_buckets= ,@($required_nodes)
    } else {
        $required_buckets= for($i=0; $i -lt $required_nodes.length; $i+=$batch_size) { , $required_nodes[$i..($i+$batch_size-1)]}
    }
    
    foreach ($required_bucket in $required_buckets) {
        write-output ("Pushing to nodes: " + ($required_bucket -join " ") + "...")
        $push_results = PushInstall-Files($required_bucket) 2>&1
        if ($push_results) {
            write-output "output: $push_results"
            $all_results += $push_results
            $push_results
        }
    }                
 
    write-output "$nl$nl"
    
    # parse $all_results for failure messages and alert user
    $results = Summarize-Results $all_results
    Report-Results $results
}    
 
# Invoke the functionality of the script
Install-HDPDependencies

I will leave it up to the reader to look over what this script is doing. In short, this script takes a series of files and deploys them to all nodes in the cluster. It then remotely executes a specified PowerShell script file on each node. This script not only forms the basis for deploying the configuration changes but can also be used for deploying dependency software to cluster nodes and for deploying rack awareness.
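
One detail that is easy to miss when reading the script is how the files reach each node: the local target path is rewritten as an administrative share path on the remote machine, which is also why the script insists on a target path with exactly one colon. The conversion can be sketched on its own as follows; the function name ConvertTo-AdminSharePath is made up for this illustration.

```powershell
function ConvertTo-AdminSharePath([string] $Node, [string] $LocalPath) {
    # 'E:\dir' on node N becomes '\\N\E$\dir'
    return '\\' + $Node + '\' + $LocalPath.Replace(':', '$')
}

$share = ConvertTo-AdminSharePath 'HDP-DATA01.ACME.LAB' 'E:\hdp-installassets\configuration'
$share
```

This only works if the account running the script is allowed to use the administrative shares on the nodes, which is what the Test-Path check in CopyInstall-Files is verifying.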

Calling this script is a bit more involved but can be managed through the following command file:

 set working_path=%~dp0
if %working_path:~-1%==\ set working_path=%working_path:~0,-1%
 
set script_path=%working_path%\push_configure_cluster_properties.ps1
set source_path=%working_path%
 
set target_root=E:\hdp-installassets
set target_path=%target_root%\configuration
 
set files_list=configure-cluster-property.ps1,configure-cluster-properties.ps1,configure-cluster-properties.cmd,configure-cluster-properties.txt
set nodes_list=%working_path%\clusternodes.txt
 
set configuration_file=configure-cluster-properties.txt
 
set cmd_word=%target_path%\configure-cluster-properties.ps1
set cmd_args=-working_path '%target_path%' -configuration_file '%configuration_file%'
 
PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& '%script_path%' -source_path '%source_path%' -target_path '%target_path%' -files_list '%files_list%' -nodes_list '%nodes_list%' -batch_size 4 -cmd_word '%cmd_word%' %cmd_args%"
 
pause

The items one will need to specify are the remote location where the configuration files are to be deployed, and the list of nodes forming the cluster. The list of cluster nodes can be a file of the names, such as the one below, or just a simple comma-separated list:

 HDP-NN1.ACME.LAB,HDP-NN2.ACME.LAB,HDP-MGMT1.ACME.LAB
HDP-DATA01.ACME.LAB,HDP-DATA02.ACME.LAB,HDP-DATA03.ACME.LAB,HDP-DATA04.ACME.LAB,HDP-DATA05.ACME.LAB
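
Either form ends up as a single de-duplicated list of node names. A rough sketch of that normalization, in the spirit of the ListSplitUnique function in the script, is shown here; the sample lines deliberately include stray whitespace, an empty entry and a repeated node.

```powershell
# Sample node list lines (a repeated node and an empty entry are included on purpose)
$lines = @(
    'HDP-NN1.ACME.LAB,HDP-NN2.ACME.LAB, HDP-MGMT1.ACME.LAB',
    'HDP-DATA01.ACME.LAB,,HDP-NN1.ACME.LAB'
)

$nodes = @(
    $lines |
        ForEach-Object { $_ -split ',' } |    # break each line into names
        ForEach-Object { $_.Trim() } |        # drop surrounding whitespace
        Where-Object { $_ } |                 # drop empty entries
        Sort-Object -Unique                   # sort and remove duplicates
)

$nodes
```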

One last thing worth mentioning about the deployment script is that it groups the cluster nodes into batches, sized by the batch_size parameter, and runs the changes in parallel on the nodes within each batch. This parallel grouping is especially important when using the script to manage dependency software installations.
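
The batching itself is simple to picture in isolation. The following is a minimal sketch of splitting a node list into batches of a given size; the function name Split-IntoBatches is hypothetical, and unlike the script it clamps the upper index of the last batch explicitly.

```powershell
function Split-IntoBatches([string[]] $Items, [int] $BatchSize) {
    if ($BatchSize -lt 1) { $BatchSize = 1 }
    for ($i = 0; $i -lt $Items.Count; $i += $BatchSize) {
        # index of the last item in this batch, clamped to the end of the list
        $last = [Math]::Min($i + $BatchSize, $Items.Count) - 1
        # the leading comma emits the batch as one array instead of unrolling it
        , $Items[$i..$last]
    }
}

$nodes = 'HDP-DATA01', 'HDP-DATA02', 'HDP-DATA03', 'HDP-DATA04', 'HDP-DATA05'
$batches = @(Split-IntoBatches $nodes 2)

foreach ($batch in $batches) {
    write-output ("Batch: " + ($batch -join " "))
}
```

With five nodes and a batch size of 2 this gives three batches, the last one holding a single node. In the deployment script each batch is handed to a single Invoke-Command call, so a larger batch size means more nodes are being changed at the same time.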

Comments

  • Anonymous
    August 07, 2014
    "Once again PowerShell provides an easy solution." - Really? On Linux you can do this with one ssh or scp command!