Udostępnij za pośrednictwem


Use Hash Tables To Go Faster Than PowerShell Compare-Object

Compare-Object gotcha down? Slower than my old 300 baud modem? Have no fear. Today we go faster using hash tables.

Let me state first that I love the cmdlet Compare-Object, and I have used it many times with great results. But at scale my customer had some serious performance issues.

The Problem - “I feel the need. The need for speed.”

So my customer has employed all the tricks from the last blog post on making your scripts go faster. But still the script takes hours to run. Between each command he dropped a timestamp into a log file. The culprit… Compare-Object. That single command was taking hours.

But let’s be fair. He’s comparing about 800,000 email addresses between two lists. It would take me weeks to do that by hand with a pencil and paper. Compare-Object is pretty quick at 13 hours. But let’s get this down to seconds.

The Research

First things first. What exactly is Compare-Object doing? To find out, view the source code over at the PowerShell open source GitHub location. So I did that. But I’m not a .NET developer. However, I did notice the comments starting on line 120 helped me understand what it does. That is very similar to my idea.

All I know is that when I want list processing to go faster in PowerShell I use hash tables. I’ll write my own version in native PowerShell and see if it is faster.

The Approach

We have two lists, and we need to know what is different. We want to make the most efficient use of both memory and computation.

If I compare every item in List1 against every item in List2, well, that’s going to take a while (n*m).

Each list comes in as an array. I need to look up all the items in List1 against List2. The fastest way to do lookups is with a hash table.

To find the differences, I will delete the matching entries from both List1 and List2. Arrays are slow at removing a single item, so again I will use hash tables.

After deleting all of the equal values, the only things left in each list are the unique values.

If you want to see what is equal, then I will stuff that into a third list (hash table) containing only the equal values.

The Code

I have placed the hash table comparison into a function called Compare-Object2.

 <#
.SYNOPSIS
Faster version of Compare-Object for large data sets with a single value.
.DESCRIPTION
Uses hash tables to improve comparison performance for large data sets.
.PARAMETER ReferenceObject
Specifies an array of objects used as a reference for comparison.
.PARAMETER DifferenceObject
Specifies the objects that are compared to the reference objects.
.PARAMETER IncludeEqual
Indicates that this cmdlet displays characteristics of compared objects that
are equal. By default, only characteristics that differ between the reference
and difference objects are displayed.
.PARAMETER ExcludeDifferent
Indicates that this cmdlet displays only the characteristics of compared
objects that are equal.
.EXAMPLE
Compare-Object2 -ReferenceObject 'a','b','c' -DifferenceObject 'c','d','e' `
    -IncludeEqual -ExcludeDifferent
.EXAMPLE
Compare-Object2 -ReferenceObject (Get-Content .\file1.txt) `
    -DifferenceObject (Get-Content .\file2.txt)
.EXAMPLE
$p1 = Get-Process
notepad
$p2 = Get-Process
Compare-Object2 -ReferenceObject $p1.Id -DifferenceObject $p2.Id
.NOTES
Does not support objects with properties. Expand the single property you want
to compare before passing it in.
Includes optimization to run even faster when -IncludeEqual is omitted.
#>            
function Compare-Object2 {            
param(            
    [psobject[]]            
    $ReferenceObject,            
    [psobject[]]            
    $DifferenceObject,            
    [switch]            
    $IncludeEqual,            
    [switch]            
    $ExcludeDifferent            
)            
            
    # Put the difference array into a hash table,            
    # then destroy the original array variable for memory efficiency.            
    $DifHash = @{}            
    $DifferenceObject | ForEach-Object {$DifHash.Add($_,$null)}            
    Remove-Variable -Name DifferenceObject            
            
    # Put the reference array into a hash table.            
    # Keep the original array for enumeration use.            
    $RefHash = @{}            
    for ($i=0;$i -lt $ReferenceObject.Count;$i++) {            
        $RefHash.Add($ReferenceObject[$i],$null)            
    }            
            
    # This code is ugly but faster.            
    # Do the IF only once per run instead of every iteration of the ForEach.            
    If ($IncludeEqual) {            
        $EqualHash = @{}            
        # You cannot enumerate with ForEach over a hash table while you remove            
        # items from it.            
        # Must use the static array of reference to enumerate the items.            
        ForEach ($Item in $ReferenceObject) {            
            If ($DifHash.ContainsKey($Item)) {            
                $DifHash.Remove($Item)            
                $RefHash.Remove($Item)            
                $EqualHash.Add($Item,$null)            
            }            
        }            
    } Else {            
        ForEach ($Item in $ReferenceObject) {            
            If ($DifHash.ContainsKey($Item)) {            
                $DifHash.Remove($Item)            
                $RefHash.Remove($Item)            
            }            
        }            
    }            
            
    If ($IncludeEqual) {            
        $EqualHash.Keys | Select-Object @{Name='InputObject';Expression={$_}},`
            @{Name='SideIndicator';Expression={'=='}}            
    }            
            
    If (-not $ExcludeDifferent) {            
        $RefHash.Keys | Select-Object @{Name='InputObject';Expression={$_}},`
            @{Name='SideIndicator';Expression={'<='}}            
        $DifHash.Keys | Select-Object @{Name='InputObject';Expression={$_}},`
            @{Name='SideIndicator';Expression={'=>'}}            
    }            
}            

Note that for my purposes I did not need to compare multiple properties, so this approach does not entirely duplicate functionality of the native Compare-Object. You could probably adapt this code for that purpose. I would drop each list object into a hash table value, while making the key a string representation of the one or more properties to be compared. I’ll leave that bit up to you.

Also note that, yes, I used ForEach. General consensus is that ForEach is slower than For. Feel free to adjust and see if that makes a difference in execution time for you.

The Results

 # Native Compare-Object            
Measure-Command -Expression {            
    Compare-Object -ReferenceObject (Get-Content .\file1.txt) `
        -DifferenceObject (Get-Content .\file2.txt) -IncludeEqual            
} | Select-Object TotalMilliseconds            
            
# Hash table comparison            
Measure-Command -Expression {            
    Compare-Object2 -ReferenceObject (Get-Content .\file1.txt) `
        -DifferenceObject (Get-Content .\file2.txt) -IncludeEqual            
} | Select-Object TotalMilliseconds

When racing the native Compare-Object against my hash table implementation here are the results:

  • For test lists of 1,000 items, Compare-Object finishes in five seconds while the hash table version finishes in <1 second.
  • For test lists of 100,000 items, the hash table finishes in five seconds while Compare-Object had not finished after multiple minutes (so I just killed the task).
  • For the customer’s 800,000 items, the hash table finished in 30 minutes, as opposed to 13 hours for Compare-Object. To be fair, the script does other tasks besides this Compare-Object. Regardless that is a 25x performance improvement!

How is that for efficiency gain?!

The Moral of the Story

Learn hash tables today! They are the single most versatile, powerful, and fun data structure in all of PowerShell. Let me know your results in the comments area below.

“Goose, it’s time to buzz the tower.”

Updated Code

Edit - Aug 29, 2017 - Find an updated version of this script that takes multiple properties for comparison on my GitHub here.

Comments

  • Anonymous
    August 08, 2017
    1 min to ETL the data into SQL via SSIS / other methods<1min to SQL Compare via T-SQL = 2 mins vs 30 minsJob done...
    • Anonymous
      August 11, 2017
      @guest: That's not an appropriate comparison.You missing the cost of hardware, license for windows, license for SQL or an equivalent MS agreement. Then its the cost of having IT/DBA that know how to install, setup, maintain and they write the T-SQL to do the compare.its like saying:On my 256 core /1TBram I did all at under 0.5 sec vs your 2 min...at what cost ?
  • Anonymous
    August 31, 2017
    Unfortunately this cmdlet will throw exception when ReferenceObject or DifferenceObject contains non-unique values (hashtable keys have to be unique).
    • Anonymous
      September 07, 2017
      You can use smth like this:$DifHash.$_ = $null or$DifHash.$_ ++ - this will count how many duplicate values you havenot sure, though, how this change will impact memory\time consumption.