Duplicate Files 2

A long time ago I posted a filter (AddNote) for adding notes to objects. Some time later I posted a function (Get-MD5) for calculating the MD5 hash of a file, and somebody asked how that could be used in a script to list all the files in a given folder that are very likely the same. I like that question because the answer lets me combine both of these in a way I find pretty neat. First of all, let's create another filter called AttachMD5.

filter AttachMD5
{
  $md5Hash = Get-MD5 $_
  return ($_ | AddNote MD5 $md5Hash)
}

The filter expects to get a [System.IO.FileInfo] object via the pipeline. It calculates the file's MD5 hash, uses the AddNote filter to attach the hash as a note called MD5, and finally returns the object.

MSH>$foo = dir test.txt | AttachMD5
MSH>$foo.MD5
216 129 182 155 10 202 51 188 245 219 199 220 92 68 140 194
MSH>
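If you don't have the earlier posts handy, here is a rough sketch of what the two helpers could look like. It targets released PowerShell rather than the Monad beta used in this post, and the bodies of Get-MD5 and AddNote are assumptions about how those helpers behave, not the originals. One deliberate difference: this Get-MD5 returns the hash bytes joined into a single string instead of a byte array, which keeps the value easy to compare and group.

```powershell
# Hypothetical stand-ins for the helpers from the earlier posts,
# written for released PowerShell (not the Monad beta).

function Get-MD5 {
    param([System.IO.FileInfo] $File)
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $stream = $File.OpenRead()
    try {
        # Join the 16 hash bytes into one space-separated string so the
        # result behaves like an ordinary comparable value.
        return ($md5.ComputeHash($stream) -join ' ')
    }
    finally {
        $stream.Dispose()
        $md5.Dispose()
    }
}

filter AddNote {
    param([string] $Name, $Value)
    # Attach a note property and pass the same object down the pipeline.
    $_ | Add-Member -MemberType NoteProperty -Name $Name -Value $Value -PassThru
}

# Same idea as the AttachMD5 filter above, repeated so this block runs alone.
filter AttachMD5 {
    $_ | AddNote MD5 (Get-MD5 $_)
}
```

With these loaded, `Get-Item test.txt | AttachMD5` gives back the same FileInfo object with an extra MD5 property.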

Now we have all the pieces we need to write a script that tells us whether any files are very likely duplicates. The plan is to get a list of FileInfo objects, attach the MD5 to each one, group by Length and MD5, and finally print every group that has more than one item. Here is one way to do that:

$input |
  where { $_ -is [System.IO.FileInfo] } |
  AttachMD5 |
  group-object Length,MD5 |
  where { $_.Count -gt 1 } |
  foreach { "$($_.Group | foreach { $_.FullName } )" }
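To see what the last three stages do without touching the disk, the sketch below feeds a few hand-made objects through the same grouping. The paths, lengths and MD5 strings are made up for illustration, and the syntax is released PowerShell rather than the Monad beta. Each group carries a Count and a Group collection, and putting an array inside a double-quoted string joins its items with a space, which is why every set of duplicates ends up on a single line.

```powershell
# Made-up records standing in for FileInfo objects that already carry an MD5 note.
$files = @(
    [pscustomobject]@{ FullName = 'C:\temp\a.txt'; Length = 5; MD5 = 'hash-abc' }
    [pscustomobject]@{ FullName = 'C:\temp\b.txt'; Length = 5; MD5 = 'hash-xyz' }
    [pscustomobject]@{ FullName = 'C:\temp\c.txt'; Length = 5; MD5 = 'hash-abc' }
    [pscustomobject]@{ FullName = 'C:\temp\e.txt'; Length = 5; MD5 = 'hash-jkl' }
)

$lines = $files |
    Group-Object Length, MD5 |          # one group per (Length, MD5) pair
    Where-Object { $_.Count -gt 1 } |   # keep only pairs shared by 2+ files
    ForEach-Object { "$($_.Group | ForEach-Object { $_.FullName })" }

$lines
```

Only a.txt and c.txt share both a length and a hash here, so they are the only files reported.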

Take that bit and copy it into a script along with the other functions and filters, and let's try it out.

MSH>"abc" > a.txt
MSH>"xyz" > b.txt
MSH>"abc" > c.txt
MSH>"xyz" > d.txt
MSH>"jkl" > e.txt
MSH>"abc" > f.txt
MSH>dir | c:\monad\getdups.msh
C:\temp\a.txt C:\temp\c.txt C:\temp\f.txt
C:\temp\b.txt C:\temp\d.txt
MSH>

If we wanted to find all the very likely duplicates in a whole directory tree, we could just recurse through it and pipe the results to the script:

MSH>dir . -recurse | c:\monad\getdups.msh

Now… you should know that this script isn't exactly the most performant thing in the world. After all, it calculates the MD5 hash for every file, which isn't really necessary: two files with different lengths can never be identical. I'll leave improving the performance as an exercise for you guys. One quick way to improve it would be to group by Length first, discard all the groups that don't have more than one file, and only then calculate the MD5. Want to measure whether you are really improving performance? Give the time-expression cmdlet a try.

MSH>time-expression { dir | getdups.msh }
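For anyone who wants a head start on that exercise, here is one possible shape for the length-first version. It is only a sketch and assumes released PowerShell: the function name Get-LikelyDuplicate is made up for this example, and the built-in Get-FileHash cmdlet stands in for Get-MD5. Files whose length is unique are dropped before any hashing happens.

```powershell
# Hypothetical length-first variant of getdups for released PowerShell.
function Get-LikelyDuplicate {
    $input |
        Where-Object { $_ -is [System.IO.FileInfo] } |
        Group-Object Length |                  # cheap first pass: size only
        Where-Object { $_.Count -gt 1 } |      # a unique size cannot have a duplicate
        ForEach-Object { $_.Group } |          # unpack the surviving files
        ForEach-Object {
            # Hash only the files that share their size with another file.
            $hash = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash
            $_ | Add-Member -MemberType NoteProperty -Name MD5 -Value $hash -PassThru -Force
        } |
        Group-Object Length, MD5 |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { "$($_.Group | ForEach-Object { $_.FullName })" }
}
```

Usage would look like `Get-ChildItem -Recurse | Get-LikelyDuplicate`. In released PowerShell the timing cmdlet is called Measure-Command, so `Measure-Command { Get-ChildItem -Recurse | Get-LikelyDuplicate }` is the way to compare it against the original pipeline.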