SharePoint 2007: PowerShell script to Identify Duplicate Files in Your MOSS Environment
I recently came across a customer requirement to check for duplicate files in their MOSS environment. The content database grows over time, and it grows even faster if users store the same file in several locations instead of linking to a single copy of the document.
Credits
As a blueprint, I used this blog post for SP2010:
http://blog.pointbeyond.com/2011/08/24/finding-duplicate-documents-in-sharepoint-using-powershell/
Background
In that post, an MD5 hash is used to identify duplicates. Unfortunately, a document's MD5 hash changes as soon as MOSS writes properties into an MS Office document. This process is called property demotion (the reverse direction, from the document into MOSS, is property promotion).
That is why the MD5 hash did not give much insight. To have a simple but working solution, I chose the document's file name instead. This gives at least a first indicator of duplicates.
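To illustrate why a content hash stops matching, here is a minimal sketch that hashes two byte arrays differing only in an embedded property value. The helper name Get-Md5Hex and the sample strings are made up for this example; a real Office file stores its properties differently, but the effect on the hash is the same.

```powershell
# Hypothetical helper: returns the MD5 hash of a byte array as a hex string.
function Get-Md5Hex([byte[]]$Bytes) {
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $hash = $md5.ComputeHash($Bytes)
    $md5.Clear()
    return ([System.BitConverter]::ToString($hash) -replace '-', '')
}

$enc = [System.Text.Encoding]::UTF8

# Same "document body", but the second copy has a property value written into it
# (a stand-in for what happens when MOSS writes column values into a file).
$uploaded = $enc.GetBytes('Body text|Title=')
$demoted  = $enc.GetBytes('Body text|Title=Quarterly report')

$hashUploaded = Get-Md5Hex -Bytes $uploaded
$hashDemoted  = Get-Md5Hex -Bytes $demoted

# The hashes differ although a user would consider both files the same document.
Write-Host "Uploaded: $hashUploaded"
Write-Host "Demoted:  $hashDemoted"
Write-Host "Equal:    $($hashUploaded -eq $hashDemoted)"
```

Two copies of the same upload that end up in libraries with different column values can therefore hash differently, which is why the file name is used as the comparison key here.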
This PowerShell script does not scale well, but it delivers the correct results in the end. Because it enumerates every item in every document library, it can put additional load on the farm, so I would run it on a server that does not directly answer user requests, such as the index server. When it finishes, it writes a CSV file called duplicates.csv to the D: volume.
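The core idea, grouping records by file name and exporting every group with more than one member, can be tried without a SharePoint farm. The following is a rough sketch under that assumption: the helper New-FileRecord, the sample URLs, and the temp-file name duplicates-sample.csv are all invented for illustration.

```powershell
# Hypothetical helper: builds one record with the two fields compared later.
function New-FileRecord($FileName, $FullPath) {
    $record = New-Object PSObject
    $record | Add-Member NoteProperty FileName $FileName
    $record | Add-Member NoteProperty FullPath $FullPath
    return $record
}

# Sample data standing in for the items collected from the document libraries.
$items = @(
    (New-FileRecord 'Budget.xlsx' 'http://sp2007/sites/a/Docs/Budget.xlsx'),
    (New-FileRecord 'budget.xlsx' 'http://sp2007/sites/b/Shared/budget.xlsx'),
    (New-FileRecord 'Plan.docx'   'http://sp2007/sites/a/Docs/Plan.docx')
)

# Group-Object compares strings case-insensitively by default, so both
# budget files land in one group; Plan.docx is alone and is filtered out.
$duplicates = @($items |
    Group-Object FileName |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group })

# Write the result to a temp file and read it back to inspect it.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'duplicates-sample.csv'
$duplicates | Export-Csv $csvPath -NoTypeInformation
$rows = @(Import-Csv $csvPath)
$rows | Format-Table FileName, FullPath
```

Matching on file name alone also flags unrelated files that merely share a name, so the resulting list is a starting point for review rather than a final verdict.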
Script
I slightly changed the source script to remove nested loops, which reduced the run time significantly, by up to 99% in my test environment.
# MOSS 2007 has no SharePoint snap-in, so the object model is loaded directly.
# SP2010 equivalent: Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint") | Out-Null

function Get-DuplicateFiles ($RootSiteUrl)
{
    # SP2010 equivalent: $spSite = Get-SPSite -Identity $RootSiteUrl
    $spSite = New-Object Microsoft.SharePoint.SPSite($RootSiteUrl)
    $Items = @()             # one record per non-empty file
    $duplicates = @()        # rows written to the CSV file
    $duplicatesHelper = @()  # full paths that were already added to $duplicates

    foreach ($spWeb in $spSite.AllWebs)
    {
        Write-Host "Checking" $spWeb.Title "for duplicate documents"
        foreach ($list in $spWeb.Lists)
        {
            # Only document libraries; skip libraries whose URL starts with "_" and SitePages
            if ($list.BaseType -eq "DocumentLibrary" -and $list.RootFolder.Url -notlike "_*" -and $list.RootFolder.Url -notlike "SitePages*")
            {
                foreach ($item in $list.Items)
                {
                    if ($item.File.Length -gt 0)
                    {
                        $record = New-Object PSObject
                        $record | Add-Member NoteProperty FileName ($item.File.Name)
                        $record | Add-Member NoteProperty FullPath ($spWeb.Url + "/" + $item.Url)
                        $Items += $record
                    }
                }
            }
        }
        $spWeb.Dispose()
    }
    $spSite.Dispose()

    # Group by file name; every group with more than one member is a duplicate candidate
    $duplicateItems = $Items | Group-Object FileName | Where-Object { $_.Count -gt 1 }

    foreach ($dup in $duplicateItems)
    {
        # Each group already holds the records that share this file name
        foreach ($item in $dup.Group)
        {
            if ($duplicatesHelper -notcontains $item.FullPath)
            {
                $duplicatesHelper += $item.FullPath
                $found = New-Object PSObject
                $found | Add-Member NoteProperty FileName ($item.FileName)
                $found | Add-Member NoteProperty FullPath ($item.FullPath)
                $duplicates += $found
            }
        }
    }

    # -NoTypeInformation omits the "#TYPE" header line from the CSV file
    $duplicates | Export-Csv d:\duplicates.csv -NoTypeInformation
}

Get-DuplicateFiles "http://sp2007/"