SharePoint 2007: PowerShell script to Identify Duplicate Files in Your MOSS Environment

I recently had a customer requirement to check for duplicate files in their MOSS environment. Content databases grow, and they grow even faster when users store the same file in several locations instead of linking to a single copy.

Credits

I used this blog post for SP2010 as a blueprint:

http://blog.pointbeyond.com/2011/08/24/finding-duplicate-documents-in-sharepoint-using-powershell/

Background

In that post an MD5 hash is used to identify duplicates. Unfortunately, a document's MD5 hash changes as soon as MOSS writes properties into an MS Office document. This process is called property promotion/demotion.
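To illustrate why a content hash is fragile here, the following minimal sketch hashes two byte arrays that differ only by a few trailing bytes. They stand in for a document before and after a property is written into it. The helper name Get-MD5Hash and the sample strings are illustrative assumptions and are not taken from the referenced script.

```powershell
# Hypothetical helper: returns the MD5 hash of a byte array as a hex string.
function Get-MD5Hash([byte[]]$Bytes)
{
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $hash = $md5.ComputeHash($Bytes)
    $md5.Clear()   # release the hash algorithm's resources
    return [System.BitConverter]::ToString($hash).Replace("-", "")
}

# Placeholder content: the second array simulates the same document
# after a property value has been written into the file.
$original = [System.Text.Encoding]::UTF8.GetBytes("Quarterly report; Title=")
$promoted = [System.Text.Encoding]::UTF8.GetBytes("Quarterly report; Title=Q3")

$hashOriginal = Get-MD5Hash $original
$hashPromoted = Get-MD5Hash $promoted

# The visible content is the same, but the hashes no longer match.
Write-Host "Original:" $hashOriginal
Write-Host "Promoted:" $hashPromoted
Write-Host "Hashes equal:" ($hashOriginal -eq $hashPromoted)
```

Any change to the stored bytes produces a different hash, so two copies of the same document can look unrelated to a hash comparison.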

That is why the MD5 hash did not give much insight. To keep the solution simple but working, I chose to compare the documents' file names instead. This gives at least a first indication of duplicates.
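The file name comparison can be sketched in isolation with Group-Object. The records below are made-up sample data, the URLs are placeholders, and the helper New-FileRecord is a hypothetical name introduced only for this example.

```powershell
# Hypothetical helper that builds a record with the two properties
# used for the comparison: FileName and FullPath.
function New-FileRecord($FileName, $FullPath)
{
    $record = New-Object -TypeName System.Object
    $record | Add-Member NoteProperty FileName $FileName
    $record | Add-Member NoteProperty FullPath $FullPath
    return $record
}

# Made-up sample data; the URLs are placeholders.
$sample = @(
    (New-FileRecord "Budget.xlsx" "http://sp2007/sites/finance/Docs/Budget.xlsx"),
    (New-FileRecord "Budget.xlsx" "http://sp2007/sites/sales/Shared/Budget.xlsx"),
    (New-FileRecord "Agenda.docx" "http://sp2007/sites/sales/Shared/Agenda.docx")
)

# Group by file name and keep only names that occur more than once.
$groups = @($sample | Group-Object FileName | Where-Object { $_.Count -gt 1 })

foreach ($g in $groups)
{
    Write-Host $g.Name "appears" $g.Count "times"
    foreach ($entry in $g.Group) { Write-Host "   " $entry.FullPath }
}
```

Files that share a name but differ in content are reported as well, which is why the result is only a first indicator.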

This PowerShell script does not scale well, but it delivers the appropriate results in the end. It might cause additional load on the farm, so I would run it on a server that does not directly answer user requests, such as the index server. On completion it creates a CSV file called duplicates.csv on the D: volume.
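One possible contributor to the poor scaling (an assumption on my side, not something measured for this script) is collecting results with the += operator on an array, because PowerShell creates a new array and copies all existing elements on every addition. A minimal sketch of an alternative is to collect into a System.Collections.ArrayList; the loop and file names below are synthetic stand-ins for the list items.

```powershell
# Collect records in an ArrayList, which grows in place instead of
# being copied on every addition.
$items = New-Object System.Collections.ArrayList

# Synthetic loop standing in for the iteration over list items.
foreach ($i in 1..1000)
{
    $record = New-Object -TypeName System.Object
    $record | Add-Member NoteProperty FileName ("File$i.docx")
    # Add() returns the new index; discard it to keep the output clean.
    [void]$items.Add($record)
}

Write-Host "Collected" $items.Count "records"
```

The collected list can be piped to Group-Object in the same way as an array.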

Script

I slightly changed the source script to remove nested loops, which reduced the run time significantly: by up to 99% in my test environment.

# For SharePoint 2010 and later, load the snap-in instead of the assembly:
#Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")

function Get-DuplicateFiles ($RootSiteUrl)
{
    #$spSite = Get-SPSite -Identity $RootSiteUrl
    $spSite = New-Object Microsoft.SharePoint.SPSite($RootSiteUrl)
    $Items = @()
    $duplicateItems = @()
    $duplicatesHelper = @()
    $duplicates = @()   # was never initialized in the original listing

    # Collect file name and full URL of every non-empty document.
    foreach ($spWeb in $spSite.AllWebs)
    {
        Write-Host "Checking" $spWeb.Title "for duplicate documents"
        foreach ($list in $spWeb.Lists)
        {
            # Only document libraries, skipping system folders and SitePages.
            if ($list.BaseType -eq "DocumentLibrary" -and $list.RootFolder.Url -notlike "_*" -and $list.RootFolder.Url -notlike "SitePages*")
            {
                foreach ($item in $list.Items)
                {
                    if ($item.File.Length -gt 0)
                    {
                        $record = New-Object -TypeName System.Object
                        $record | Add-Member NoteProperty FileName ($item.File.Name)
                        $record | Add-Member NoteProperty FullPath ($spWeb.Url + "/" + $item.Url)
                        $Items += $record
                    }
                }
            }
        }
        $spWeb.Dispose()
    }
    $spSite.Dispose()

    # File names that occur more than once are duplicate candidates.
    $duplicateItems = $Items | Group-Object FileName | Where-Object { $_.Count -gt 1 }

    foreach ($dup in $duplicateItems)
    {
        # $dup.Group already holds the records that share this file name.
        foreach ($item in $dup.Group)
        {
            if ($duplicatesHelper -notcontains $item.FullPath)
            {
                $duplicatesHelper += $item.FullPath
                $found = New-Object -TypeName System.Object
                $found | Add-Member NoteProperty Filename ($item.FileName)
                $found | Add-Member NoteProperty Fullpath ($item.FullPath)
                $duplicates += $found
            }
        }
    }

    $duplicates | Export-Csv d:\duplicates.csv
}

Get-DuplicateFiles "http://sp2007/"
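Once the script has finished, the CSV file can be summarized with Import-Csv. The sketch below is only an illustration: it writes a small sample file to a temporary path instead of reading d:\duplicates.csv, and the rows are made-up data that assume the Filename and Fullpath columns produced by the script.

```powershell
# Write a small sample file; replace $csvPath with d:\duplicates.csv
# to inspect real results. The rows are made-up placeholder data.
$csvPath = [System.IO.Path]::GetTempFileName()
Set-Content -Path $csvPath -Value @(
    '"Filename","Fullpath"',
    '"Budget.xlsx","http://sp2007/sites/finance/Docs/Budget.xlsx"',
    '"Budget.xlsx","http://sp2007/sites/sales/Shared/Budget.xlsx"',
    '"Budget.xlsx","http://sp2007/sites/hr/Docs/Budget.xlsx"',
    '"Agenda.docx","http://sp2007/sites/sales/Shared/Agenda.docx"',
    '"Agenda.docx","http://sp2007/sites/hr/Docs/Agenda.docx"'
)

# Count the copies per file name, most frequent first.
$rows = @(Import-Csv $csvPath)
$summary = @($rows | Group-Object Filename | Sort-Object Count -Descending)

foreach ($entry in $summary)
{
    Write-Host $entry.Count "copies of" $entry.Name
}

Remove-Item $csvPath
```

Sorting by count puts the most frequently copied files at the top, which are usually the best candidates to replace with links.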