SharePoint Online: Find Duplicate Documents using PowerShell

44785390 0 Reputation points
2025-01-07T17:04:14.4366667+00:00

Could someone post a working script for getting a list of duplicate files in SharePoint Online?

The previously posted scripts no longer work.

SharePoint Development
PowerShell

2 answers

  1. amcsaid 881 Reputation points
    2025-01-07T19:44:43.83+00:00

    Hey there,

    I would approach this by writing a script that lists every file in a given SharePoint site (or in specific folders) and exports the results to a CSV.

    From that CSV you can detect duplicates by comparing file names, sizes, creation dates, and other metadata.
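A minimal sketch of that comparison, assuming the CSV has `FileName`, `FileSize`, and `URL` columns (hypothetical column names; adjust them to whatever your export produces). The sample rows are made up and stand in for `Import-Csv` output. Matching name and size only flags candidates, it does not prove the contents are identical.

```powershell
# Hypothetical sample rows standing in for an exported file inventory.
# In practice you would load them with: $rows = Import-Csv -Path <your CSV>
$rows = @(
    [pscustomobject]@{ FileName = 'Budget.xlsx'; FileSize = 1024; URL = '/sites/demo/Docs/Budget.xlsx' }
    [pscustomobject]@{ FileName = 'Budget.xlsx'; FileSize = 1024; URL = '/sites/demo/Archive/Budget.xlsx' }
    [pscustomobject]@{ FileName = 'Notes.docx';  FileSize = 2048; URL = '/sites/demo/Docs/Notes.docx' }
)

# Group on name and size together; any group with more than one row is a candidate set
$candidates = $rows |
    Group-Object -Property FileName, FileSize |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group }

$candidates | Format-Table -AutoSize
```

Adding a creation date or another metadata column to `-Property` narrows the candidates further.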

    You can then use those results as you please.
    Good luck!


  2. Ling Zhou_MSFT 20,165 Reputation points Microsoft Vendor
    2025-01-08T02:38:36.0366667+00:00

    Hi @44785390,

    Thanks for reaching out to us. We are very pleased to assist you.

    We can use PnP PowerShell to find duplicate files in SharePoint Online.

    1. Preparations:
      • If you have not installed PnP PowerShell before, please install PnP PowerShell first.
        Note: Microsoft is providing this information as a convenience to you. The sites are not controlled by Microsoft. Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. Please make sure that you completely understand the risk before retrieving any suggestions from the above link.
    2. This PowerShell script scans all files in all document libraries of a site, collects each file's name, hash, URL, and size for comparison, prints the likely duplicates, and writes a CSV report with all the data. Please remember to replace the parameter values ($SiteURL, $ReportOutput, $ClientId) with your own.
    #Parameters
    $SiteURL = "https://contoso.sharepoint.com/sites/24Dec"
    $Pagesize = 2000
    $ReportOutput = "C:\Duplicates.csv"
    $ClientId = "7cda65de-xxxxx-b2-a485-bf6e2a70a909"
     
    #Connect to SharePoint Online site
    Connect-PnPOnline $SiteURL -ClientId $ClientId -Interactive 
      
    #Array to store results
    $DataCollection = @()
     
    #Get all Document libraries
    $DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}
     
    #Iterate through each document library
    ForEach($Library in $DocumentLibraries)
    {   
        #Get All documents from the library
        $global:counter = 0;
        $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
            { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
                 "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
       
        $ItemCounter = 0
        #Iterate through each document
        Foreach($Document in $Documents)
        {
            #Get the File from Item
            $File = Get-PnPProperty -ClientObject $Document -Property File
     
            #Get The File Hash
            $Bytes = $File.OpenBinaryStream()
            Invoke-PnPQuery
            #MD5CryptoServiceProvider is marked obsolete in newer .NET versions; MD5.Create() returns an equivalent hasher
            $MD5 = [System.Security.Cryptography.MD5]::Create()
            $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
      
            #Collect data       
            $Data = New-Object PSObject
            $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
            $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
            $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
            $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length       
            $DataCollection += $Data
            $ItemCounter++
            Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
                         -Status "Reading Data from Document '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
        }
    }
    #Get Duplicate Files by Grouping Hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-table -AutoSize
    #Group Based on File Name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicate Based on File Name:"
    $FileNameDuplicates| Format-table -AutoSize
    #Group Based on File Size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates| Format-table -AutoSize
     
    #Export the collected data for all scanned files to CSV (rows sharing a HashCode are duplicates)
    $DataCollection | Export-Csv -Path $ReportOutput -NoTypeInformation
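
The script above exports every scanned file. If you only want the confirmed duplicates in the report, one option is to keep just the hash groups with more than one member and number them. The following is a minimal sketch of that idea, not part of the original script: it runs without a SharePoint connection, the in-memory sample "files", the `DuplicateGroup` column, and the temp output path are all assumptions for illustration.

```powershell
# Hypothetical in-memory "files" standing in for content downloaded from SharePoint
$samples = @(
    @{ Url = '/sites/demo/Docs/a.txt';    Text = 'hello world' }
    @{ Url = '/sites/demo/Archive/b.txt'; Text = 'hello world' }
    @{ Url = '/sites/demo/Docs/c.txt';    Text = 'something else' }
)

# Hash each sample's bytes with MD5, the same algorithm the script above uses
$md5 = [System.Security.Cryptography.MD5]::Create()
$inventory = foreach ($s in $samples) {
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($s.Text)
    [pscustomobject]@{
        URL      = $s.Url
        FileSize = $bytes.Length
        HashCode = [System.BitConverter]::ToString($md5.ComputeHash($bytes))
    }
}
$md5.Dispose()

# Keep only hash groups with more than one member and tag each row with a group number
$groupId = 0
$duplicates = $inventory |
    Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object {
        $groupId++
        foreach ($item in $_.Group) {
            [pscustomobject]@{
                DuplicateGroup = $groupId
                URL            = $item.URL
                FileSize       = $item.FileSize
                HashCode       = $item.HashCode
            }
        }
    }

# Hypothetical output path in the temp folder; point this wherever you want the report
$duplicatesReport = Join-Path ([System.IO.Path]::GetTempPath()) 'DuplicatesOnly.csv'
$duplicates | Export-Csv -Path $duplicatesReport -NoTypeInformation
```

In the full script, the same grouping could be applied to `$DataCollection` before the export. Name-only and size-only matches are best treated as hints, since different files can share a name or a size.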
    

    If you have any questions, please do not hesitate to contact me.

    Moreover, if this resolves the issue, please click "Accept Answer" so that we can better archive the case and other community members facing the same issue can benefit from it.

    Your kind contribution is much appreciated.

