SharePoint Online: Find Duplicate Documents using PowerShell – File Hash Method (more than 5000 items)

Ayesh jamal 21 Reputation points
2021-11-07T12:40:36.53+00:00

I am trying to find all duplicate files within a single document library that holds more than 5000 items, but I am not able to merge the two scripts together.

Please refer to this script for getting more than 5000 items: https://www.sharepointdiary.com/2016/12/sharepoint-online-get-all-items-from-large-lists-powershell-csom.html

My issue is that I am not able to update the following script to support a document library with more than 5000 items. If someone can guide me, please refer to the following link: https://www.sharepointdiary.com/2019/04/sharepoint-online-find-duplicate-files-using-powershell.html

#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://x.sharepoint.com"
$ListName ="Documents"

#Array to Results Data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the Context
    $Con = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Con.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web
    $Web = $Con.Web
    $Con.Load($Web)

    #Get all List items from the library - Exclude "Folder" objects
    $List = $Con.Web.Lists.GetByTitle($ListName)
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml="<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"
    $ListItems = $List.GetItems($Query)
    $Con.Load($ListItems)
    $Con.ExecuteQuery()

    $Count=1
    ForEach($Item in $ListItems)
    {
        #Get the File from Item
        $File = $Item.File
        $Con.Load($File)
        $Con.ExecuteQuery()
        Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $count of $($ListItems.Count)" -Status "Scanning File '$($File.Name)'"

        #Get The File Hash
        $Bytes = $Item.file.OpenBinaryStream()
        $Con.ExecuteQuery()
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

        #Collect data       

        $Data = New-Object PSObject
        $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
        $DataCollection += $Data


        $Count++
    }
    #$DataCollection
    #Get Duplicate Files
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    If($Duplicates.Count -gt 1)
    {

        $Duplicates | Out-GridView


    }
    Else
    {
        Write-host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    write-host -f Red "Error:" $_.Exception.Message
}


$Duplicates | export-csv -Path c:\tmp\so.csv -NoTypeInformation

Accepted answer
  1. Rich Matheisen 47,496 Reputation points
    2021-11-07T15:19:13.403+00:00

    Try changing the query to take smaller batches instead of trying to eat the results all at once!

    # RowLimit must sit inside <View>; Paged="TRUE" makes the server return results in pages
    $BatchSize = 1000
    $Query.ViewXml=@"
    <View Scope='RecursiveAll'>
        <Query>
            <Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where>
        </Query>
        <RowLimit Paged="TRUE">$BatchSize</RowLimit>
    </View>
    "@
    
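
The ViewXml above only sets the page size; the pages still have to be requested one by one. A minimal sketch of that loop follows, assuming the $Query from above plus the $List and client context ($Con in the question's script) already exist. Get-AllListItems is a hypothetical helper name, not a CSOM function. It relies on the ListItemCollectionPosition property, which CSOM returns with each page and which is null after the last page.

```powershell
# Sketch only: Get-AllListItems is a hypothetical helper, not part of CSOM.
# Assumes -Context, -List and -Query are the CSOM objects built in the original script,
# and that the query's ViewXml contains a paged RowLimit.
function Get-AllListItems {
    param($Context, $List, $Query)

    $AllItems = New-Object 'System.Collections.Generic.List[object]'
    do {
        # Request the next page (at most RowLimit items)
        $Page = $List.GetItems($Query)
        $Context.Load($Page)
        $Context.ExecuteQuery()

        foreach ($Item in $Page) { $AllItems.Add($Item) }

        # Carry the paging position forward; it is $null after the last page
        $Query.ListItemCollectionPosition = $Page.ListItemCollectionPosition
    } while ($null -ne $Query.ListItemCollectionPosition)

    # The leading comma keeps the list from being unrolled on return
    return , $AllItems
}
```

To try it, the three lines $ListItems = $List.GetItems($Query), $Con.Load($ListItems) and $Con.ExecuteQuery() in the original script would be replaced with $ListItems = Get-AllListItems -Context $Con -List $List -Query $Query. The rest of the script can stay as it is, since the returned list supports both ForEach and .Count. If the Where clause on FSObjType still triggers the list view threshold error on a very large library, a possible workaround is to drop the Where clause and skip folders inside the ForEach loop; that variant is untested here.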

1 additional answer

  1. Elsie Lu_MSFT 9,796 Reputation points
    2021-11-08T06:54:21.63+00:00

    Hi @Rich Matheisen, welcome to the Q&A forum!

    In my test, the script mentioned above works normally in my environment and detects duplicate files. Are you seeing any issues or error messages in your environment?

    #Load SharePoint CSOM Assemblies  
    Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"  
    Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"  
       
    #Parameters  
    $SiteURL = "https://crescenttech.sharepoint.com"  
    $ListName ="Documents"  
       
    #Array to Results Data  
    $DataCollection = @()  
       
    #Get credentials to connect  
    $Cred = Get-Credential  
       
    Try {  
        #Setup the Context  
        $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)  
        $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)  
       
        #Get the Web  
        $Web = $Ctx.Web  
        $Ctx.Load($Web)  
       
        #Get all List items from the library - Exclude "Folder" objects  
        $List = $Ctx.Web.Lists.GetByTitle($ListName)  
        $Query = New-Object Microsoft.SharePoint.Client.CamlQuery  
        $Query.ViewXml="<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"  
        $ListItems = $List.GetItems($Query)  
        $Ctx.Load($ListItems)  
        $Ctx.ExecuteQuery()  
           
        $Count=1  
        ForEach($Item in $ListItems)  
        {  
            #Get the File from Item  
            $File = $Item.File  
            $Ctx.Load($File)  
            $Ctx.ExecuteQuery()  
            Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $count of $($ListItems.Count)" -Status "Scanning File '$($File.Name)'"  
       
            #Get The File Hash  
            $Bytes = $Item.file.OpenBinaryStream()  
            $Ctx.ExecuteQuery()  
            $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider  
            $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))  
       
            #Collect data         
            $Data = New-Object PSObject  
            $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name  
            $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode  
            $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl  
            $DataCollection += $Data  
       
            $Count++  
        }  
        #$DataCollection  
        #Get Duplicate Files  
        $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group  
        If($Duplicates.Count -gt 1)  
        {  
            $Duplicates | Out-GridView  
        }  
        Else  
        {  
            Write-host -f Yellow "No Duplicates Found!"  
        }  
    }  
    Catch {  
        write-host -f Red "Error:" $_.Exception.Message  
    }  
    

    Test Result: (screenshot 147248-112.png)



