SharePoint: Using PowerShell to Group Filter and Sort Documents by Metadata
Introduction
You can use PowerShell to group documents (or list items) based on metadata. You can use grouping to find documents that are duplicates, to find the most active authors for a certain metadata (field) combination (i.e. group by author, then by document type), or to create a report on metadata combinations.
ListViews can be used to produce some of this functionality, but won't work in some scenarios. One such scenario is grouping documents/listitems by certain metadata (e.g. customer id, document type), then only displaying a list of groups that contain more than one document/listitem for a given combination of metadata (fields).
The PowerShell functions that are required to perform this sort of task are all standard PowerShell cmdlets (out of the box). Most of the following script examples make use of three PowerShell cmdlets, Group-Object, Where-Object, Sort-Object.
In the examples below, we are querying a marketing list that contains hundreds of marketing publications and submissions. The document library has a number of metadata fields used to describe those documents, including the following fields:
* Marketing Document Type (the type of document. E.g. Publication, Submission, etc.)
* Year (the year the publication or submission relates to)
* Group (the business group)
The examples below demonstrate a number of business cases, an example of the script used and the report that gets created.
The basic structure of each example uses piping, to pass collections between cmdlets. The collections of items are grouped, filtered, sorted, and counted along the way.
Getting Started
All of the examples below use a DataTable as the source list of objects that get passed to the Group-Object cmdlet, which we can get from the following lines of code:
Get the collection of items from the document library and store it in a variable.
$w = Get-SPWeb "http://corporation/marketing";
$l = $w.Lists["Legal Directories"];
$items = $l.Items;
Get all of the items in the listitem collection, return it as a DataTable and store it in the $dt variable.
$dt = $items.GetDataTable();
Store the static names of the fields we'll be using to group and sort by, into variables (to keep the scripts manageable and make the script commands shorter).
$fAuthor = "Author";
$fYear = "iYear";
$fDocType = "iMarketingDocumentClassification";
$fGroup = "iStrand";
Before we hop into the examples, let's take a quick look at each object returned along the pipeline, what it's type is, what properties are exposed, and most importantly, how we use them.
The example we are going to use is grouping by the Author, then Document Type fields. Using the object returned by the Group-Object cmdlet, we are sorting on the Document Type property, then the Count property. The final collection of grouped and sorted objects is passed into the $ad variable. We then pass the $ad variable to Format-Table, displaying the following columns; Count, Authors Name, Document Type.
The first part of the example is piping the DataTable object to the Group-Object cmdlet, and grouping it by the Author field, then by the Document Type field.
$go = $dt | Group-Object $fAuthor,$fDocType
If we have a look at the type of object returned, we see that the Group-Object cmdlet has given us a System.Array full of Objects.
If we list out the contents of $go, we see the three properties; Count (the number of items in the group), Name (the values of each set of grouped fields) and Group (containing the DataRows in the group).
When we look at the first object in the collection of objects in $go, we see the type is GroupInfo (Microsoft.PowerShell.Commands.GroupInfo). The documentation for the GroupInfo class can be seen here, GroupInfo. The GroupInfo class actually has a forth property, Values, which contains the values of the elements in the group (seen in the screen shot below, as well as in the MSDN Documentation).
Now that we have grouped our list items, and have them in an Array full of GroupInfo objects, we need to sort them into an order ready for displaying. In this example, we want to sort them by Document Type, and then by the Count of each group.
$ad = $go | Sort-Object -Property {$_.Values[1]},{$_.Count} -Descending
We do this by piping the results from Group-Object to Sort-Object, and specifying the properties to sort on.
The first property we sort on is $_.Values[1]. $_ represents the current GroupInfo object in the pipeline. We know the GroupInfo object has the Value property, which contains the values of the elements in the group (from the screen, we can see is the Author and Document Type field values).
The second field we are sorting on, is $_.Count. $_, as just discussed, is a GroupInfo object, so we know it has a Count property that contains the number of elements in the group.
Finally, we add the -Descending switch to Sort-Object, to order the results in descending order. The results from Sort-Object (an Array of GroupInfo objects) are then returned and stored in the $ad variable.
Now that we have our results grouped and sorted the way we want them, we can display the results in a table format, using the Format-Table cmdlet. However, to display the columns we want, we need to use formatting instructions to get the values of Document Type and Author, because these values are not direct properties of the GroupInfo object (the GroupInfo object's properties are Count, Group, Name, and Values). If you're interested in learning more about formatting instructions, there's a great article on the Microsoft TechNet Scripting site that is worth reading, Creating Custom Tables.
$ad | FT Count,@{Label="Name";Expression={$_.Values[0]}},@{Label="Document Type";Expression={$_.Values[1]}} -Autosize
The full script can be condensed a little more, piping the DataTable to Group-Object, piping the Array of GroupInfo (returned from Group-Object) to Sort-Object, and finally piping the results to Format-Table for displaying.
$dt | Group-Object $fAuthor,$fDocType | Sort-Object -Property {$_.Values[1]},{$_.Count} -Descending | FT Count,@{Label="Name";Expression={$_.Values[0]}},@{Label="Document Type";Expression={$_.Values[1]}} -Autosize
Examples
The following examples build on the script above, using the same DataTable ($dt) variable and fields.
Example One: Listing Authors who Frequently Create Documents
Group documents by Author, Then by Document Type. Then get all the groups that contain five or more documents with the same author and document type value, and write them out to screen, including each documents ListItemId.
This report will list the people who regularly add documents of certain types (e.g. publications) to the marketing document library
$ad = $dt | Group-Object $fAuthor,$fDocType | Where-Object{$_.Count -gt 5}
foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Author";Expression={$_["Author"]}}}
Example Two: Looking for Potential Duplicates by Author
Group documents by Author, Then by Document Type, then by Business Group. Then get all the groups with the same author, document type value and business group, where the group contains more than one item (e.g. a potential duplicate) and write them out to screen, including each documents ListItemId.
This is an example of something a SharePoint List View can't achieve. A list view can do the grouping, but can't filter out groups that only contain one document.
$ad = $dt | Group-Object $fAuthor,$fDocType,$fGroup | Where-Object{$_.Count -gt 1}
foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Group";Expression={$_["iStrand"]}},@{Label="Author";Expression={$_["Author"]}}}
Example Three: Looking for Duplicate Submissions
Filter the documents by the document type, then Group the documents by Year, Then by Group. Then get all the groups that contain more than 1 (submission) documents with the same year for the same business group, and write them out to screen, include each documents ListItemId
$ad = $dt | ?{$_[$fDocType] -eq "Directory Submission"} | Group-Object $fYear,$fGroup | Where-Object{$_.Count -gt 1 }
foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Business Group";Expression={$_["iStrand"]}},@{Label="Year";Expression={$_["iYear"]}}}
Example Four: Looking for Duplicate Documents (Same Document Type, Year and Business Group)
Group documents by Year, Then by Document Type, Then by Group. Then get all the groups that contain two or more documents with the same year, document type and business group values (identifying the potential duplicates), and write them out to screen, include each documents ListItemId.
$ad = $dt | Group-Object $fYear,$fDocType,$fGroup | Where-Object{$_.Count -gt 1}
foreach($d in $ad){$d.Group | Format-Table @{Label="Name";Expression={$_["FileLeafRef"]}},@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Type";Expression={$_["iMarketingDocumentClassification"]}},@{Label="Year";Expression={$_["iYear"]}},@{Label="Group";Expression={$_["iStrand"]}},@{Label="Author";Expression={$_["Author"]}}}
Continuing to build on the previous example where we have identified potential duplicate documents, we will add a new column for tracking the potential duplicate documents, and set the value of that column to true for all the identified documents. This field can then be used to create a List View in the SharePoint UI, that a person from the Marketing team can use to review the suspect documents and take an appropriate action.
First, we create the new field, then refresh the list reference and DataTable
$l.Fields.Add("PotentialDuplicate", [Microsoft.SharePoint.SPFieldType]::Boolean , $false);
$l.Update();
$items = $l.Items;
$dt = $items.GetDataTable();
Next, using the items we grouped and sorted into the $ad variable in example four, we loop through the collection of group objects, looping through each group object, finally updating the PotentialDuplicate field of each item.
foreach($group in $ad){foreach($item in $group.Group){$li = $l.Items.GetItemById($item.ID); $li["PotentialDuplicate"] = $true;$li.Update();}}
Summary
Well, that's it! There's lots of PowerShell in there, but once you get a feel for the GroupInfo object collection that the Group-Object cmdlet returns, and how to use it with Sort-Object, Where-Object and Format-Table, it's really easy to group, filter, sort and count item collections!
See Also
- Using Group-Object
- Group-Object (Documentation)
- Where-Object (Documentation)
- Using Sort-Object
- Sort-Object (Documentation)
- Format-Table (Documentation)
- Microsoft.PowerShell.Commands.GroupInfo (Documentation)