Learn about search and analytics settings in eDiscovery cases

Article
03/05/2025

You can configure settings for each eDiscovery case to control the following functionality:

Near duplicates and email threading
Themes
Autogenerated review set query
Ignore text
Optical character recognition

Configure analytics settings for a case

To configure search and analytics settings for a case:

Go to the Microsoft Purview portal and sign in using the credentials for a user account assigned eDiscovery permissions.
Select the eDiscovery solution card and then select Cases in the left nav.
Select a case, then select Case settings.
On the Case settings page, select Search & analytics.
The case Search & analytics page is displayed. These settings are applied to all review sets in a case.
After selecting the applicable search and analytics options, select Save.

The following sections in this article describe the analytics settings that you can configure for a case.

Near duplicates and email threading

In this section, you can set parameters for duplicate detection, near duplicate detection, and email threading.

Near duplicates/email threading: When turned on, duplicate detection, near duplicate detection, and email threading are included as part of the workflow when you run analytics on the data in a review set.
Document and email similarity threshold: If the similarity level for two documents is over the threshold, both documents are put in the same near duplicate set.
Minimum/maximum number of words: These settings specify that near duplicates and email threading analysis are performed only on documents that have at least the minimum number of words and at most the maximum number of words.
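As a rough illustration of how the minimum/maximum word settings gate the analysis, the following Python sketch models the three settings and checks whether a document qualifies. The class name, field names, and default values are hypothetical and chosen only for illustration. They aren't the product's defaults or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NearDuplicateSettings:
    """Hypothetical container mirroring the settings described above."""
    enabled: bool = True
    similarity_threshold: float = 0.65  # illustrative value, not a product default
    min_words: int = 10                 # illustrative value, not a product default
    max_words: int = 500_000            # illustrative value, not a product default


def is_eligible(text: str, settings: NearDuplicateSettings) -> bool:
    """Return True if the document falls within the configured word-count range."""
    if not settings.enabled:
        return False
    # Assumption: words are counted by splitting on whitespace.
    word_count = len(text.split())
    return settings.min_words <= word_count <= settings.max_words
```

In this sketch, a document outside the word-count range is skipped by near duplicate and email threading analysis, which matches the behavior described for the minimum and maximum settings.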

Near duplicate detection

Consider a set of documents to be reviewed in which a subset is based on the same template and has mostly the same boilerplate language, with a few differences here and there. If a reviewer could identify this subset, review one of them thoroughly, and review the differences for the rest, they wouldn't miss any unique information while taking only a fraction of the time it would take to read all documents cover to cover. Near duplicate detection groups textually similar documents together to help you make your review process more efficient.

When near duplicate detection is run, the system parses every document with text. Then, it compares each document against every other document to determine whether their similarity is greater than the set threshold. If it is, the documents are grouped together. Once all documents are compared and grouped, a document from each group is marked as the "pivot." When reviewing your documents, you can review a pivot first and then review the other documents in the same near duplicate set, focusing on the differences between the pivot and the document in review.
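The grouping idea can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm eDiscovery uses: it assumes Jaccard similarity over three-word shingles as the similarity measure, and it greedily treats the first document of each group as the pivot instead of comparing all pairs first. All function names are hypothetical.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping word sequences ("shingles") of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(docs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Map each pivot document ID to the IDs in its near duplicate set (pivot first)."""
    groups: dict[str, list[str]] = {}
    pivot_shingles: dict[str, set] = {}
    for doc_id, text in docs.items():
        current = shingles(text)
        for pivot_id, pivot_set in pivot_shingles.items():
            # A document joins the first group whose pivot it resembles closely enough.
            if jaccard(current, pivot_set) > threshold:
                groups[pivot_id].append(doc_id)
                break
        else:
            # No similar pivot found: this document starts a new group as its pivot.
            pivot_shingles[doc_id] = current
            groups[doc_id] = [doc_id]
    return groups
```

For example, two contracts that differ by a single word would land in the same set under a moderate threshold, while an unrelated document would form its own set. A reviewer would then read the pivot in full and only the differences in the rest.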

Email threading

Consider an email conversation that has been going on for a while. In most cases, the last message in the email thread includes the contents of all the preceding messages. Therefore, reviewing the last message gives a complete context of the conversation that happened in the thread. Email threading identifies such messages so that reviewers can review a fraction of collected documents without losing any context.

Email threading in eDiscovery is the process of organizing a sequence of related emails that are part of the same conversation. This includes the initial email and all subsequent replies and forwards linked to the original email. By grouping these emails into threads, reviewers see the entire context of a conversation, making it easier to understand the flow of communication. This approach helps identify relevant information more efficiently and eliminates the need to review each email individually. Email messages included in the analytics process have the following metadata populated:

Is Inclusive: This field identifies whether an email contains all the unique content from a thread, including all previous replies. It ensures that only the most comprehensive email in a thread is reviewed, which is essential for understanding the full context of the conversation without having to review each individual reply.
Has Unique Attachments: This field marks emails that contain attachments not found in other emails within the same thread. Even if the email content is duplicated, unique attachments are flagged to ensure that all relevant documents are reviewed. This is important in the legal review process to ensure that no unique evidence is overlooked, even if the email body itself is not unique.

How is it different from conversations in Outlook?

At a glance, this sounds similar to conversation groupings in Outlook. However, there are some important distinctions. Consider an email conversation that got forked into two conversations; for instance, someone responded to an email that isn't the latest in the conversation so the last two emails in the conversation both have unique content.

Outlook would still group the emails into a single conversation; reading only the last email might miss the context of the second-to-last email, which also contains unique content. Because email threading parses out each email into individual components and compares them, email threading would mark both of the last two emails as inclusive, ensuring that you won't miss any context as long as you read all emails marked as inclusive.

Let's also consider an email thread with multiple replies, where some replies include inline responses that modify the quoted content. If an inline reply alters part of the previous email, the latest reply doesn't fully encompass the content of the earlier email. Both the latest reply and the earlier email with unique content are marked as inclusive. This approach ensures that any unique information from the inline reply is preserved and not overlooked.
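The forked-conversation and inline-reply cases above can be sketched as follows. This is a minimal illustration that assumes each email has already been parsed into a set of content components. The rule used here (an email is inclusive unless some other email in the thread contains all of its components and more) is an assumption that matches the examples in this article, not a description of the actual implementation. The function name and data shape are hypothetical.

```python
def mark_inclusive(thread: dict[str, frozenset[str]]) -> dict[str, bool]:
    """Mark each email as inclusive unless another email fully contains its content.

    `thread` maps a message ID to the set of content components found in that
    message (its own new text plus any quoted earlier text).
    """
    return {
        msg_id: not any(
            # `<` on frozensets is a strict-subset test, so two emails with
            # identical content don't rule each other out in this sketch.
            other_id != msg_id and segments < other_segments
            for other_id, other_segments in thread.items()
        )
        for msg_id, segments in thread.items()
    }
```

With a thread where `m3` replies to `m2` and `m4` replies to the older `m1`, both `m3` and `m4` come out as inclusive, which mirrors the forked-conversation example. If an inline reply edits the quoted text, the original component no longer appears in the reply, so the earlier email stays inclusive as well.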

Themes

In this section, you can set the following parameters for themes:

Themes: When turned on, themes clustering is performed as part of the workflow when you run analytics on the data in a review set.
Maximum number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in a review set.
Include numbers in themes: When turned on, numbers (that identify a theme) are included when generating themes.
Adjust maximum number of themes dynamically: In certain situations, there might not be enough documents in a review set to produce the desired number of themes. When this setting is enabled, eDiscovery adjusts the maximum number of themes dynamically rather than attempting to enforce the maximum number of themes.

When you create a new document, you generally start with one or more ideas that you want to convey, and then compose the document using words that align with these ideas. The more prevalent an idea is, the more frequently the words related to that idea tend to appear. This pattern also aligns with how readers consume documents. The important things to understand from reading a document are the main ideas it's trying to convey, including which ideas appear where and what the relationships between the ideas are.

This process can be extended to how an eDiscovery reviewer wants to consume a set of documents in a case. They want to see which ideas are present in the review sets and which documents are talking about those ideas. If they find a particular document of interest, they want to be able to see documents that discuss similar ideas.

The Themes functionality in eDiscovery attempts to mimic how humans reason about documents, by analyzing the themes that are discussed in a review set and assigning a theme to documents in the review set. In eDiscovery, Themes goes one step further and identifies the dominant theme in each review set and document. The dominant theme is the one that appears the most often in a document.

How do themes work?

The Themes functionality analyzes documents with text in a review set to parse out common themes that appear across all the documents in the review set. eDiscovery assigns those themes to the documents in which they appear. It also labels each theme with the words used in the documents that are representative of the theme. Because a document can contain various types of subject matter, eDiscovery often assigns multiple themes to review sets and documents. This is referred to as the Themes list. The theme that appears most prominently in a review set or document is designated as its dominant theme.
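To make the relationship between the Themes list and the dominant theme concrete, here's a small Python sketch. It assumes themes are already known and are each labeled by a set of representative words, and it scores a theme by counting how often those words occur in a document. The real Themes model isn't described in this article, so the scoring method and all names here are hypothetical.

```python
from collections import Counter
from typing import Optional


def themes_for_document(text: str, theme_words: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Return the document's Themes list as (theme, score) pairs, highest score first."""
    counts = Counter(text.lower().split())
    scored = [
        (theme, sum(counts[word] for word in words))
        for theme, words in theme_words.items()
    ]
    # Keep only themes that actually appear in the document.
    present = [item for item in scored if item[1] > 0]
    return sorted(present, key=lambda item: -item[1])


def dominant_theme(text: str, theme_words: dict[str, set[str]]) -> Optional[str]:
    """Return the theme that appears most often, or None if no theme appears."""
    ranked = themes_for_document(text, theme_words)
    return ranked[0][0] if ranked else None
```

A document that mentions pricing terms three times and shipping terms once would get both themes in its Themes list, with pricing as the dominant theme.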

Configuring Themes

Themes are supported for cases and apply to all the review sets within them. You can configure the settings for themes when you create a new case or you can update the theme settings for an existing case.

To configure themes in a case, complete the following steps:

Go to the Microsoft Purview portal and sign in using the credentials for a user account assigned eDiscovery permissions.
Select the eDiscovery solution card and then select Cases (preview) in the left nav.
Select a case, then select Case settings.
On the Case settings page, select Search & analytics.
Select the following theme options as applicable:
- Max number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in review sets included in a case. For more information on limits, see Limits in eDiscovery.
- Include numbers in themes: Numbers (that identify a theme) are included when generating themes.
- Adjust maximum number of themes dynamically: In certain situations, there might not be enough documents in a review set to produce the desired number of themes for the case. When this setting is enabled, the maximum number of themes is adjusted dynamically rather than attempting to enforce the maximum number of themes.
If you need to exclude keywords associated with themes, enter the text or regular expression needed in the Ignore text field. In the Apply to field, select Themes to apply the text or regular expression to all themes.
Select Save.

After a new case is created, analytics are automatically run on the data when the review sets are added to the case. Themes for the review sets are generated as part of the analytics processing.

Review set query

If you select the Automatically create a For Review saved search after analytics checkbox, eDiscovery autogenerates a review set query named For Review.

This query filters out duplicate items from the review set, allowing you to quickly review the unique items in the review set. This query is created only when you run analytics for a review set in the case. For more information about review set queries, see Query the data in a review set.

Ignore text

There are situations where certain text diminishes the quality of analytics, such as lengthy disclaimers that get added to email messages regardless of the content of the email. If you know of text that should be ignored, you can exclude it from analytics by specifying the text string and the analytics functionality (near-duplicates, email threading, themes, and relevance) that the text should be excluded for. Using regular expressions (RegEx) for ignored text is also supported.
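Before entering a regular expression in the Ignore text field, it can help to try it against sample text. The following Python sketch uses the standard `re` module to show the effect of removing a disclaimer before analysis. The function name is hypothetical, and the regular expression flavor and matching options that eDiscovery applies might differ from Python's, so treat this as a way to reason about a pattern rather than as a guarantee of product behavior.

```python
import re


def apply_ignore_text(text: str, patterns: list[str]) -> str:
    """Remove every match of each pattern, then normalize whitespace."""
    for pattern in patterns:
        # Assumption: DOTALL lets a pattern span line breaks in long disclaimers.
        text = re.sub(pattern, " ", text, flags=re.DOTALL)
    return " ".join(text.split())


# Example patterns: one regular expression and one literal string.
disclaimer = r"This email and any attachments are confidential\..*?delete it\."
banner = re.escape("CONFIDENTIAL - DO NOT FORWARD")
```

Use `re.escape` when the ignored text is a literal string, so that characters such as periods and parentheses aren't interpreted as pattern syntax.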

Optical character recognition (OCR)

When this setting is turned on, OCR processing runs on image files. When OCR is applied to image files, text in these files is available in search results. OCR runs only on items processed during Advanced indexing (if this option is selected in the search query).

For example, if a large PDF file that is partially indexed or had other indexing errors is processed during Advanced indexing, OCR is applied. OCR processing only occurs on files that are reindexed during the Advanced indexing process. This means there might be situations where content is added to a review set, but some email attachments aren't processed for OCR because these files aren't processed during Advanced indexing.
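The conditions above can be summarized as a simple predicate. This Python sketch only restates the rules from this section. The type and field names are hypothetical and don't correspond to any eDiscovery API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical view of a collected item for this illustration."""
    name: str
    reindexed_in_advanced_indexing: bool


def ocr_applies(item: Item, ocr_setting_on: bool, advanced_indexing_selected: bool) -> bool:
    """OCR runs only when all three conditions described above hold."""
    return (
        ocr_setting_on
        and advanced_indexing_selected
        and item.reindexed_in_advanced_indexing
    )
```

An email attachment that wasn't reindexed during Advanced indexing returns `False` here even when the OCR setting is on, which is the situation the preceding paragraph warns about.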

After data is added to a review set, image text can be reviewed, searched, tagged, and analyzed. You can view the extracted text in the Text viewer of the selected image file in the review set. For more information, see: