What kind of email/document properties are considered when determining duplicate files?

장경원 0 Reputation points
2024-09-02T00:15:45.08+00:00

I am using a Microsoft E5 (no Teams) license to export some content, and I need an explanation of the following:

  1. According to https://learn.microsoft.com/en-us/purview/ediscovery-de-duplication-in-search-results#more-information, a combination of three email properties is used to determine duplicate documents. However, it is ambiguous which email metadata (e.g., SMTP fields, body text) goes into ConversationTopic and BodyTagInfo. I want to know the specific parameters used to calculate these two properties.
  2. I also want to know how content other than email messages, such as Office documents created or uploaded in the cloud (e.g., PowerPoint, Teams, SharePoint), is deduplicated.
  3. Finally, if the search condition includes multiple custodians and locations, how are the results deduplicated across them?

1 answer

  1. NIKHILA NETHIKUNTA 3,505 Reputation points Microsoft Vendor
    2024-09-03T07:28:47.7633333+00:00

    @장경원
    Thank you for the question and for using the Microsoft Q&A platform.
    According to the Microsoft documentation, the three email properties used to determine duplicate documents are:

    1. ConversationTopic
    2. BodyTagInfo
    3. InternetMessageId

    ConversationTopic is a property that represents the subject of the conversation thread a message belongs to. All messages in the same conversation (the initial message and every reply to it) share the same value, which is typically the Subject line of the initial message that started the conversation.

    BodyTagInfo is an internal Exchange store property used to identify differences between message bodies. Its value is calculated from various attributes of the message body, but the exact parameters used to calculate it are not publicly documented by Microsoft.

    As for your second question, non-email content such as Office documents created or uploaded in the cloud is deduplicated based on the content of the documents rather than on any specific metadata: the system compares the content of each document to identify duplicates.
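To make "comparing content rather than metadata" concrete, here is a minimal sketch of content-based deduplication in general. It is not Microsoft's implementation: the use of SHA-256, the function names, and the sample paths are all assumptions made for illustration, since the comparison method is not described above.

```python
import hashlib
from typing import Dict, List


def content_digest(data: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever comparison the service uses.
    return hashlib.sha256(data).hexdigest()


def deduplicate_documents(documents: Dict[str, bytes]) -> List[str]:
    """Return the name of the first document seen for each distinct content."""
    first_seen: Dict[str, str] = {}
    for name, data in documents.items():
        # File name and location are ignored; only the bytes decide duplication.
        first_seen.setdefault(content_digest(data), name)
    return list(first_seen.values())


# Hypothetical sample data: two copies with identical bytes, one distinct file.
docs = {
    "sites/finance/plan.pptx": b"slide deck bytes",
    "onedrive/alice/plan-copy.pptx": b"slide deck bytes",
    "sites/finance/notes.docx": b"meeting notes",
}
kept = deduplicate_documents(docs)
```

In this sketch the copy under a different name and location is dropped because its bytes match a document already seen.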

    Finally, if you have multiple custodians and locations in the search condition, the system will deduplicate the results based on the combination of the three email properties mentioned earlier. This means that if two emails have the same ConversationTopic, BodyTagInfo, and InternetMessageId, they will be considered duplicates regardless of the custodian or location they came from.
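The rule described above (same ConversationTopic, BodyTagInfo, and InternetMessageId means duplicate, regardless of custodian) can be sketched as a lookup on a three-part key. This is only an illustration under stated assumptions: the `Email` type, the `deduplicate` function, and the sample values are hypothetical, and `body_tag_info` is treated as an opaque string because its real calculation is not documented.

```python
from typing import Iterable, List, NamedTuple


class Email(NamedTuple):
    custodian: str            # mailbox the copy was found in (not part of the key)
    internet_message_id: str
    conversation_topic: str
    body_tag_info: str        # opaque placeholder; real calculation is undocumented


def deduplicate(emails: Iterable[Email]) -> List[Email]:
    """Keep the first copy seen for each (message id, topic, body tag) key."""
    seen = set()
    unique = []
    for mail in emails:
        key = (mail.internet_message_id, mail.conversation_topic, mail.body_tag_info)
        if key not in seen:
            seen.add(key)
            unique.append(mail)
    return unique


# Hypothetical results from two custodians' mailboxes.
results = [
    Email("alice@contoso.com", "<m1@contoso.com>", "Q3 budget", "tag-a"),
    Email("bob@contoso.com", "<m1@contoso.com>", "Q3 budget", "tag-a"),
    Email("bob@contoso.com", "<m2@contoso.com>", "Q3 budget", "tag-b"),
]
unique = deduplicate(results)
```

Because the custodian is not part of the key, the copy in the second mailbox is treated as a duplicate of the first, while the reply with a different message ID and body tag is kept.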

    I hope this helps! Let us know if you have any further questions.
