Increase classifier accuracy

Article
01/02/2025

Classifiers, such as sensitive information types (SITs) and trainable classifiers, are used in various types of policies to identify sensitive information. Like most such models, they sometimes identify an item as sensitive when it isn't (a false positive), or fail to identify an item as sensitive when it actually is (a false negative).
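As a minimal illustration of these terms, the Python sketch below names the outcome of a single classifier decision. The function name, the file names, and the "actually sensitive" column are hypothetical and exist only for this example; they aren't part of any Microsoft Purview API.

```python
def outcome(flagged: bool, actually_sensitive: bool) -> str:
    """Name the result of one classifier decision."""
    if flagged and actually_sensitive:
        return "true positive"
    if flagged and not actually_sensitive:
        return "false positive"
    if not flagged and actually_sensitive:
        return "false negative"
    return "true negative"


# Hypothetical items: (name, flagged by the classifier, actually sensitive)
items = [
    ("payroll.xlsx", True, True),
    ("lunch-menu.docx", True, False),
    ("patient-notes.docx", False, True),
    ("newsletter.docx", False, False),
]

results = {name: outcome(flagged, sensitive) for name, flagged, sensitive in items}
for name, result in results.items():
    print(f"{name}: {result}")
```

Match/Not a match feedback covers only items that a classifier flagged, so it separates true positives from false positives. It doesn't surface false negatives.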

This article shows you how to confirm whether items matched by a classifier are true positives (a Match) or false positives (Not a match) and how to provide Match/Not a match feedback. You can use that feedback to tune your classifiers to increase accuracy. To help increase the accuracy of the classifiers that Microsoft provides, you can also send redacted versions of the document, along with the Match/Not a match feedback, to Microsoft.

The Match, Not a match and Contextual Summary experiences are available in:

Data Explorer - for SharePoint sites and OneDrive sites
Content Explorer - for SharePoint sites and OneDrive sites
Sensitive Information Type Matched Items page - for SharePoint sites and OneDrive sites
Trainable Classifier Matched Items page - for SharePoint sites and OneDrive sites
Microsoft Purview Data Loss Prevention (DLP) Alerts page - for SharePoint sites, OneDrive sites, and emails in Exchange
Microsoft Threat Protection (MTP) Alerts page - for SharePoint sites, OneDrive sites, and emails in Exchange

The Contextual Summary experience is available in:

Microsoft Purview Information Protection (MIP) Auto-labeling simulation matched items - for SharePoint sites and OneDrive sites

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview trials hub. Learn details about signing up and trial terms.

Applies to

Classifier	Contextual summary	Redacted preview panel	Match and Not a Match
SIT	Yes	Yes	Yes
Custom SIT	Yes	No	Yes
Fingerprint SIT	No	No	Yes
Exact data match SIT	No*	No	No
Named entities	No*	No	No
Credential scan	No	No	No
Built-in trainable classifiers	Yes**	Yes	Yes
Custom trainable classifier	No	No	Yes

* Contextual summary for these classifiers is supported in MIP Auto-labeling simulation matched items for SharePoint sites and OneDrive sites.

** See the list of built-in trainable classifiers to find out which ones support contextual summary.

Important

The Match/Not a match feedback and contextual summary experiences support items in:

SharePoint sites and OneDrive sites - for Content Explorer, Sensitive Information Type and Trainable Classifier Matched Items, DLP Alerts, and MTP Alerts
Emails in Exchange - for DLP Alerts and MTP Alerts

The contextual summary experience supports items in:

SharePoint sites and OneDrive sites - for MIP simulation matched items

Licensing and Subscriptions

For information on the relevant licensing and subscriptions, see the licensing requirements for Data classification analytics: Overview Content & Activity Explorer.

Known limitations

The contextual summary only shows a limited number of matches in any given item, not all matches.
The contextual summary and feedback experience is only available for items created or updated after the feedback experience was enabled for the tenant. Items that were classified before the feature was enabled may not have the contextual summary and feedback experience available.

How to evaluate match accuracy and provide feedback

The contextual summary experience, where you indicate whether a matched item is a true positive (Match) or a false positive (Not a match), is similar across all of the places it surfaces.

Important

You must have already deployed DLP policies that use either SITs or trainable classifiers to OneDrive sites, SharePoint sites, or Exchange mailboxes. Items must also have matched those policies before anything appears on the Contextual summary page.

Using Content Explorer

This example shows you how to use the Contextual Summary tab to give feedback.

Depending on the portal you're using, navigate to one of the following locations:
- Sign in to the Microsoft Purview portal > Solutions > Data Lifecycle Management > Explorers > Content explorer.
- Sign in to the Microsoft Purview compliance portal > Solutions > Data classification > Content explorer.
Type the name of the SIT or trainable classifier that you want to check matches for in Filter on labels, info types, or categories.
Select the SIT.
Select the location and make sure that there's a non-zero value in the Files column. (The only supported locations are SharePoint and OneDrive.)
Open the folder and then select a document.
Select the link in the Sensitive info type column for the document to see which SITs the item matched and the confidence level.
Choose Close.
Open a document and select the Contextual Summary tab.
Review the item and confirm whether or not it's a match.
If it's a match, choose Close. You're finished.
If it's not a match, choose Not a match.
If you make a mistake and choose the wrong option, select Withdraw feedback next to Close. This returns the item to its state before feedback, so you can choose Match or Not a match again.
Review the item and redact or un-redact any text.
Choose Close.

Using Sensitive Information Type Matched Items page

You can access the same feedback mechanisms in the Sensitive Info types page.

Depending on the portal you're using, navigate to one of the following locations:
- Sign in to the Microsoft Purview portal > Solutions > Data Lifecycle Management > Classifiers > Sensitive info types.
- Sign in to the Microsoft Purview compliance portal > Solutions > Data classification > Classifiers > Sensitive info types.
In the Search field, enter the name of the SIT whose accuracy you want to check.
Open the SIT. This brings up the Overview tab. Here you can see the number of items that match, the number of items that aren't a match, and the number of items with feedback.
Select the Matched items tab.
Make sure that there's a non-zero value in the Files column, then open the folder and select a document. Only SharePoint and OneDrive are supported locations here.
Select the link in the Sensitive info type column for an item to see which SITs the item matched and the confidence level.
Choose Close.
Open a document and then select the Contextual Summary tab.
Review the item and confirm whether it's a match.
If it's a match, choose Match and then Close.
If it isn't a match, choose Not a Match.
If you make a mistake and select the wrong option, select Withdraw feedback next to Close. This returns the item to its state before feedback, so you can choose Match or Not a match again.
Choose Close.

Using Trainable Classifier Matched Items page

Depending on the portal you're using, navigate to one of the following locations:
- Sign in to the Microsoft Purview portal > Solutions > Data Lifecycle Management > Classifiers > Trainable classifiers.
- Sign in to the Microsoft Purview compliance portal > Solutions > Data classification > Classifiers > Trainable classifiers.
Select the trainable classifier whose accuracy you want to check.
Open the trainable classifier. This brings up the Overview tab. Here you can see the number of items that match, the number of items that aren't a match, and the number of items with feedback.
Select the Matched items tab.
Make sure that there's a non-zero value in the Files column, then open the folder and select a document. Only SharePoint and OneDrive are supported locations here.
Open a document and then select the Contextual Summary tab.
Review the item and confirm whether it's a match.
If it's a match, choose Match and then choose Close.
If it isn't a match, choose Not a Match.
If you make a mistake and select the wrong option, choose Withdraw feedback next to Close. This returns the item to its state before feedback, so you can choose Match or Not a match again.
Choose Close.

Using Data Loss Prevention Alerts page

Depending on the portal you're using, navigate to one of the following locations:
- Sign in to the Microsoft Purview portal > Solutions > Data loss prevention > Alerts.
- Sign in to the Microsoft Purview compliance portal > Solutions > Data loss prevention > Alerts.
Choose an alert.
Choose View details.
Choose the Events tab.
Maximize the Details tab.
Review the item and confirm whether it's a match.
Choose Actions.
If it's a match, close the window. You're finished.
If it's not a match, choose Actions and then Not a match.
Review the item and redact or un-redact any text.
Close the window.

Using the feedback to tune your classifiers

If your SITs or trainable classifiers are returning too many false positives based on the feedback, try some of these options to refine them and increase their accuracy.
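One way to judge whether a classifier needs tuning is to tally the feedback you gave and work out what share of the reviewed items were false positives. The Python sketch below assumes you recorded each review by hand as the string "Match" or "Not a match". The function name, the sample data, and the 25% cut-off are hypothetical, and the sketch doesn't read anything from Microsoft Purview.

```python
from collections import Counter


def false_positive_share(feedback: list[str]) -> float:
    """Share of reviewed items marked 'Not a match' (0.0 if nothing was reviewed)."""
    counts = Counter(feedback)
    reviewed = counts["Match"] + counts["Not a match"]
    if reviewed == 0:
        return 0.0
    return counts["Not a match"] / reviewed


# Hypothetical feedback noted while reviewing matched items.
feedback = ["Match", "Not a match", "Match", "Not a match", "Not a match"]
share = false_positive_share(feedback)
print(f"{share:.0%} of reviewed items were false positives")  # prints 60%

# Hypothetical cut-off; choose one that suits your organization.
NEEDS_TUNING_ABOVE = 0.25
needs_tuning = share > NEEDS_TUNING_ABOVE
```

The result describes only the items you reviewed, so a small or unrepresentative sample can give a misleading figure.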

Trainable classifiers

Retraining custom classifiers is no longer supported. If you need to improve the accuracy of the trainable classifiers you created, remove the classifier and begin fresh with larger sample sets. For more information, see Get started with trainable classifiers.

Sensitive information types

Increase the number of instances of a sensitive information type that must be found to determine severity. It's okay to use different thresholds for individual classifiers.
Understand confidence levels and how they're defined. Try using a low confidence level with a high instance count, or a higher confidence level with a low instance count.
Clone and modify the built-in SITs to include other conditions, such as the presence of keywords, more stringent value matching, or stronger formatting requirements.
Modify a custom SIT to exclude known prefixes, suffixes, or patterns. For example, a custom SIT to detect phone numbers might trigger for every email if your email signatures or document headers include phone numbers. Excluding your organization's phone number sequences from your custom SIT can prevent the rule from triggering for every email or document.
Include more dictionary-based SITs as conditions to narrow down the matches to those items that discuss the relevant subject matter. For example, a rule for matching patient diagnostics may be enhanced by requiring the presence of words like diagnostic, diagnosis, condition, symptom, and patient.
For named-entity SITs, like All Full Names, it's best to set a higher instance count threshold, like 10 or 50. Combining a named-entity SIT with another SIT also helps: if person names and Social Security Numbers (SSNs) are detected together, it's more likely that the SSNs are truly SSNs, which reduces the risk that the policy doesn't trigger because too few SSNs are detected.
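Several of the options above can be pictured with a small Python sketch: excluding known values, treating nearby keywords as supporting evidence, and combining a confidence level with an instance count. This is an illustrative model only, not how Microsoft Purview evaluates SITs. The regular expression, the excluded number, the keyword list, the 50-character proximity window, and the two-level confidence scale are all assumptions made for the example.

```python
import re

# Hypothetical pattern and values, for illustration only.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
ORG_NUMBERS = {"425-555-0100"}  # known numbers from signatures or headers
KEYWORDS = ("patient", "diagnosis", "symptom")
PROXIMITY = 50  # characters searched on each side of a match


def detect(text: str) -> list[tuple[str, str]]:
    """Return (value, confidence) pairs, skipping excluded values."""
    lowered = text.lower()
    found = []
    for m in PHONE.finditer(text):
        if m.group() in ORG_NUMBERS:
            continue  # exclusion: ignore the organization's own numbers
        window = lowered[max(0, m.start() - PROXIMITY): m.end() + PROXIMITY]
        # A keyword near the match counts as supporting evidence.
        confidence = "high" if any(k in window for k in KEYWORDS) else "low"
        found.append((m.group(), confidence))
    return found


def rule_matches(detections, min_confidence: str, min_count: int) -> bool:
    """Apply a confidence level together with an instance count threshold."""
    rank = {"low": 0, "high": 1}
    qualifying = [d for d in detections if rank[d[1]] >= rank[min_confidence]]
    return len(qualifying) >= min_count


text = (
    "Patient callback number: 206-555-0142. "
    "Regards, Front Desk, 425-555-0100"
)
detections = detect(text)
print(detections)
print(rule_matches(detections, "high", 1))  # high confidence, low count
print(rule_matches(detections, "low", 2))   # low confidence, higher count
```

In this sample, the signature number is excluded, so only the callback number is detected. It sits near the keyword "patient", so it satisfies the high-confidence rule with an instance count of 1 but not the low-confidence rule that requires two instances.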
