Trainable classifier not working after being published
I wrote a script that generates random employee records containing PII and exports them to Excel. I uploaded over 100 of these Excel files as seed data to create a trainable classifier. After the classifier was published, testing it returns this error: "employee_records_20241028_142530_1.xlsx" does not contain "Test_PII_A". I added a screenshot of what was used for seeding and testing. The file used for testing is exactly the same as the seeded information. Any ideas as to why this would happen?
Microsoft Purview
-
jpcapone 1,491 Reputation points
2024-10-28T19:19:41.67+00:00 I just tested one of the files individually using an SIT and that doesn't work either. So it has to have something to do with the way the files are generated?
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-10-29T08:45:40.2133333+00:00 Thanks for reaching out to Microsoft Q&A.
It looks like you're encountering an issue with your trainable classifier in Microsoft Purview after publishing it.
Here are some potential reasons for the error and steps to troubleshoot the problem:
- Data Format and Structure: Ensure that the structure of the seeded Excel files matches the expected format for the classifier. Check for any discrepancies in column names, data types, or formatting that might prevent the classifier from recognizing the PII.
- Content Variation: If the seeded data is randomly generated, ensure that the specific term "Test_PII_A" is present in the files exactly as it was defined in the classifier's training data. Even slight variations (like extra spaces, different casing, etc.) can cause the classifier to fail in recognizing the term.
- Classifier Configuration: Double-check the configuration of your classifier. Ensure that the keywords or patterns you're trying to detect are correctly defined and that the classifier is set up to look for them in the right context.
- Testing Methodology: When testing the classifier, make sure you are using the same context and method as when you seeded the data. If you used specific options or settings during the training phase, replicate those during testing.
- SIT (Sensitive Information Type): Since you mentioned testing with an SIT, ensure that the SIT is correctly configured and that it contains the same data as your seeded records. If it doesn't work with the SIT, this might indicate an issue with the classifier's ability to recognize the data.
- Re-Training the Classifier: If all else fails, consider re-training the classifier with a smaller, controlled dataset that you know works. Gradually increase the complexity of the dataset to identify any issues.
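To make the "content variation" point above concrete, here is a generic sketch (not Purview's actual matching logic) of why exact keyword matching fails on small variations in casing or whitespace, and how normalizing before matching avoids that:

```python
def exact_match(text, keyword):
    """Case- and whitespace-sensitive containment check."""
    return keyword in text

def normalized_match(text, keyword):
    """Lowercase and collapse runs of whitespace before matching."""
    canon = lambda s: " ".join(s.lower().split())
    return canon(keyword) in canon(text)
```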
Hope this helps. Do let us know if you have any further queries.
-
jpcapone 1,491 Reputation points
2024-10-29T13:54:54.8566667+00:00 - Content Variation: If the seeded data is randomly generated, ensure that the specific term "Test_PII_A" is present in the files exactly as it was defined in the classifier's training data. Even slight variations (like extra spaces, different casing, etc.) can cause the classifier to fail in recognizing the term.
Are you saying that the name of the trainable classifier needs to be present within the seeded data? The screenshot was representative of the seeded data. Shouldn't a document with the same format be identified by the trainable classifier after it is published?
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-10-30T08:32:18.24+00:00 @jpcapone I understand your concern! To clarify, the name of the trainable classifier itself doesn’t need to be present in the seeded data. Instead, what matters is that the specific terms or patterns you want the classifier to recognize (like “Test_PII_A”) are included in the seeded data exactly as defined during training.
If your document format is consistent and matches what was used for training, it should ideally be identified by the classifier after publication. However, if the classifier is still failing to recognize the data, it could be due to:
- Subtle Differences: Even small variations in the data (like extra spaces or different casing) can affect recognition.
- Classifier Configuration: Ensure that the classifier is set up to look for the correct patterns in the right context.
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-11-05T16:53:59.2633333+00:00 @jpcapone We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you have found a resolution, please do share it with the community, as it can be helpful to others. Otherwise, let us know and we will respond with more details and try to help.
-
jpcapone 1,491 Reputation points
2024-11-05T17:29:38.47+00:00 What does this mean?
- Classifier Configuration: Double-check the configuration of your classifier. Ensure that the keywords or patterns you're trying to detect are correctly defined and that the classifier is set up to look for them in the right context.
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-11-06T18:28:36.01+00:00 Keywords or Patterns: These are the specific terms or data types (like “Test_PII_A”) that your classifier is designed to detect. You need to make sure that these keywords are defined accurately in the classifier’s settings.
Correct Definition: This means that the keywords should be spelled correctly, formatted properly, and reflect exactly what you want the classifier to find. For instance, if you want to detect “Test_PII_A,” it should be entered exactly as such, without any variations.
Context: The classifier needs to be configured to look for these keywords in the right context. For example, if your classifier is set to look for “Test_PII_A” only in certain columns or sections of your data, it won’t recognize it if it appears elsewhere.
In summary, double-checking the classifier configuration involves verifying that the terms you want to detect are correctly defined and that the classifier is set to search for them in the appropriate places within your data. This step is crucial for ensuring that the classifier functions as intended.
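To illustrate the "context" point, here is a hypothetical sketch: a detector configured to scan only certain columns will miss a keyword that appears elsewhere. The column names and scan logic are illustrative only, not Purview's implementation:

```python
# Hypothetical illustration: the keyword is found only when the configured
# columns actually contain it, even if it exists elsewhere in the record.
def scan_record(record, keyword, columns):
    """Return True if `keyword` appears in any of the configured `columns`."""
    return any(keyword in str(record.get(col, "")) for col in columns)
```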
-
jpcapone 1,491 Reputation points
2024-11-11T15:31:45.4833333+00:00 How do I do this "You need to make sure that these keywords are defined accurately in the classifier’s settings"?
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-11-11T19:03:56.5666667+00:00 @jpcapone To effectively define keywords in Microsoft Purview classifier settings, follow these structured guidelines:
- Understanding Classifier Types
- Sensitive Information Types (SITs): These are pattern-based classifiers that detect specific types of sensitive information (e.g., social security numbers, credit card numbers).
- Trainable Classifiers: These classifiers learn from examples you provide, identifying content based on patterns rather than specific keywords.
- Creating Custom Sensitive Information Types (SITs)
Use Keyword Dictionaries:
Create keyword dictionaries to manage keywords efficiently. These can support up to 1 MB of terms and can be used in custom SITs.
Steps to Create a Keyword Dictionary:
- Compile keywords in a text file, ensuring each keyword is on a separate line.
- Save the file with Unicode encoding.
- Use PowerShell to create the dictionary:
$fileData = [System.IO.File]::ReadAllBytes('<filename>')
New-DlpKeywordDictionary -Name <name> -Description <description> -FileData $fileData
- Defining Patterns for Classifiers
- Add Primary Elements: When creating a custom SIT, specify a primary element using the keyword dictionary.
- Character Proximity: Define how close supporting elements must be to the primary element for detection. This can enhance accuracy.
- Supporting Elements: Include additional checks or keywords that must be present alongside the primary keyword to increase detection confidence.
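The character-proximity idea above can be sketched generically: a match counts only if a supporting element occurs within a window of characters around the primary element. This illustrates the concept only; it is not Purview's detection engine:

```python
# Generic proximity check: the primary element matches only when at least one
# supporting element appears within `window` characters of it.
def proximity_match(text, primary, supporting, window=300):
    start = text.find(primary)
    while start != -1:
        lo = max(0, start - window)
        hi = start + len(primary) + window
        if any(s in text[lo:hi] for s in supporting):
            return True
        start = text.find(primary, start + 1)
    return False
```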
- Testing and Feedback Mechanisms
- Test Your Classifier: Upload sample files to test the classifier's effectiveness. Use the testing results to refine your keywords and patterns.
- Match/Not a Match Feedback: Utilize the feedback mechanism to indicate whether the classifier correctly identified sensitive information. This feedback can help tune the classifier for better accuracy.
- Considerations for Language and Character Sets
- Double-Byte Character Support: If your keywords include double-byte characters (e.g., Chinese, Japanese), create separate keyword lists for these languages and for English.
- Regex Patterns: When defining regex patterns, ensure they are correctly formatted and escape any special characters as needed.
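The escaping note above matters whenever a literal keyword contains regex metacharacters. A small sketch (the keyword is a made-up example): unescaped, the parentheses and dot are treated as regex syntax and match the wrong thing; escaped, the pattern matches the literal string only:

```python
import re

# Hypothetical keyword containing regex metacharacters: "(", ")", and "."
keyword = "Test_PII_A (v1.2)"
unescaped = re.compile(keyword)           # parens become a group, "." matches any char
escaped = re.compile(re.escape(keyword))  # matches the literal string only
```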
- Refining Classifier Accuracy
- Adjust Confidence Levels: Set appropriate confidence levels for your patterns. High confidence levels reduce false positives but may miss some matches, while lower levels may catch more but increase false positives.
- Review and Modify Existing SITs: Clone and modify built-in SITs to include additional conditions or keywords that are relevant to your organization’s needs.
-
jpcapone 1,491 Reputation points
2024-11-11T22:20:28.5133333+00:00 Ok so is the recommendation to create keyword dictionaries instead of using trainable classifiers?
-
phemanth 11,295 Reputation points • Microsoft Vendor
2024-11-12T17:01:18.6733333+00:00 Use Keyword Dictionaries if:
- You have a limited and well-defined set of keywords.
- The data structure is consistent and predictable.
- You need quick and straightforward implementation.
Use Trainable Classifiers if:
- Your data is diverse and may change over time.
- You require a more nuanced understanding of content beyond exact matches.
- You have the resources to train and refine classifiers over time.
In summary, the choice between keyword dictionaries and trainable classifiers should be based on the complexity of your data and your specific detection needs. If your situation allows, consider using both methods in conjunction to leverage their respective strengths.