Create an EDM SIT sample file (New experience)
Creating an exact data match (EDM) based sensitive information type (SIT) and making it available is a multi-phase process. EDM SITs can be used in Microsoft Purview data loss prevention policies, eDiscovery, and certain content governance tasks.
Tip
If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview trials hub. Learn details about signing up and trial terms.
Applies to
- New experience
If you want to create an EDM SIT using the classic experience, see Create EDM SIT classic experience.
Before you begin
- Make sure you've completed the steps in Export source data for exact data match based sensitive information type.
Formatting the sample file
The system extracts the column names from the sample file to create the schema and recommends base SITs to map the sample field data to. The sample file must be formatted identically to your source sensitive information table file and should contain synthetic values that are representative of your actual data. You can save the file in .csv (comma-separated values), .tsv (tab-separated values), or pipe-separated (|) format, but it should use the same format as your actual source sensitive information table file. The .tsv format is recommended when your data values include commas, such as street addresses.
- Use about 10-20 rows of data to ensure that the system has enough samples to work with.
- Field values that contain commas must be enclosed in double quotes (").
- The first row must be the header row and contain column names.
- The file must contain at least one row of data.
- Each row of data must contain the correct number of fields, corresponding to the headers.
- The sample file can contain up to 32 columns.
- The sample file can't exceed 2.5 MB in size.
- Column (field) names must start with a letter, be at least three characters long, and consist only of alphanumeric characters (A-Z, a-z, 0-9). They can't include spaces, underscores, or other special characters.
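The formatting rules above can be checked locally before you upload the file. The following is a minimal sketch, not the validation that Microsoft Purview itself performs: the function name `validate_sample` is hypothetical, and the size limit assumes that 1 MB means 1024 × 1024 bytes.

```python
import csv
import io
import re

MAX_COLUMNS = 32
# Assumption: 2.5 MB is interpreted as 2.5 * 1024 * 1024 bytes.
MAX_BYTES = int(2.5 * 1024 * 1024)
# Starts with a letter, at least three characters, alphanumeric only.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{2,}")


def validate_sample(text, delimiter=","):
    """Return a list of problems found; an empty list means no rule was violated."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("file exceeds 2.5 MB")

    # csv.reader handles quoted fields, so "1 Main St, Apt 2" stays one field.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return problems + ["missing header row"]

    header, data = rows[0], rows[1:]
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns, maximum is {MAX_COLUMNS}")
    for name in header:
        if not COLUMN_NAME.fullmatch(name):
            problems.append(f"invalid column name: {name!r}")
    if not data:
        problems.append("no data rows")
    for index, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(
                f"data row {index} has {len(row)} fields, expected {len(header)}"
            )
    return problems
```

Pass `delimiter="\t"` for a .tsv file or `delimiter="|"` for a pipe-separated file. The sketch doesn't check the recommended 10-20 row count, because that item is guidance rather than a hard limit.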
For example, if your actual data uses tab delimited (.tsv) format and looks like this:
Then your sample file must have the same column headers, but use synthetic values for the rows, like this:
FirstName | LastName | PatientNumber | CreditCardNumber |
---|---|---|---|
Eric | Solomon | 987-65-4321 | 9000000000000000 |
Lisa | Taylor | 123-45-6789 | 500000000000000 |
Andre | Lawson | 234-56-7890 | 200000000000000 |
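As an illustration, the synthetic rows in the table above could be written out in tab-delimited form with a short script. This is a sketch only: the function name `render_sample` is hypothetical, and you would save the returned text with whatever file name and encoding match your source table file.

```python
import csv
import io

# Header and synthetic rows copied from the table above.
HEADER = ["FirstName", "LastName", "PatientNumber", "CreditCardNumber"]
ROWS = [
    ["Eric", "Solomon", "987-65-4321", "9000000000000000"],
    ["Lisa", "Taylor", "123-45-6789", "500000000000000"],
    ["Andre", "Lawson", "234-56-7890", "200000000000000"],
]


def render_sample(header, rows, delimiter="\t"):
    """Return the sample data as delimited text, header row first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_sample(HEADER, ROWS), end="")
```

Using the `csv` module instead of joining strings by hand means that any value containing the delimiter is quoted automatically.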
How to use the sample file templates
If you're in the U.S. Healthcare, U.S. Financial Services, or U.S. Insurance industry verticals, you can start with the following sample file templates to speed up the sample file creation process. These files contain the most commonly used column headers across the respective industries as well as synthetic values in the fields.
To use these templates:
- Download the sample file template for your industry.
- Compare the column headers in the template to your actual source data and pick the ones you want to use as primary fields in your customized sample file.
- Compare the formatting of your actual source data with the formatting of the synthetic values. Change the formatting of the synthetic values to match the formatting of your source data values.
- Save your customized sample file to use when you create the EDM SIT schema and rule package.
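The column-selection step above could be scripted along these lines. This is a rough sketch under stated assumptions: `select_columns` is a hypothetical helper, the template is assumed to be delimited text with a header row, and the column names in the usage note are invented for illustration rather than taken from the actual templates.

```python
import csv
import io


def select_columns(template_text, wanted, delimiter=","):
    """Return a copy of the template that keeps only the columns in `wanted`."""
    reader = csv.DictReader(io.StringIO(template_text, newline=""), delimiter=delimiter)
    available = reader.fieldnames or []
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"columns not in template: {missing}")

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=wanted,
        delimiter=delimiter,
        extrasaction="ignore",  # drop template columns that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return buffer.getvalue()
```

For example, given a template whose header is `FirstName,LastName,PolicyNumber`, calling `select_columns(template_text, ["LastName", "PolicyNumber"])` returns the same rows with only those two columns. Adjusting the synthetic values to match the formatting of your source data remains a manual step.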
Tip
When working in the new experience, you have the option to upload a sample file or enter the sample file values manually. We recommend creating the sample file.
Next step
- For new experience: Create EDM SIT schema and rule package