Create the schema for exact data match based sensitive information types
Tip
If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview trials hub. Learn details about signing up and trial terms.
Applies to
- Classic exact data match (EDM) sensitive information type (SIT) creation experience.
Use the exact data match schema and sensitive information type pattern tool
If you aren't familiar with EDM-based SITs or their implementation, you should familiarize yourself with:
- Learn about sensitive information types
- Learn about exact data match based sensitive information types
- Get started with exact data match based sensitive information types
A single EDM schema can be used in multiple sensitive information types that use the same sensitive data table. You can create up to 10 different EDM schemas in a Microsoft 365 tenant.
Use the Exact Data Match Schema and Sensitive Information Type Tool
You can use this tool to help simplify the schema file creation process.
Prerequisites
- Perform the steps in Export source data for exact data match based sensitive information type.
Use the exact data match schema and sensitive information type pattern tool
Select the appropriate tab for the portal you're using. Depending on your Microsoft 365 plan, the Microsoft Purview compliance portal is retired or will be retired soon.
To learn more about the Microsoft Purview portal, see Microsoft Purview portal. To learn more about the Compliance portal, see Microsoft Purview compliance portal.
Sign in to the Microsoft Purview portal > Information Protection > Classifiers > EDM classifiers > EDM schemas (available when the New EDM experience is toggled to Off).
Choose Create EDM schema to open the schema tool configuration flyout.
Fill in an appropriate Name and Description.
Choose Ignore delimiters and punctuation for all schema fields if you want to apply the Ignore... behavior for the entire schema. For more information about configuring EDM to ignore case or delimiters, see Using the caseInsensitive and ignoredDelimiters fields.
Fill in your desired values for your Schema field #1 and add more fields as needed. Each schema field name must be identical to the corresponding column header in your sensitive information source file.
If you want, set the per-field values for the following:
- Field is searchable
- Field is case-insensitive
- Choose delimiters and punctuation to ignore for this field
- Enter custom delimiters and punctuation for this field
Important
At least one, but no more than ten, of your schema fields must be designated as searchable.
Choose Save. Your schema is now listed and available for use.
Important
If you want to remove a schema that is already associated with an EDM SIT, you must first delete the EDM SIT. Deleting a schema that has a data store associated with it also deletes the data store within 24 hours.
Exporting the EDM schema file in XML format
If you created the EDM schema in the EDM schema tool, you must export the schema file in XML format. You'll need the XML file to complete the Hash and upload the sensitive information source table for exact data match sensitive information types phase.
To export the EDM schema file, use this syntax:
$Schema = Get-DlpEdmSchema -Identity "[your EDM Schema name]"
Set-Content -Path ".\Schemafile.xml" -Value $Schema.EdmSchemaXML
Save this file for later use.
Create and upload the exact data match schema file manually
As you create your schema file, your column headers (data fields) must adhere to the following naming requirements:
- Must start with a letter and must consist of at least three alphanumeric characters.
- Must include only alphanumeric characters.
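The naming rules above can be checked before you write the schema. The following is a minimal sketch and not part of the EDM tooling: the function name Test-EdmFieldName and the sample headers are hypothetical, and the regular expression is one reading of the two rules (a leading letter, alphanumeric characters only, at least three characters in total).

```powershell
# Hypothetical helper: returns $true when a column header satisfies the
# naming rules listed above (assumed reading: a letter first, then at
# least two more letters or digits, and nothing else).
function Test-EdmFieldName {
    param([string]$Name)
    return $Name -cmatch '^[A-Za-z][A-Za-z0-9]{2,}$'
}

# Sample headers for illustration only.
$headers = 'PatientID', 'MRN', 'Last_Name', '1stVisit', 'ID'
foreach ($h in $headers) {
    '{0,-10} {1}' -f $h, (Test-EdmFieldName $h)
}
```

In this sample, Last_Name (underscore), 1stVisit (leading digit), and ID (too short) are reported as not meeting the rules.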
Use the following syntax for each column/data field:
<Field name="FieldName" searchable="true/false" caseInsensitive="true/false" ignoredDelimiters="delimiter characters" />
Using the caseInsensitive and ignoredDelimiters fields
The schema XML sample that follows makes use of the caseInsensitive and the ignoredDelimiters fields.
When you include the caseInsensitive field set to the value of true in your schema definition, EDM won't exclude an item based on case differences. For example, EDM sees the values FOO-1234 and fOo-1234 as being identical for the PatientID field.
When you include the ignoredDelimiters field with supported characters, EDM ignores those characters. So EDM sees the values FOO-1234 and FOO#1234 as being identical for the PatientID field.
In this example, where both caseInsensitive and ignoredDelimiters are used, EDM sees FOO-1234 and fOo#1234 as identical and classifies the item as a patient record sensitive information type.
Both of these parameters are set on a per-field basis.
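The effect of the two attributes can be pictured as a normalization applied to both values before they are compared. The sketch below only illustrates that observable behavior and is not EDM's actual implementation; ConvertTo-ComparableValue is a hypothetical helper, and the delimiter list mirrors the PatientID field in the schema sample.

```powershell
# Hypothetical helper that mimics the described matching behavior:
# strip the ignored delimiters, then fold case when requested.
function ConvertTo-ComparableValue {
    param(
        [string]$Value,
        [string[]]$IgnoredDelimiters = @(),
        [switch]$CaseInsensitive
    )
    foreach ($d in $IgnoredDelimiters) {
        # String.Replace treats the delimiter literally (no regex escaping needed).
        $Value = $Value.Replace($d, '')
    }
    if ($CaseInsensitive) { $Value = $Value.ToLowerInvariant() }
    return $Value
}

$delims = '-', '/', '*', '#', '^'
$a = ConvertTo-ComparableValue 'FOO-1234' -IgnoredDelimiters $delims -CaseInsensitive
$b = ConvertTo-ComparableValue 'fOo#1234' -IgnoredDelimiters $delims -CaseInsensitive
$a -ceq $b   # both values normalize to 'foo1234'
```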
Important
If you configure spaces to be ignored, the setting takes effect only for primary field columns that are associated with a sensitive information type capable of detecting multi-word strings. Otherwise, the comparison is made against each individual word in the content being analyzed.
The ignoredDelimiters flag supports any nonalphanumeric character. Here are some examples:
- .
- -
- /
- _
- *
- ^
- #
- !
- ?
- [
- ]
- {
- }
- \
- ~
- ;
The ignoredDelimiters flag doesn't support:
- characters 0-9
- A-Z
- a-z
- "
- ,
Important
When you define your EDM sensitive information type, ignoredDelimiters doesn't affect how the classification sensitive information type associated with the primary element in an EDM pattern identifies content in an item. So, if you configure ignoredDelimiters for a searchable field, make sure that the sensitive information type used for a primary element based on that field detects strings both with and without those characters present.
The number of columns in your sensitive information source table and the number of fields in your schema must match. The order doesn't matter.
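One way to check this before uploading is to compare the header row of the source table with the Field names in the schema. This is a minimal sketch, assuming a comma-separated header line and a schema shaped like the sample in this article; the inline data is hypothetical and stands in for your own files, and the case-sensitive comparison is an assumption based on the requirement that names be identical.

```powershell
# Hypothetical inputs: in practice, read the first line of your source file
# and load your schema file instead of using these inline values.
$headerLine = 'SSN,PatientID,LastName'
[xml]$schema = @'
<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
  <DataStore name="PatientRecords" description="Demo" version="1">
    <Field name="PatientID" searchable="true" />
    <Field name="LastName" />
    <Field name="SSN" searchable="true" />
  </DataStore>
</EdmSchema>
'@

$columns = $headerLine -split ','
$fields  = @($schema.EdmSchema.DataStore.Field | ForEach-Object { $_.GetAttribute('name') })

# Compare-Object ignores order; any output means a name exists on one side only.
$diff = Compare-Object -ReferenceObject $columns -DifferenceObject $fields -CaseSensitive
if ($diff) { $diff } else { 'Columns and schema fields match.' }
```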
The characters that are used as token separators behave differently than the other delimiters. Here are some examples:
- \ (space)
- \t
- ,
- .
- ;
- ?
- !
- \r
- \n
When a value includes a token separator, EDM breaks the value into tokens at the separator. For example, EDM splits the value Middle-Last Name into Middle-Last and Name for the LastName field. If ignoredDelimiters is set for the LastName field with the character '-', the delimiter is removed only after the value is split. In the end, EDM sees the values MiddleLast and Name.
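The order of operations described above can be sketched as follows. This is an illustration only, not EDM's implementation; the regular expression simply lists the token separators named in this section.

```powershell
# Illustration only: mimic the order of operations described above.
$value = 'Middle-Last Name'

# 1. Break the value at token separators (space, tab, comma, period,
#    semicolon, question mark, exclamation mark, CR, LF).
$tokens = $value -split '[ \t,.;?!\r\n]+' | Where-Object { $_ }

# 2. Only then remove the ignored delimiter ('-') from each token.
$tokens = $tokens | ForEach-Object { $_.Replace('-', '') }

$tokens   # MiddleLast, Name
```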
To use the following characters as ignoredDelimiters and not token separators, a SIT that matches the corresponding format needs to be associated with the field. For example, a SIT that detects a multi-word string with dashes in it needs to be associated with the LastName field.
- .
- ;
- !
- ?
- \
It's possible to associate SITs with secondary elements using PowerShell.
Define the schema in XML format (similar to the following example). Name this schema file edm.xml and configure it so that, for each column in the sensitive information source table, there's a line that uses the syntax <Field name="" searchable=""/>.
- Use column names for Field name values.
- Use searchable="true" for the fields that you want to be searchable and primary fields, up to a maximum of five fields. At least one field must be searchable.
As an example, the following XML file defines the schema for a patient records database, with five fields specified as searchable: PatientID, MRN, SSN, Phone, and DOB. (You can copy, modify, and use our example.)
<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
  <DataStore name="PatientRecords" description="Schema for patient records" version="1">
    <Field name="PatientID" searchable="true" caseInsensitive="true" ignoredDelimiters="-,/,*,#,^" />
    <Field name="MRN" searchable="true" />
    <Field name="FirstName" />
    <Field name="LastName" />
    <Field name="SSN" searchable="true" />
    <Field name="Phone" searchable="true" />
    <Field name="DOB" searchable="true" />
    <Field name="Gender" />
    <Field name="Address" />
  </DataStore>
</EdmSchema>
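If your source table has many columns, a schema file of this shape can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the column list, the searchable subset, and the data store name are placeholders, and the script only emits name and searchable attributes.

```powershell
# Hypothetical inputs: column names from the source table and the subset
# that should be searchable. The data store name is also a placeholder.
$columns    = 'PatientID', 'MRN', 'FirstName', 'LastName', 'SSN'
$searchable = 'PatientID', 'MRN', 'SSN'

$fieldLines = foreach ($c in $columns) {
    if ($searchable -contains $c) {
        '    <Field name="{0}" searchable="true" />' -f $c
    } else {
        '    <Field name="{0}" />' -f $c
    }
}

$xml = @(
    '<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">'
    '  <DataStore name="PatientRecords" description="Schema for patient records" version="1">'
    $fieldLines
    '  </DataStore>'
    '</EdmSchema>'
) -join "`r`n"

# Parse the result to confirm it is well-formed before saving it.
$null = [xml]$xml
Set-Content -Path '.\edm.xml' -Value $xml
```

The [xml] cast throws if the generated text isn't well-formed, so a malformed schema is caught before the file is written.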
Once you have created the EDM schema file in XML format, you have to upload it to the cloud service.
To upload the database schema, run the following command:
New-DlpEdmSchema -FileData ([System.IO.File]::ReadAllBytes('.\edm.xml')) -Confirm:$true
You'll be prompted to confirm, as follows:
Confirm
Are you sure you want to perform this action?
New EDM Schema for the data store 'patientrecords' will be imported.
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"):
Tip
If you want your changes to occur without confirmation, don't use -Confirm:$true in the New-DlpEdmSchema command.
Note
It can take between 10 and 60 minutes to update the EDM schema with additions. The update must complete before you execute steps that use the additions.