What is Personally Identifiable Information (PII) detection in Azure AI Language?
PII detection is one of the features offered by Azure AI Language, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, such as phone numbers, email addresses, and forms of identification. Azure AI Language supports general text PII redaction, as well as Conversational PII, a specialized model for handling speech transcriptions and the more informal, conversational tone of meeting and call transcripts. The service also supports Native Document PII redaction, where the input and output are structured document files.
What's new
The Text PII and Conversational PII detection preview API (version 2024-11-15-preview) now supports masking detected sensitive entities with an entity label instead of only redaction characters. Customers can specify whether personally identifiable information such as names and phone numbers, for example “John Doe received a call from 424-878-9192”, is masked with a redaction character, as in “******** received a call from ************”, or with an entity label, as in “[PERSON_1] received a call from [PHONENUMBER_1]”. For details on how to specify the redaction policy style for your outputs, see the how-to guides.
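As a rough illustration of the two masking styles, the following Python sketch applies them locally to a list of entity spans. It does not call the service and is not the service's implementation: the function name `mask_entities`, the `style` values, the entity dictionary shape (`offset`, `length`, `category`), and the label numbering scheme are all assumptions made for this example.

```python
from collections import defaultdict


def mask_entities(text, entities, style="character", mask_char="*"):
    """Mask entity spans in `text`.

    style="character": replace each span with mask_char repeated.
    style="label": replace each span with a numbered label such as [PERSON_1].
    The numbering rule (one number per distinct value within a category)
    is an assumption for illustration only.
    """
    labels = {}                    # (category, surface text) -> label
    counters = defaultdict(int)    # category -> last number handed out
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["offset"]):
        start = ent["offset"]
        end = start + ent["length"]
        pieces.append(text[cursor:start])      # untouched text before the span
        if style == "character":
            pieces.append(mask_char * ent["length"])
        elif style == "label":
            key = (ent["category"], text[start:end])
            if key not in labels:
                counters[ent["category"]] += 1
                number = counters[ent["category"]]
                labels[key] = f"[{ent['category'].upper()}_{number}]"
            pieces.append(labels[key])
        else:
            raise ValueError(f"unknown style: {style!r}")
        cursor = end
    pieces.append(text[cursor:])               # remaining tail
    return "".join(pieces)


text = "John Doe received a call from 424-878-9192"
phone = "424-878-9192"
# Hypothetical entity spans, shaped like offset/length/category records.
entities = [
    {"offset": 0, "length": len("John Doe"), "category": "Person"},
    {"offset": text.index(phone), "length": len(phone), "category": "PhoneNumber"},
]

print(mask_entities(text, entities, style="character"))
print(mask_entities(text, entities, style="label"))
```

The character style preserves the length of the original text, which keeps offsets stable for downstream tools, while the label style keeps the sentence readable and shows what kind of information was removed.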
The Conversational PII detection models (both version 2024-11-01-preview and GA) have been updated to provide enhanced AI quality and accuracy. The numeric identifier entity type now also includes Drivers License and Medicare Beneficiary Identifier.
As of June 2024, we provide General Availability support for the Conversational PII service (English-language only). Customers can now redact transcripts, chats, and other text written in a conversational style (for example, text with “um”s, “ah”s, multiple speakers, and words spelled out for clarity) with greater confidence in AI quality, backed by Azure SLA support, production environment support, and enterprise-grade security.
Tip
Try out PII detection in AI Foundry portal, where you can use an existing Language Studio resource or create a new AI Foundry resource.
- Quickstarts are getting-started instructions to guide you through making requests to the service.
- How-to guides contain instructions for using the service in more specific or customized ways.
- The conceptual articles provide in-depth explanations of the service's functionality and features.
Typical workflow
To use this feature, you submit data for analysis and handle the API output in your application. Analysis is performed as-is, with no added customization to the model used on your data.
Create an Azure AI Language resource, which grants you access to the features offered by Azure AI Language. It generates a password (called a key) and an endpoint URL that you use to authenticate API requests.
Create a request using either the REST API or a client library for C#, Java, JavaScript, or Python. You can also send asynchronous calls with a batch request to combine API requests for multiple features into a single call.
Send the request containing your text data. Your key and endpoint are used for authentication.
Stream or store the response locally.
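The steps above can be sketched in Python using only the standard library. This is a minimal sketch that builds the request without sending it. The URL path, the `api-version` value, the body fields, and the `Ocp-Apim-Subscription-Key` header are assumptions about the REST API shape and should be confirmed against the quickstart and REST reference; `build_pii_request`, the placeholder endpoint, and the placeholder key are hypothetical.

```python
import json
import urllib.request

# Assumption: a generally available API version; check the REST reference.
API_VERSION = "2022-05-01"


def build_pii_request(endpoint, key, texts, language="en"):
    """Build (but do not send) a PII detection request for a list of texts."""
    url = f"{endpoint.rstrip('/')}/language/:analyze-text?api-version={API_VERSION}"
    body = {
        "kind": "PiiEntityRecognition",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [
                {"id": str(i), "language": language, "text": t}
                for i, t in enumerate(texts, start=1)
            ]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The resource key authenticates the call.
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


# Placeholder endpoint and key; replace with the values from your resource.
request = build_pii_request(
    "https://example-resource.cognitiveservices.azure.com",
    "your-key-here",
    ["John Doe received a call from 424-878-9192"],
)
print(request.get_method(), request.full_url)

# Sending and storing the response needs a real resource, for example:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
# with open("pii_result.json", "w", encoding="utf-8") as f:
#     json.dump(result, f, indent=2)
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any key is used.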
Native document support
A native document refers to the file format used to create the original document, such as Microsoft Word (docx) or Portable Document Format (pdf). Native document support eliminates the need for text preprocessing prior to using Azure AI Language resource capabilities. Currently, native document support is available for the PiiEntityRecognition capability.
Currently, PII detection supports the following native document formats:
File type | File extension | Description |
---|---|---|
Text | .txt | An unformatted text document. |
Adobe PDF | .pdf | A Portable Document Format document. |
Microsoft Word | .docx | A Microsoft Word document file. |
For more information, see Use native documents for language processing.
Get started with PII detection
To use PII detection, you submit text for analysis and handle the API output in your application. Analysis is performed as-is, with no customization to the model used on your data. There are two ways to use PII detection:
Development option | Description |
---|---|
Language Studio | Language Studio is a web-based platform that lets you try PII detection with text examples without an Azure account, and with your own data when you sign up. For more information, see the Language Studio website or Language Studio quickstart. |
REST API or Client library (Azure SDK) | Integrate PII detection into your applications using the REST API, or the client library available in various languages. For more information, see the PII detection quickstart. |
Reference documentation and code samples
As you use this feature in your applications, see the following reference documentation and samples for Azure AI Language:
Development option / language | Reference documentation | Samples |
---|---|---|
REST API | REST API documentation | |
C# | C# documentation | C# samples |
Java | Java documentation | Java samples |
JavaScript | JavaScript documentation | JavaScript samples |
Python | Python documentation | Python samples |
Responsible AI
An AI system includes not only the technology, but also the people who use it, the people affected by it, and the deployment environment. Read the transparency note for PII to learn about responsible AI use and deployment in your systems.
Example scenarios
- Apply sensitivity labels - For example, based on the results from the PII service, a public sensitivity label might be applied to documents where no PII entities are detected. For documents where US addresses and phone numbers are recognized, a confidential label might be applied. A highly confidential label might be used for documents where bank routing numbers are recognized.
- Redact some categories of personal information from documents that get wider circulation - For example, if customer contact records are accessible to frontline support representatives, the company can redact the customer's personal information other than their name from the version of the customer history that representatives see, to preserve the customer's privacy.
- Redact personal information in order to reduce unconscious bias - For example, during a company's resume review process, the company can hide names, addresses, and phone numbers to help reduce unconscious gender or other biases.
- Replace personal information in source data for machine learning to reduce unfairness - For example, if you want to remove names that might reveal gender when training a machine learning model, you could use the service to identify them and replace them with generic placeholders for model training.
- Remove personal information from call center transcription - For example, if you want to remove names or other PII exchanged between the agent and the customer in a call center scenario, you could use the service to identify and remove them.
- Data cleaning for data science - PII detection can be used to prepare data so that data scientists and engineers can use it to train their machine learning models, redacting it to make sure that customer data isn't exposed.
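The first scenario, applying sensitivity labels, might be sketched as follows. This is an illustration under stated assumptions, not a prescribed policy: the entity dictionaries are assumed to carry a `category` field, the category names `Address`, `PhoneNumber`, and `ABARoutingNumber` should be checked against the service's published entity category list, and the rule that any unlisted category maps to the confidential label is a choice made for this example.

```python
# Labels ordered from least to most sensitive.
LABELS = ["Public", "Confidential", "Highly Confidential"]

# Assumed category names mapped to an index into LABELS.
CATEGORY_LEVELS = {
    "Address": 1,
    "PhoneNumber": 1,
    "ABARoutingNumber": 2,
}

# Assumption: any detected category not listed above is still confidential.
DEFAULT_LEVEL = 1


def sensitivity_label(entities):
    """Return the strictest label implied by the detected entities."""
    level = 0  # no entities detected -> "Public"
    for entity in entities:
        level = max(level, CATEGORY_LEVELS.get(entity["category"], DEFAULT_LEVEL))
    return LABELS[level]


# Hypothetical detection results for three documents.
print(sensitivity_label([]))
print(sensitivity_label([{"category": "PhoneNumber"}]))
print(sensitivity_label([{"category": "Address"}, {"category": "ABARoutingNumber"}]))
```

Taking the maximum level means a single high-risk entity is enough to raise the label for the whole document, which errs on the side of caution.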
Next steps
There are two ways to get started using the PII detection feature:
- Language Studio, which is a web-based platform that enables you to try several Language service features without needing to write code.
- The quickstart article for instructions on making requests to the service using the REST API and client library SDK.