Native document support for Azure AI Language (preview)

Important

  • Azure AI Language public preview releases provide early access to features that are in active development.
  • Features, approaches, and processes can change before General Availability (GA) based on user feedback.

Azure AI Language is a cloud-based service that applies Natural Language Processing (NLP) features to text-based data. The native document support capability lets you send API requests asynchronously: you send your data in the body of an HTTP POST request, then use an HTTP GET request with a query string to retrieve the status results. Your processed documents are written to your Azure Blob Storage target container.
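
The submit-then-check pattern described above can be sketched as a small polling loop. This is a minimal illustration, not the service's client code: `get_status` is a hypothetical callable standing in for the HTTP GET request, and the terminal status names are assumptions, because this section doesn't list them.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; this section does not list the real values.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status() until it reports a terminal status or attempts run out.

    get_status is a hypothetical stand-in for the HTTP GET status request.
    """
    for attempt in range(max_attempts):
        result = get_status()
        if str(result.get("status", "")).lower() in TERMINAL_STATUSES:
            return result
        # Wait before the next check, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job not finished after {max_attempts} status checks")
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; in practice you would leave the default in place and supply a `get_status` function that issues the GET request.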

A native document refers to the file format used to create the original document, such as Microsoft Word (.docx) or Portable Document Format (.pdf). Native document support eliminates the need for text preprocessing before using Azure AI Language resource capabilities. Currently, native document support is available for the following capabilities:

  • Personally Identifiable Information (PII). The PII detection feature can identify, categorize, and redact sensitive information in unstructured text. The PiiEntityRecognition API supports native document processing.

  • Document summarization. Document summarization uses natural language processing to generate extractive (salient sentence extraction) or abstractive (contextual word extraction) summaries for documents. Both AbstractiveSummarization and ExtractiveSummarization APIs support native document processing.

Supported document formats

Applications use native file formats to create, save, or open native documents. Currently, the PII and document summarization capabilities support the following native document formats:

File type        File extension   Description
Text             .txt             An unformatted text document.
Adobe PDF        .pdf             A Portable Document Format (PDF) document.
Microsoft Word   .docx            A Microsoft Word document file.

Input guidelines

Supported file formats

Type                 Support and limitations
PDFs                 Fully scanned PDFs aren't supported.
Text within images   Digital images with embedded text aren't supported.
Digital tables       Tables in scanned documents aren't supported.

Document size

Attribute                               Input limit
Total number of documents per request   ≤ 20
Total content size per request          ≤ 10 MB
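
As an illustration, the limits and file types above could be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: `check_batch` is a hypothetical helper, it treats on-disk file size as the content size, and it assumes 1 MB means 1024 × 1024 bytes, which this section doesn't specify.

```python
from pathlib import PurePath
from typing import Dict, List

SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx"}
MAX_DOCUMENTS_PER_REQUEST = 20
# Assumption: 1 MB = 1024 * 1024 bytes; the section does not define the unit.
MAX_TOTAL_BYTES = 10 * 1024 * 1024


def check_batch(sizes_by_name: Dict[str, int]) -> List[str]:
    """Return a list of problems for a batch of {file name: size in bytes}.

    An empty list means the batch fits the documented limits.
    """
    problems: List[str] = []

    if len(sizes_by_name) > MAX_DOCUMENTS_PER_REQUEST:
        problems.append(
            f"{len(sizes_by_name)} documents exceeds the limit of "
            f"{MAX_DOCUMENTS_PER_REQUEST} per request"
        )

    total_bytes = sum(sizes_by_name.values())
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(
            f"total size {total_bytes} bytes exceeds the limit of "
            f"{MAX_TOTAL_BYTES} bytes per request"
        )

    for name in sizes_by_name:
        extension = PurePath(name).suffix.lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported file extension {extension!r}")

    return problems
```

A check like this can't detect fully scanned PDFs or text embedded in images; those limitations depend on file contents rather than on the file name or size.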

Request headers and parameters

Parameter                                     Description
-X POST <endpoint>                            Specifies your Language resource endpoint for accessing the API.
--header "Content-Type: application/json"     The content type for sending JSON data.
--header "Ocp-Apim-Subscription-Key: <key>"   Specifies the Language resource key for accessing the API.
--data                                        The JSON file containing the data you want to pass with your request.
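
To show how these parameters fit together, the following sketch assembles them into a curl argument list without sending anything. `build_curl_args` is a hypothetical helper, and the endpoint, key, and file name are placeholders. The full request URL (path and API version) isn't given in this section, so the endpoint is passed through unchanged.

```python
import shlex
from typing import List


def build_curl_args(endpoint: str, key: str, json_file: str) -> List[str]:
    """Assemble the curl arguments described in the parameter table.

    Hypothetical helper: it only builds the argument list and never runs curl.
    """
    return [
        "curl",
        "-X", "POST", endpoint,
        "--header", "Content-Type: application/json",
        "--header", f"Ocp-Apim-Subscription-Key: {key}",
        # curl reads the request body from a file when the value starts with "@".
        "--data", f"@{json_file}",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own endpoint, key, and request file.
    args = build_curl_args("<endpoint>", "<key>", "request.json")
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids quoting mistakes in the header values; `shlex.join` is used only to print a copy-pasteable form.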