Document Layout skill
Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The Document Layout skill analyzes a document to extract regions of interest and their inter-relationships to produce a syntactical representation of the document in Markdown format. This skill uses the Document Intelligence layout model provided in Azure AI Document Intelligence.
This article is the reference documentation for the Document Layout skill. For usage information, see Structure-aware chunking and vectorization.
The Document Layout skill calls the Document Intelligence public preview version 2024-07-31-preview. It's currently available only in the following Azure regions:
- East US
- West US 2
- West Europe
- North Central US
Supported file formats include:
- .PDF
- .JPEG
- .JPG
- .PNG
- .BMP
- .TIFF
- .DOCX
- .XLSX
- .PPTX
- .HTML
Note
This skill is bound to Azure AI services and requires a billable resource for transactions that exceed 20 documents per indexer per day. Execution of built-in skills is charged at the existing Azure AI services pay-as-you-go price.
@odata.type
Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill
Data limits
- For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).
- Although the file size limit for analyzing documents is 500 MB for the Azure AI Document Intelligence paid (S0) tier and 4 MB for the free (F0) tier, indexing is still subject to the indexer limits of your search service tier.
- Image dimensions must be between 50 pixels x 50 pixels and 10,000 pixels x 10,000 pixels.
- If your PDFs are password-locked, remove the lock before running the indexer.
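As a rough illustration of the limits above, the following Python sketch checks a file's size and image dimensions before it's sent for indexing. The function name `check_limits`, the tier keys, and the interpretation of 1 MB as 1,024 x 1,024 bytes are assumptions made for this example. The indexer limits of your search service tier aren't modeled here.

```python
from typing import List, Optional

# Assumption: 1 MB is treated as 1024 * 1024 bytes for this sketch.
MAX_FILE_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10_000


def check_limits(size_bytes: int, tier: str,
                 width: Optional[int] = None,
                 height: Optional[int] = None) -> List[str]:
    """Return a list of problems found; an empty list means no limit was hit.

    Hypothetical helper: it only covers the Document Intelligence limits
    listed in this article, not the indexer limits of the search service.
    """
    problems = []
    if size_bytes > MAX_FILE_BYTES[tier]:
        problems.append(f"file exceeds the {tier} size limit")
    if width is not None and height is not None:
        sides_ok = all(MIN_SIDE_PX <= side <= MAX_SIDE_PX for side in (width, height))
        if not sides_ok:
            problems.append("image dimensions are outside 50 to 10,000 pixels")
    return problems
```

A call such as `check_limits(5 * 1024 * 1024, "F0")` reports a size problem, because 5 MB is over the free tier limit used in this sketch.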
Supported languages
Refer to Azure AI Document Intelligence layout model supported languages for printed text.
Limitations
During the public preview, this skill has the following restrictions:
- The skill can't extract images embedded within documents.
- Page numbers are not included in the generated output.
- The skill isn't suitable for large documents that require more than 5 minutes of processing in the Azure AI Document Intelligence layout model. The skill times out, but charges still apply to the Azure AI services multi-service resource if it's attached to the skillset for billing purposes. Keep documents within the processing limits to avoid unnecessary costs.
Skill parameters
Parameters are case-sensitive.
Parameter name | Allowed Values | Description |
---|---|---|
outputMode | oneToMany | Controls the cardinality of the output produced by the skill. |
markdownHeaderDepth | h1, h2, h3, h4, h5, h6 (default) | The deepest header nesting level that the skill treats as a section boundary. For instance, if markdownHeaderDepth is set to "h3", any Markdown section that's deeper than h3 (that is, #### and deeper) is treated as "content" and is added to the section of its closest parent at h3 or above. |
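To make the effect of markdownHeaderDepth concrete, the following Python sketch splits a Markdown string into sections and treats headers deeper than the chosen depth as ordinary content. It only illustrates the behavior described in the table. It isn't the skill's implementation, and the function name `split_markdown` and the output shape (modeled on the sample output in this article) are assumptions.

```python
import re
from typing import Dict, List


def split_markdown(markdown: str, depth: int = 6) -> List[Dict]:
    """Toy illustration: split Markdown into sections up to header level `depth`.

    Headers deeper than `depth` stay inside the content of the enclosing section.
    """
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    current = {f"h{i}": "" for i in range(1, depth + 1)}
    sections: List[Dict] = []
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one section, if there is any text.
        text = "\n".join(buffer).strip()
        if text:
            sections.append({
                "content": text,
                "sections": dict(current),
                "ordinal_position": len(sections),
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= depth:
            flush()
            level = len(match.group(1))
            current[f"h{level}"] = match.group(2).strip()
            # A new header resets every deeper header level.
            for deeper in range(level + 1, depth + 1):
                current[f"h{deeper}"] = ""
        else:
            # Plain text, or a header deeper than `depth`, is kept as content.
            buffer.append(line)
    flush()
    return sections
```

With `depth=3`, a line such as `#### Deep` doesn't start a new section. It remains part of the content under the nearest h3 (or shallower) header.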
Skill inputs
Input name | Description |
---|---|
file_data | The file that content should be extracted from. |
The "file_data" input must be an object defined as:
{
  "$type": "file",
  "data": "BASE64 encoded string of the file"
}
Alternatively, it can be defined as:
{
  "$type": "file",
  "url": "URL to download file",
  "sasToken": "OPTIONAL: SAS token for authentication if the URL provided is for a file in blob storage"
}
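For example, a custom skill or test harness could build either form of the file reference object along these lines. This is a minimal Python sketch. The helper names `file_reference_from_bytes` and `file_reference_from_url` are hypothetical, and only the property names come from the definitions above.

```python
import base64
from typing import Optional


def file_reference_from_bytes(content: bytes) -> dict:
    """Build the inline form: the file content as a Base64-encoded string."""
    return {
        "$type": "file",
        "data": base64.b64encode(content).decode("ascii"),
    }


def file_reference_from_url(url: str, sas_token: Optional[str] = None) -> dict:
    """Build the URL form; the SAS token is included only when one is given."""
    reference = {"$type": "file", "url": url}
    if sas_token:
        reference["sasToken"] = sas_token
    return reference
```

Either dictionary can then be serialized with `json.dumps` and returned as the value that feeds the file_data input.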
The file reference object can be generated in one of the following ways:
- Setting the allowSkillsetToReadFileData parameter on your indexer definition to true. This setting creates a path, /document/file_data, that's an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Azure Blob storage.
- Having a custom skill return a JSON object that provides $type, data, or url and sasToken. The $type parameter must be set to file, and data must be the Base64-encoded byte array of the file content. The url parameter must be a valid URL with access for downloading the file at that location.
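As a sketch of the first option, the following Python snippet builds an indexer definition with allowSkillsetToReadFileData enabled and prints it as JSON. The resource names (`my-indexer`, `my-blob-datasource`, `my-index`, `my-skillset`) are placeholders, and the fragment shows only the properties relevant here rather than a complete indexer definition.

```python
import json

# Placeholder names: replace them with the resources in your own search service.
indexer_definition = {
    "name": "my-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-index",
    "skillsetName": "my-skillset",
    "parameters": {
        "configuration": {
            # Exposes the original blob as /document/file_data to the skillset.
            "allowSkillsetToReadFileData": True
        }
    },
}

print(json.dumps(indexer_definition, indent=2))
```

With this setting in place, a skill input can use /document/file_data as its source, as in the sample definition in this article.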
Skill outputs
Output name | Description |
---|---|
markdown_document | A collection of "sections" objects, which represent each individual section in the Markdown document. |
Sample definition
{
  "skills": [
    {
      "description": "Analyze a document",
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document",
          "targetName": "markdown_document"
        }
      ]
    }
  ]
}
Sample output
{
  "markdown_document": [
    {
      "content": "Hi this is Jim \r\nHi this is Joe",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": ""
      },
      "ordinal_position": 0
    },
    {
      "content": "Hi this is Lance",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": "Boo"
      },
      "ordinal_position": 1
    }
  ]
}
The value of markdownHeaderDepth controls the number of keys in the "sections" dictionary. In the example skill definition, since markdownHeaderDepth is "h3", there are three keys in the "sections" dictionary: h1, h2, and h3.
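Downstream code often needs the header hierarchy as a single string, for example as a title for each chunk. The following Python sketch shows one way to read the sample output. The helper name `header_path` and the " > " separator are arbitrary choices for this example and aren't part of the skill.

```python
def header_path(section: dict) -> str:
    """Join the non-empty headers of a section, from h1 down to the deepest key."""
    headers = section["sections"]
    ordered_keys = sorted(headers, key=lambda key: int(key[1:]))  # h1, h2, h3, ...
    return " > ".join(headers[key] for key in ordered_keys if headers[key])


# The sample output from this article, as a Python dictionary.
skill_output = {
    "markdown_document": [
        {
            "content": "Hi this is Jim \r\nHi this is Joe",
            "sections": {"h1": "Foo", "h2": "Bar", "h3": ""},
            "ordinal_position": 0,
        },
        {
            "content": "Hi this is Lance",
            "sections": {"h1": "Foo", "h2": "Bar", "h3": "Boo"},
            "ordinal_position": 1,
        },
    ]
}

# Process sections in document order.
ordered = sorted(skill_output["markdown_document"],
                 key=lambda section: section["ordinal_position"])
for section in ordered:
    print(header_path(section), "|", section["content"].replace("\r\n", " "))
```

Empty header values, such as the "h3" key of the first section, are skipped so that the path reflects only the headers that actually enclose the content.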