Entity Recognition cognitive skill (v2)

08/28/2024

The Entity Recognition skill (v2) extracts entities of different types from text. This skill uses the machine learning models provided by Text Analytics in Azure AI services.

Important

The Entity Recognition skill (v2) (Microsoft.Skills.Text.EntityRecognitionSkill) is now discontinued and has been replaced by Microsoft.Skills.Text.V3.EntityRecognitionSkill. Follow the recommendations in Deprecated skills to migrate to a supported skill.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Azure AI services resource. Charges accrue when calling APIs in Azure AI services, and for image extraction as part of the document-cracking stage in Azure AI Search. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Azure AI services pay-as-you-go price. Image extraction pricing is described on the Azure AI Search pricing page.

@odata.type

Microsoft.Skills.Text.EntityRecognitionSkill

Data limits

The maximum size of a record should be 50,000 characters as measured by String.Length. If you need to break up your data before sending it to the entity recognition skill, consider using the Text Split skill. If you use the Text Split skill, set the page length to 5000 for the best performance.
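As a rough illustration of the limits above, the following Python sketch measures a record the way .NET `String.Length` does (in UTF-16 code units) and falls back to a naive fixed-size split when the record is too long. The helper names `dotnet_string_length` and `split_into_pages` are hypothetical, and the split is only a stand-in: the actual Text Split skill may choose page boundaries differently.

```python
MAX_RECORD_CHARS = 50_000  # record limit stated above
PAGE_LENGTH = 5_000        # page length suggested above


def dotnet_string_length(text: str) -> int:
    """Length in UTF-16 code units, which is what .NET String.Length counts."""
    return len(text.encode("utf-16-le")) // 2


def split_into_pages(text: str, page_length: int = PAGE_LENGTH) -> list[str]:
    """Naive fixed-size split by Python code points (an assumption for this
    sketch; the real Text Split skill may behave differently)."""
    return [text[i:i + page_length] for i in range(0, len(text), page_length)]


record = "a" * 60_000  # placeholder text that exceeds the limit
if dotnet_string_length(record) > MAX_RECORD_CHARS:
    pages = split_into_pages(record)
else:
    pages = [record]

print(len(pages))  # 12
```

Measuring in UTF-16 code units matters for text outside the Basic Multilingual Plane (for example, most emoji), where Python's `len` and `String.Length` disagree.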

Skill parameters

Parameters are case-sensitive and are all optional.

Parameter name	Description
`categories`	Array of categories that should be extracted. Possible category types: `"Person"`, `"Location"`, `"Organization"`, `"Quantity"`, `"Datetime"`, `"URL"`, `"Email"`. If no category is provided, all types are returned.
`defaultLanguageCode`	Language code of the input text. The following languages are supported: `ar, cs, da, de, en, es, fi, fr, hu, it, ja, ko, nl, no, pl, pt-BR, pt-PT, ru, sv, tr, zh-hans`. Not all entity categories are supported for all languages; see the note under Skill outputs.
`minimumPrecision`	A value between 0 and 1. If the confidence score (in the `namedEntities` output) is lower than this value, the entity is not returned. The default is 0.
`includeTypelessEntities`	Set to `true` if you want to recognize well-known entities that don't fit the current categories. Recognized entities are returned in the `entities` complex output field. For example, "Windows 10" is a well-known entity (a product), but since "Products" is not a supported category, this entity would be included in the `entities` output field. Default is `false`.
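The skill applies `minimumPrecision` itself, so no client code is needed. Purely to illustrate the semantics described in the table, the sketch below filters a list shaped like the `namedEntities` output. The function name `apply_minimum_precision` and the second (low-confidence) entry are hypothetical and exist only for the example.

```python
def apply_minimum_precision(named_entities, minimum_precision=0.0):
    """Keep entities whose confidence is not lower than the threshold."""
    if not 0.0 <= minimum_precision <= 1.0:
        raise ValueError("minimumPrecision must be between 0 and 1")
    return [e for e in named_entities if e["confidence"] >= minimum_precision]


candidates = [
    {"category": "Person", "value": "John Smith", "offset": 35, "confidence": 0.98},
    # Hypothetical low-confidence entry, invented for this illustration.
    {"category": "Organization", "value": "Contoso", "offset": 0, "confidence": 0.40},
]

kept = apply_minimum_precision(candidates, minimum_precision=0.5)
print([e["value"] for e in kept])  # ['John Smith']
```

With the default of 0, nothing is filtered out, which matches the documented default behavior.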

Skill inputs

Input name	Description
`languageCode`	Optional. Default is `"en"`.
`text`	The text to analyze.

Skill outputs

Note

Not all entity categories are supported for all languages. The "Person", "Location", and "Organization" entity category types are supported for the full list of languages above. Only de, en, es, fr, and zh-hans support extraction of "Quantity", "Datetime", "URL", and "Email" types. For more information, see Language and region support for the Text Analytics API.
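The language rules in this note can be captured in a small lookup, which may help when deciding which `categories` to request for a given document. This is a sketch built only from the lists in this article; the function name `supported_categories` is hypothetical, and exact-match comparison of language codes is an assumption.

```python
# Languages listed for defaultLanguageCode in this article.
ALL_LANGUAGES = {
    "ar", "cs", "da", "de", "en", "es", "fi", "fr", "hu", "it", "ja",
    "ko", "nl", "no", "pl", "pt-BR", "pt-PT", "ru", "sv", "tr", "zh-hans",
}
# Languages that also support Quantity, Datetime, URL, and Email.
EXTENDED_LANGUAGES = {"de", "en", "es", "fr", "zh-hans"}

BASE_CATEGORIES = ["Person", "Location", "Organization"]
EXTENDED_CATEGORIES = ["Quantity", "Datetime", "URL", "Email"]


def supported_categories(language_code: str) -> list[str]:
    """Return the categories this article lists for a language code.

    An empty list means the language is not in the supported list.
    """
    if language_code not in ALL_LANGUAGES:
        return []
    if language_code in EXTENDED_LANGUAGES:
        return BASE_CATEGORIES + EXTENDED_CATEGORIES
    return list(BASE_CATEGORIES)


print(supported_categories("ja"))  # ['Person', 'Location', 'Organization']
print(len(supported_categories("en")))  # 7
```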

Output name	Description
`persons`	An array of strings where each string represents the name of a person.
`locations`	An array of strings where each string represents a location.
`organizations`	An array of strings where each string represents an organization.
`quantities`	An array of strings where each string represents a quantity.
`dateTimes`	An array of strings where each string represents a DateTime (as it appears in the text) value.
`urls`	An array of strings where each string represents a URL.
`emails`	An array of strings where each string represents an email.
`namedEntities`	An array of complex types that contain the following fields: `category`; `value` (the actual entity name); `offset` (the location where it was found in the text); `confidence` (a higher value means it's more likely to be a real entity).
`entities`	An array of complex types that contain rich information about the entities extracted from text, with the following fields: `name` (the actual entity name, in a "normalized" form); `wikipediaId`; `wikipediaLanguage`; `wikipediaUrl` (a link to the Wikipedia page for the entity); `bingId`; `type` (the category of the entity recognized); `subType` (available only for certain categories; gives a more granular view of the entity type); `matches` (a complex collection whose items contain `text`, the raw text for the entity; `offset`, the location where it was found; and `length`, the length of the raw entity text).

Sample definition

  {
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "categories": [ "Person", "Email"],
    "defaultLanguageCode": "en",
    "includeTypelessEntities": true,
    "minimumPrecision": 0.5,
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "persons",
        "targetName": "people"
      },
      {
        "name": "emails",
        "targetName": "contact"
      },
      {
        "name": "entities"
      }
    ]
  }
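If skill definitions are generated programmatically, a small builder can catch invalid parameter values before the skillset is submitted. The sketch below reproduces the sample definition above as a Python dictionary. The function `build_skill` and its argument names are hypothetical; only the JSON property names and allowed values come from this article.

```python
import json

ALLOWED_CATEGORIES = {
    "Person", "Location", "Organization", "Quantity", "Datetime", "URL", "Email",
}


def build_skill(categories, minimum_precision, outputs, *,
                language="en", typeless=False, source="/document/content"):
    """Build a v2 Entity Recognition skill definition as a dict.

    `outputs` is a list of (name, targetName) pairs; targetName may be None.
    """
    unknown = set(categories) - ALLOWED_CATEGORIES
    if unknown:
        raise ValueError(f"unsupported categories: {sorted(unknown)}")
    if not 0 <= minimum_precision <= 1:
        raise ValueError("minimumPrecision must be between 0 and 1")
    return {
        "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
        "categories": list(categories),
        "defaultLanguageCode": language,
        "includeTypelessEntities": typeless,
        "minimumPrecision": minimum_precision,
        "inputs": [{"name": "text", "source": source}],
        "outputs": [
            {"name": name, "targetName": target} if target else {"name": name}
            for name, target in outputs
        ],
    }


skill = build_skill(
    ["Person", "Email"],
    0.5,
    [("persons", "people"), ("emails", "contact"), ("entities", None)],
    typeless=True,
)
print(json.dumps(skill, indent=2))
```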

Sample input

{
    "values": [
      {
        "recordId": "1",
        "data":
           {
             "text": "Contoso corporation was founded by John Smith. They can be reached at contact@contoso.com",
             "languageCode": "en"
           }
      }
    ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data" : 
      {
        "persons": [ "John Smith"],
        "emails":["contact@contoso.com"],
        "namedEntities": 
        [
          {
            "category":"Person",
            "value": "John Smith",
            "offset": 35,
            "confidence": 0.98
          }
        ],
        "entities":  
        [
          {
            "name":"John Smith",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Person",
            "subType": null,
            "matches": [{
                "text": "John Smith",
                "offset": 35,
                "length": 10
            }]
          },
          {
            "name": "contact@contoso.com",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Email",
            "subType": null,
            "matches": [
            {
                "text": "contact@contoso.com",
                "offset": 70,
                "length": 19
            }]
          },
          {
            "name": "Contoso",
            "wikipediaId": "Contoso",
            "wikipediaLanguage": "en",
            "wikipediaUrl": "https://en.wikipedia.org/wiki/Contoso",
            "bingId": "349f014e-7a37-e619-0374-787ebb288113",
            "type": null,
            "subType": null,
            "matches": [
            {
                "text": "Contoso",
                "offset": 0,
                "length": 7
            }]
          }
        ]
      }
    }
  ]
}

The offsets returned for entities in the output of this skill come directly from the Text Analytics API. If you use them to index into the original string, use the StringInfo class in .NET to extract the correct content. See the Text Analytics API documentation for details on how offsets are measured.
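To show how `offset` and `length` relate to the input, the sketch below slices the sample text with the `matches` values from the sample output. It uses plain Python string slicing, which counts code points rather than the text elements that `StringInfo` works with, so it is only a valid check for text like this ASCII sample, where every text element is a single code point. The helper name `slice_match` is hypothetical.

```python
text = ("Contoso corporation was founded by John Smith. "
        "They can be reached at contact@contoso.com")

# The `matches` entries from the sample output above.
matches = [
    {"text": "John Smith", "offset": 35, "length": 10},
    {"text": "contact@contoso.com", "offset": 70, "length": 19},
    {"text": "Contoso", "offset": 0, "length": 7},
]


def slice_match(source: str, match: dict) -> str:
    """Naive slice by code point; not safe for combining marks or emoji."""
    start = match["offset"]
    return source[start:start + match["length"]]


for m in matches:
    print(repr(slice_match(text, m)))
```

For input that contains combining characters or emoji, a grapheme-aware approach would be needed to mirror what `StringInfo` does in .NET.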

Warning cases

If the language code for the document is unsupported, a warning is returned and no entities are extracted.
