Multilingual and emoji support in Language service features

11/21/2024

Multilingual and emoji support has led to Unicode encodings that use more than one code point to represent a single displayed character, called a grapheme. For example, emojis like 🌷 and 👍 may use several code points to compose the shape, with additional code points for visual attributes such as skin tone. Similarly, the Hindi word अनुच्छेद is encoded as five letters and three combining marks.
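To make the counts concrete, here is a small Python sketch that inspects the Hindi word and a thumbs-up emoji with a skin tone modifier. It uses only the standard `unicodedata` module. The variable names are illustrative, and the word is written with escapes so each code point is visible.

```python
import unicodedata

# The Hindi word अनुच्छेद, spelled out code point by code point.
word = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"

# General category "L*" marks letters, "M*" marks combining marks.
letters = [c for c in word if unicodedata.category(c).startswith("L")]
marks = [c for c in word if unicodedata.category(c).startswith("M")]

print(len(word), len(letters), len(marks))

# Thumbs-up followed by a medium skin tone modifier.
thumbs = "\U0001F44D\U0001F3FD"
code_points = len(thumbs)                           # Python counts code points
utf16_units = len(thumbs.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
utf8_bytes = len(thumbs.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

The same text therefore has a different length depending on whether you count code points, UTF-16 code units, or UTF-8 bytes, which is why an offset only makes sense together with its counting scheme.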

Because of the different lengths of possible multilingual and emoji encodings, Language service features may return offsets in the response.

Offsets in the API response

Whenever offsets are returned in the API response, remember:

Elements in the response may be specific to the endpoint that was called.
HTTP POST/GET payloads are encoded in UTF-8, which may or may not be the default character encoding on your client-side compiler or operating system.
Offsets refer to grapheme counts based on the Unicode 8.0.0 standard, not character counts.
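Because the payload is UTF-8 regardless of the client's default encoding, it can help to encode and decode explicitly rather than rely on platform defaults. The following is a minimal sketch using only the standard library. The field names in `payload` are placeholders for illustration, not the service's actual request schema.

```python
import json

# Hypothetical request body; "id" and "text" are illustrative field names only.
payload = {"id": "1", "text": "cafe\u0301 \U0001F44D"}

# ensure_ascii=False keeps non-ASCII characters unescaped;
# encode("utf-8") makes the wire encoding explicit instead of platform-dependent.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
headers = {"Content-Type": "application/json; charset=utf-8"}

# Decode explicitly as UTF-8 when reading a response back.
round_tripped = json.loads(body.decode("utf-8"))

# The UTF-8 byte length is not a usable offset unit: it differs from the
# number of code points in the same serialized text.
char_count = len(json.dumps(payload, ensure_ascii=False))
byte_count = len(body)
```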

Extracting substrings from text with offsets

Offsets can cause problems when using character-based substring methods, for example the .NET Substring() method. One problem is that an offset may cause a substring method to end in the middle of a multi-character grapheme encoding instead of at its end.
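The problem is not specific to .NET. As an illustration, the sketch below applies grapheme-based offsets to a Python string, which is indexed by code point. The offsets are hypothetical values chosen for this example, not output from the service.

```python
# "café menu" with the accent stored as a separate combining mark (U+0301).
text = "cafe\u0301 menu"

# Hypothetical grapheme-based offset and length for the word "menu":
# graphemes are c, a, f, é, space, m, e, n, u.
offset, length = 5, 4

# Naive slicing counts code points, so the combining mark shifts everything by one.
naive = text[offset:offset + length]

# Taking the first four graphemes ("café") by code point cuts the last grapheme
# in half and drops the accent.
truncated = text[0:4]

print(repr(naive))
print(repr(truncated))
```

Here the first slice starts one position too early, and the second ends in the middle of a grapheme, leaving the combining accent behind.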

In .NET, consider using the StringInfo class, which enables you to work with a string as a series of textual elements, rather than individual character objects. You can also look for grapheme splitter libraries in your preferred software environment.
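Python has no direct standard-library counterpart to StringInfo, so the following is only a simplified sketch of the idea: it attaches each combining mark to the code point before it and then slices by the resulting clusters. This is an assumption-laden approximation, not the full Unicode grapheme segmentation rules. It does not handle emoji sequences, Hangul jamo, or other cases, and the helper names are hypothetical. For real workloads, a library that implements Unicode text segmentation (for example, the `\X` pattern in the third-party `regex` package) is a safer choice.

```python
import unicodedata

def split_graphemes(text):
    """Approximate grapheme clusters by gluing combining marks to the previous code point."""
    clusters = []
    for ch in text:
        # General category "M*" covers nonspacing, spacing, and enclosing marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def grapheme_substring(text, offset, length):
    """Extract a substring using offsets counted in (approximate) graphemes."""
    clusters = split_graphemes(text)
    return "".join(clusters[offset:offset + length])

text = "cafe\u0301 menu"
print(grapheme_substring(text, 5, 4))  # the word after the space
print(grapheme_substring(text, 0, 4))  # first word, accent kept intact

hindi = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"  # अनुच्छेद
print(len(split_graphemes(hindi)))
```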

The Language service features return these textual elements as well, for convenience.

Endpoints that return an offset support the stringIndexType parameter. This parameter adjusts the offset and length attributes in the API output to match the requested string iteration scheme. Currently, we support three types:

textElement_v8 (default): iterates over graphemes as defined by the Unicode 8.0.0 standard
unicodeCodePoint: iterates over Unicode Code Points, the default scheme for Python 3
utf16CodeUnit: iterates over UTF-16 Code Units, the default scheme for JavaScript, Java, and .NET
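To show how the schemes diverge, this sketch computes where the same word starts when counted in code points and in UTF-16 code units. It is a local illustration of the two counting schemes under the assumption that the text contains one emoji outside the Basic Multilingual Plane. The function names are hypothetical and nothing here calls the service.

```python
def code_point_offset(text, needle):
    """Offset of needle counted in Unicode code points (Python's native indexing)."""
    return text.index(needle)

def utf16_offset(text, needle):
    """Offset of needle counted in UTF-16 code units."""
    prefix = text[:text.index(needle)]
    # Each UTF-16 code unit is two bytes; code points above U+FFFF take two units.
    return len(prefix.encode("utf-16-le")) // 2

text = "\U0001F337 tulip"  # 🌷 followed by a space and the word "tulip"

cp = code_point_offset(text, "tulip")
u16 = utf16_offset(text, "tulip")
print(cp, u16)
```

The tulip emoji is one code point but two UTF-16 code units, so the word starts one position later under utf16CodeUnit than under unicodeCodePoint.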

If the stringIndexType requested matches the programming environment of choice, substring extraction can be done using standard substring or slice methods.
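As a sketch of what that looks like in Python: with offsets expressed in code points, an ordinary slice is enough, while offsets expressed in UTF-16 code units need a conversion step first. The offset values below are assumed for illustration rather than taken from a real response, and `slice_utf16` is a hypothetical helper.

```python
def slice_utf16(text, offset, length):
    """Extract a substring from offsets counted in UTF-16 code units."""
    units = text.encode("utf-16-le")
    # Two bytes per code unit, so scale both bounds by 2 before slicing.
    return units[2 * offset:2 * (offset + length)].decode("utf-16-le")

text = "\U0001F337 tulip"  # 🌷 followed by a space and the word "tulip"

# Offsets in code points match Python's native string indexing.
by_code_point = text[2:2 + 5]

# Offsets in UTF-16 code units must be converted before slicing in Python.
by_utf16 = slice_utf16(text, 3, 5)

print(by_code_point, by_utf16)
```

Requesting the scheme that matches the client language avoids the conversion helper entirely, which is the simpler option when the endpoint allows it.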
