Hi Szulakiewicz, Michal,
Greetings and welcome to the Microsoft Q&A forum! Thanks for posting your query.
The real-time Azure Speech-to-Text API currently does not return display-form word-level timestamps in its JSON response. The Batch Speech-to-Text API lets you choose between "Display form" and "Lexical form" word-level timestamps, but the real-time API only provides timestamps for lexical-form words in the Words array.
The suggested workaround is to manually align the lexical word timestamps with the normalized text (DisplayText) by applying inverse text normalization (ITN), capitalization, and punctuation detection to the lexical words. This process can be error-prone and time-consuming.
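To illustrate the alignment step, here is a minimal Python sketch. It is not an official Microsoft method, just one heuristic way to do it. It assumes you have already extracted the display text and the lexical word list from the detailed JSON result, with each lexical entry carrying "Word", "Offset" and "Duration" fields. The function name align_display_words and the helper _norm are made up for this example. Tokens that match after lowercasing and stripping punctuation inherit the lexical timestamp directly; where ITN rewrote a span (for example "twenty five" to "25"), the display tokens share the time range of the lexical words they replaced.

```python
import re
from difflib import SequenceMatcher


def _norm(token):
    # Lowercase and drop punctuation so "Apples." compares equal to "apples".
    return re.sub(r"[^\w']+", "", token.lower())


def align_display_words(display_text, lexical_words):
    """Heuristically attach lexical timestamps to display-form tokens.

    lexical_words is assumed to be a list of dicts with "Word", "Offset"
    and "Duration" keys (the shape assumed for the real-time Words array).
    """
    display_tokens = display_text.split()
    a = [_norm(t) for t in display_tokens]
    b = [_norm(w["Word"]) for w in lexical_words]
    result = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same word in both forms: copy the timestamp as-is.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                w = lexical_words[j]
                result.append({"Word": display_tokens[i],
                               "Offset": w["Offset"],
                               "Duration": w["Duration"]})
        elif tag == "replace":
            # Rewritten span (e.g. ITN): split the lexical time range
            # evenly across the display tokens that replaced it.
            start = lexical_words[j1]["Offset"]
            last = lexical_words[j2 - 1]
            end = last["Offset"] + last["Duration"]
            step = (end - start) // (i2 - i1)
            for k, i in enumerate(range(i1, i2)):
                result.append({"Word": display_tokens[i],
                               "Offset": start + k * step,
                               "Duration": step})
        elif tag == "delete":
            # Display token with no lexical counterpart: timing unknown.
            for i in range(i1, i2):
                result.append({"Word": display_tokens[i],
                               "Offset": None,
                               "Duration": None})
        # "insert": lexical words with no display counterpart are skipped.
    return result
```

This handles simple cases only. Spans where the display and lexical forms differ a lot (dates, currencies, long numbers) get approximate timings, so please validate the output against your own audio before relying on it.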
Since this functionality is not directly supported, it is recommended to submit a feature request to Microsoft Azure to add support for display form word-level timestamps in the Real-Time Speech-to-Text API, similar to the Batch API. Here's the link to the Azure Feedback Forum: Post idea · Community (azure.com). This feature would eliminate the need for manual alignment and improve the usability of the API for scenarios like yours.
Hope this helps. Do let us know if you have any further queries.
If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".