Transparency note for Azure Video Indexer

An AI system includes not only the technology, but also the people who use it, the people affected by it, and the environment in which it's deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance.

What is transparency?

Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who use or are affected by your system.  

Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice.

To find out more, see Microsoft AI principles.

Introduction to Azure AI Video Indexer

Azure AI Video Indexer (VI) is a cloud-based tool that processes and analyzes uploaded video and audio files to generate different types of insights. These insights include detected objects, people, faces, key frames and translations or transcriptions in at least 60 languages. The insights and their time frames are displayed in a categorized list on the Azure AI Video Indexer website where each insight can be seen by pressing its Play button.

While processing the files, Azure AI Video Indexer employs a portfolio of Microsoft AI algorithms to analyze, categorize, and index the video footage. The resulting insights are then archived and can be comprehensively accessed, shared, and reused. For example, a news media outlet may implement a deep search for insights related to the Empire State Building and then reuse its findings in different movies, trailers, or promos.

The basics of Azure AI Video Indexer

Azure AI Video Indexer is a cloud-based Azure AI services product that is integrated with Azure AI services. It allows you to upload video and audio files, process them (including running AI models on them), and then save the processed files and resulting data to a cloud-based Azure Media Services account.

To process the media files, Azure AI Video Indexer employs AI technologies like Optical Character Recognition (OCR), Natural Language Processing (NLP), and hierarchical ontology models with voice tonality analysis to extract insights like brands, keywords, topics, and text-based emotion detection.

Azure AI Video Indexer’s capabilities include searching for insights in archives, promoting content accessibility, content moderation and content editing.

Insights categories include:

Insight category | Description
Audio media | For example, transcriptions, translations, and audio event detection such as clapping, crowd laughter, gunshots, and explosions
Video media | For example, faces and clothing detection
Video with audio media | For example, named entities found in transcripts and in text extracted by Optical Character Recognition (OCR), such as names of locations, people, or brands

For more information, see Introduction to Azure AI Video Indexer.

Key terms and features

Term | Definition
Text-based emotion detection | Emotions such as joy, sadness, anger, and fear that were detected via transcript analysis.
Insight | The information and knowledge derived from processing and analyzing video and audio files. Insights can include detected objects, people, faces, key frames, and translations or transcriptions. To view and download insights, use the API or the Azure AI Video Indexer portal.
Object detection | The ability to identify and find objects in an image or video. For example, a table, chair, or window.
Facial detection | Finds human faces in an image and returns bounding boxes indicating their locations. Face detection models alone do not find individually identifying features, only a bounding box marking the entire face. Facial detection doesn't involve distinguishing one face from another face, predicting or classifying facial attributes, or creating a Face template.
Facial identification | "One-to-many" matching of a face in an unmanipulated image to a set of faces in a secure repository. An example is a touchless access control system in a building that replaces or augments physical cards and badges, in which a smart camera captures the face of one person entering a secured door and attempts to find a match from a set of images of faces of individuals who are approved to access the building. This process is implemented by Azure AI Face service and involves the creation of Face templates.
Face template | Unique set of numbers generated from an image or video that represents the distinctive features of a face.
Observed people detection and matched faces | Features that automatically detect and match people in media files. Observed people detection and matched faces can be set to display insights on people, their clothing, and the exact time frame of their appearance.
Keyword extraction | The process of automatically detecting insights on the different keywords discussed in media files. Keyword extraction can extract insights in both single-language and multi-language media files.
Deep search | The ability to retrieve only relevant video and audio files from a video library by searching for specific terms within the extracted insights.
Labels | The identification of visual objects and actions appearing in a frame. For example, identifying an object such as a dog, or an action such as running.
Named entities | Feature that uses Natural Language Processing (NLP) to extract insights on the locations, people, and brands appearing in audio and images in media files.
Natural Language Processing (NLP) | The processing of human language as it is spoken and written.
Optical Character Recognition (OCR) | Extracts text from images like pictures, street signs, and products in media files to create insights. For more information, see OCR technology.
Hierarchical Ontology Model | A set of concepts or categories in a subject area or domain that possess shared properties and relationships.
Audio effects detection | Feature that detects insights on a variety of acoustic events and classifies them into acoustic categories. Audio effects detection can detect and classify different categories such as laughter, crowd reactions, alarms and/or sirens.
Transcription, translation and language identification | Feature that automatically detects, transcribes, and translates the speech in media files into over 50 languages.
Topics inference | Feature that automatically creates inferred insights derived from the transcribed audio, OCR content in visual text, and celebrities recognized in the video.
Speaker diarization | Feature that identifies each speaker in a video and attributes each transcribed line to a speaker. This allows for the identification of speakers during conversations and can be useful in a variety of scenarios.
Bring Your Own Model | Feature that allows you to send insights and artifacts generated by Azure AI Video Indexer to external AI models.
Textual Video Summarization | Feature that uses artificial intelligence to summarize the content of a video.
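
The Deep search term defined above can be illustrated with a minimal Python sketch. The in-memory `library` dictionary, its category names (`keywords`, `named_entities`), and the `deep_search` helper are hypothetical and exist only for illustration; the service's own search is backed by its index, not by this logic.

```python
# Hypothetical mini-library: file name -> extracted insights grouped by category.
library = {
    "news_clip.mp4": {
        "keywords": ["election", "budget"],
        "named_entities": ["Empire State Building"],
    },
    "promo.mp4": {
        "keywords": ["trailer"],
        "named_entities": [],
    },
}

def deep_search(library, term):
    """Return the files whose extracted insights mention the term (case-insensitive)."""
    term = term.lower()
    hits = []
    for video, insights in library.items():
        # Flatten every category and look for a substring match.
        if any(term in value.lower()
               for values in insights.values()
               for value in values):
            hits.append(video)
    return hits

print(deep_search(library, "empire state"))  # ['news_clip.mp4']
```

Only files whose insights contain the search term are returned, which mirrors the idea of retrieving "only relevant" media from a larger library.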

Components of Azure AI Video Indexer

During the Azure AI Video Indexer procedure, a media file is processed using Azure APIs to extract different types of insights, as follows:

Component | Definition
Video uploader | The user uploads a media file to be processed by Azure AI Video Indexer.
Insights generation | Azure services APIs, such as Azure AI services OCR and Transcription, extract insights. Internal AI models are run to generate insights like Detected Audio Events, Observed People, Detected Clothing, and Topics.
Insights processing | Additional logic, such as confidence level threshold filtering, is applied to the output of Insights generation to create the final insights that are then displayed in the Azure AI Video Indexer portal and in the JSON file that can be downloaded from the portal.
Storage | Output from the processed media file is saved in Azure Storage and in Azure Search, where users can search for videos using specific insights like an actor’s name, a location, or a brand.
Notification | The user receives notification that the indexing process has been completed.
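
The confidence level threshold filtering mentioned for the Insights processing component can be pictured with a short Python sketch. The record layout, the field names (`name`, `confidence`), and the 0.5 threshold are assumptions chosen for illustration; the thresholds and logic the service actually applies are not described here.

```python
from typing import Dict, List

def filter_by_confidence(raw_insights: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep only insights whose confidence meets the (assumed) threshold."""
    return [item for item in raw_insights
            if item.get("confidence", 0.0) >= threshold]

# Hypothetical output of the insights generation step.
raw = [
    {"name": "laughter", "confidence": 0.82},
    {"name": "siren", "confidence": 0.31},
    {"name": "crowd", "confidence": 0.50},
]

final = filter_by_confidence(raw)
print([item["name"] for item in final])  # ['laughter', 'crowd']
```

Raising the threshold trades recall for precision: fewer insights are shown, but those that remain are ones the model is more certain about.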

Limited Access features of Azure AI Video Indexer

Facial recognition features of Azure AI Video Indexer (including facial detection, facial identification, facial templates, observed people detection, and matched faces) are Limited Access and are only available to Microsoft managed customers and partners, and only for certain use cases selected at the time of registration. Access to the facial identification and celebrity recognition capabilities requires registration. Facial detection does not require registration. To learn more, visit Microsoft’s Limited Access policy.

Approved commercial use cases for Limited Access features

Facial Identification to search for a face in a media or entertainment video archive: to find a face within a video and generate metadata for media or entertainment use cases only.

Celebrity Recognition: to detect and identify celebrities within images or videos in digital asset management systems, for accessibility and/or media and entertainment use cases only.

Approved public sector use cases for Limited Access features

Facial identification for preservation and enrichment of public media archives: to identify individuals in public media or entertainment video archives for the purposes of preserving and enriching public media only. Examples of public media enrichment include identifying historical figures in video archives and generating descriptive metadata.

Facial identification to:

  • assist law enforcement or court officials in prosecution or defense of a criminal suspect who has already been apprehended, to the extent specifically authorized by a duly empowered government authority in a jurisdiction that maintains a fair and independent judiciary OR
  • assist officials of duly empowered international organizations in the prosecution of abuses of international criminal law, international human rights law, or international humanitarian law.

Facial identification for purposes of providing humanitarian aid, or identifying missing persons, deceased persons, or victims of crimes.

Respect privacy

When used responsibly and carefully, Azure AI Video Indexer is a valuable tool for many industries. To respect the privacy and safety of others, we recommend the following:

  • Always respect an individual’s right to privacy, and only ingest videos for lawful and justifiable purposes.  
  • Do not purposely disclose inappropriate media showing young children or family members of celebrities or other content that may be detrimental or pose a threat to an individual’s personal freedom.  
  • Commit to respecting and promoting human rights in the design and deployment of your analyzed media.  
  • When using third-party materials, be aware of any existing copyrights or required permissions before distributing content derived from them.
  • Always seek legal advice when using media from unknown sources.
  • Always obtain appropriate legal and professional advice to ensure that your uploaded videos are secured and have adequate controls to preserve the integrity of your content and to prevent unauthorized access.
  • Provide a feedback channel that allows users and individuals to report issues with the service.  
  • Be aware of any applicable laws or regulations that exist in your area regarding processing, analyzing, and sharing media containing people.
  • Keep a human in the loop. Do not use any solution as a replacement for human oversight and decision-making.  
  • Fully examine and review the potential of any AI model you are using to understand its capabilities and limitations.

For more information, see Microsoft Global Human Rights Statement.

Example use cases for Azure AI Video Indexer

Azure AI Video Indexer can be used in multiple scenarios in a variety of industries, such as:  

  • Creating feature stories at news or media agencies by implementing deep searches for specific people and/or words to find what was said, by whom, where and when. Facial identification capabilities are Limited Access. For more information, visit Microsoft’s Limited Access policy.  
  • Creating promos and trailers using important moments previously extracted from videos. Azure AI Video Indexer can assist by adding keyframes, scene markers, timestamps and labelling so that content editors invest less time reviewing numerous files.
  • Promoting accessibility by translating and transcribing audio into multiple languages and adding captions, or by creating a verbal description of footage via OCR processing to enhance accessibility for the visually impaired.
  • Improving content distribution to a diverse audience in different regions and languages by delivering content in multiple languages using Azure AI Video Indexer’s transcription and translation capabilities.
  • Enhancing targeted advertising in industries like news media or social media by using Azure AI Video Indexer to extract insights that improve the relevance of the advertising.
  • Enhancing user engagement using metadata, tags, keywords, and embedded customer insights to filter and tailor media to customer preferences.  
  • Moderating inappropriate content such as banned words using textual and visual content control to tag media as child approved or for adults only.
  • Accurately and quickly detecting violent incidents by classifying gunshots, explosions, and glass shattering in a smart-city system or in other public environments that include cameras and microphones.
  • Enhancing compliance with local standards by extracting text in warnings in online instructions and then translating the text, for example, in e-learning instructions for using equipment.
  • Enhancing and improving manual closed captioning and subtitles generation by leveraging Azure AI Video Indexer’s transcription and translation capabilities and by using the closed captions generated by Azure AI Video Indexer in one of the supported formats.
  • Transcribing videos in unknown languages by using language identification (LID) or multi language identification (MLID) to allow Azure AI Video Indexer to automatically identify the languages appearing in the video and generate the transcription accordingly.

Use case considerations

  • Avoid using Video Indexer for decisions that may have serious adverse impacts. Decisions based on incorrect output could have serious adverse impacts. Additionally, it is advisable to include human review of decisions that have the potential for serious impacts on individuals.
  • The Video Indexer text-based emotion detection was not designed to assess employee performance or the emotional state of an individual.
  • Bring Your Own Model
    • Azure AI Video Indexer isn't responsible for the way you use an external AI model. It is your responsibility to ensure that your external AI models are compliant with Responsible Artificial Intelligence standards.
    • Azure AI Video Indexer isn't responsible for the custom insights you create while using the Bring Your Own Model feature, as they are not generated by Azure AI Video Indexer models.

Characteristics and limitations of Video Indexer

The intended use of Azure AI Video Indexer is to generate insights from recorded media and entertainment content. Extracted insights are created in a JSON file that lists the insights in categories. Each insight holds a list of unique elements, and each element has its own metadata and a list of its instances. For example, a face might have an ID, a name, a thumbnail, other metadata, and a list of its temporal instances. The output of some insights may also display a confidence score to indicate its accuracy level.
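
As a rough illustration of the structure described above (categories, unique elements, per-element metadata, and temporal instances), the following Python sketch walks a small, hand-written JSON sample. The category names and field names (`faces`, `labels`, `id`, `name`, `confidence`, `instances`, `start`, `end`) are assumptions modeled on that description; consult the actual output schema before relying on them.

```python
import json

# Hand-written sample that mimics the described layout (not real service output).
sample = json.loads("""
{
  "faces": [
    {
      "id": 1,
      "name": "Unknown1",
      "confidence": 0.82,
      "instances": [
        {"start": "0:00:05", "end": "0:00:12"},
        {"start": "0:01:30", "end": "0:01:41"}
      ]
    }
  ],
  "labels": [
    {
      "id": 7,
      "name": "dog",
      "instances": [{"start": "0:00:20", "end": "0:00:25"}]
    }
  ]
}
""")

def list_instances(insights):
    """Flatten category -> element -> instance into (category, name, start, end) rows."""
    rows = []
    for category, elements in insights.items():
        for element in elements:
            for inst in element.get("instances", []):
                rows.append((category, element["name"], inst["start"], inst["end"]))
    return rows

for row in list_instances(sample):
    print(row)
```

Flattening the hierarchy this way makes it easy to sort or filter appearances by time, regardless of which insight category they came from.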

A JSON file can be accessed in three ways:

  • Azure AI Video Indexer portal, an easy-to-use solution that lets you evaluate the product, manage the account, and customize models.  
  • API integration, via a REST API, which lets you integrate the solution into your apps and infrastructure.  
  • Embeddable widget, which lets you embed the Azure AI Video Indexer insights, player, and editor experiences into your app to customize the insights displayed in a web interface. For example, the list can be customized to display insights only about people appearing in a video. To find videos that include a specific celebrity, a content editor can implement a deep search using the name appearing in the Face or People insights categories.

Video

  • Azure AI Video Indexer has a storage limit of 30 GB and 4 hours for uploaded, previously recorded videos.
  • Always upload high-quality video and audio content. The recommended maximum frame size is HD and frame rate is 30 FPS. A frame should contain no more than 10 people. When outputting frames from videos to AI models, only send around two or three frames per second. Processing 10 or more frames might delay the AI result. At least 1 minute of spontaneous conversational speech is required to perform analysis. Audio effects are detected in nonspeech segments only. The minimal duration of a nonspeech section is 2 seconds. Voice commands and singing aren't supported.
  • The generated insights might be less accurate when people and faces are recorded by cameras that are high-mounted, down-angled, or have a wide field of view (FOV), because they might be represented by fewer pixels.
  • Typically, small people or objects under 200 pixels and people who are seated might not be detected. People wearing similar clothes or uniforms might be detected as being the same person and are given the same ID number. People or objects that are obstructed might not be detected. Tracks of people with front and back poses might be split into different instances.
  • An observed person must first be detected and appear in the People category before they're matched. Tracks are optimized to handle observed people who frequently appear in the foreground. Obstructions like overlapping people or faces might cause mismatches between matched people and observed people. Mismatching might occur when different people appear in the same relative spatial position in the frame within a short period.
  • Dresses and skirts are categorized as Dresses or Skirts. Clothing the same color as a person’s skin isn't detected. A full view of the person is required. To optimize detection, both the upper and lower body should be included in the frame.
  • Avoid using the OCR results of signatures that are hard to read for both humans and machines. A better way to use OCR is to use it for detecting the presence of a signature for further analysis.
  • Named entities only detects insights in audio and images. Logos in a brand name might not be detected.
  • Detectors might misclassify objects in videos that are in a "birds-eye" view, as they were trained with a frontal view of objects.
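
The numeric limits in the list above (30 GB, 4 hours, 30 FPS, and two or three frames per second when sending frames to AI models) can be checked before upload with a sketch like the following. Both helper functions are hypothetical and only encode the figures quoted in this section; they are not part of any SDK.

```python
def check_upload(size_gb: float, duration_hours: float, fps: float) -> list:
    """Return warnings for a file that exceeds the limits quoted in this section."""
    warnings = []
    if size_gb > 30:
        warnings.append("file exceeds 30 GB")
    if duration_hours > 4:
        warnings.append("video exceeds 4 hours")
    if fps > 30:
        warnings.append("frame rate above the recommended 30 FPS")
    return warnings

def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float = 2.0) -> list:
    """Pick evenly spaced frame indices so roughly target_fps frames per second are kept."""
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))

print(check_upload(size_gb=12.5, duration_hours=1.5, fps=30))  # []
print(sample_frame_indices(total_frames=90, source_fps=30))    # [0, 15, 30, 45, 60, 75]
```

In the second call, a 3-second clip at 30 FPS is reduced to six frames, that is, two frames per second.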

Audio

  • Avoid using audio with loud background music or music with repetitive and/or linearly scanned frequency. Audio effects detection is designed for nonspeech audio only and therefore can't classify events in loud music. Music with repetitive and/or linearly scanned frequency might be incorrectly classified as an alarm or siren.

Textual summarization notes

Important

When using textual summarization, it's important to note that the system is not intended to replace the full viewing experience, especially for content where details and nuances are crucial. It's also not designed for summarizing highly sensitive or confidential videos where context and privacy are paramount.

  • Non-English languages: The Textual Video Summary was primarily tested and optimized for the English language. However, it's compatible with all languages supported by the specific GenAI model being used, that is, GPT3.5 Turbo or GPT4.0. So, when applied to non-English languages, the accuracy and quality of the summaries might vary. To mitigate this limitation, users employing the feature for non-English languages should be extra careful and verify the generated summaries for accuracy and completeness.
  • Videos with multiple languages: If a video incorporates speech in multiple languages, the Textual Video Summary might struggle to accurately recognize all the languages featured in the video content. Users should be aware of this potential limitation when utilizing the Textual Video Summarization feature for multilingual videos.
  • Highly specialized or technical videos: Video Summary AI models are typically trained on a wide variety of videos, including news, movies, and other general content. If the video is highly specialized or technical, the model might not be able to accurately extract the summary of the video.
  • Videos with poor audio quality and no OCR: Textual Video Summary AI models also rely on audio and other insights to extract the summary from the video, or on OCR to extract the text appearing on screen. If the audio quality is poor and there's no OCR identified, the model might not be able to accurately extract the summary from the video.
  • Videos with low lighting or fast motion: Videos that are shot in low lighting or have fast motion might be difficult for the model to process, resulting in poor performance.
  • Videos with uncommon accents or dialects: AI models are typically trained on a wide variety of speech, including different accents and dialects. However, if the video contains speech with an accent or dialect that isn't well represented in the training data, the model might struggle to accurately extract the transcript from the video. 
  • Videos containing harmful content: Videos with harmful or sensitive content might be filtered out and excluded, leading to a partial summary.
  • User choices and customization: The Textual Summarization feature has settings that allow users to tailor the summarization process to their needs. These include summary length, quality, output format, and formal, casual, short, or long text styles. However, these settings also introduce variability in the system’s performance. Customization can enhance your experience, but it might also influence the system’s accuracy and efficiency. It’s a balance between personalization and the system’s operational capabilities. You're expected to use the system responsibly, with an understanding of its limitations and the effect of your choices on the final output.
  • Textual summarization with keyframes: Summarization with keyframes is based on keyframes selection with shots detection. Therefore, any limitation that applies to shots detection applies to textual summarization with keyframes. Keyframe selection is based on a proprietary AI model that might make mistakes. Keyframe detection might not capture all the visual aspects of the video so they might be missed in the summary. In addition, there's a varying limit to the number of frames that can be used for summarizing a section of a video, so frames in sections filtered by harmful content detection or other filters might be discarded. So, the summarization results might be incomplete or incorrect for some parts or sections of the video.

Textual summarization using VI enabled by Arc notes

Textual summarization enabled by Arc (also known as using VI on an edge device) utilizes the Phi-3.5-mini-instruct model. The Phi-3.5 model has a context size of 128k and modest hardware requirements. There’s no charge for requests to change the model.

Specifications

  • Hardware requirements: GPU V100 or Intel CPU 32 cores. CPU is very slow and not recommended.
  • Tested on Standard_NC24ads_A100_v4. For more hardware support information, refer to the official release.
  • Average runtime on A100 was ~14.5% of the video duration. For short videos, the runtime can be as low as ~11.9%.
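
To turn the runtime percentages above into a planning figure, a back-of-the-envelope Python sketch follows. The function name is hypothetical, and the ratios are the averages quoted in this section for an A100; actual runtimes vary with hardware and content.

```python
def estimate_runtime_seconds(video_seconds: float, ratio: float = 0.145) -> float:
    """Estimate summarization time as a fraction of the video duration."""
    return video_seconds * ratio

# A 60-minute video at the ~14.5% average: about 522 seconds (roughly 8.7 minutes).
print(round(estimate_runtime_seconds(60 * 60), 1))

# A 10-minute video at the ~11.9% best case quoted for short videos: about 71 seconds.
print(round(estimate_runtime_seconds(10 * 60, ratio=0.119), 1))
```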

Known limitations and known issues

  • An AI language model creates the summarization feature and serves to provide a general overview. The content might not fully encapsulate the essence of the original material. It's recommended that a human review and edit the summary before use. It shouldn’t be viewed as professional or personalized advice.
  • The summary’s results are generally consistent within each flavor. However, editing the transcript or reindexing the video might lead to different results.
  • When utilizing Flavors, the Neutral style might occasionally resemble the Formal style. The Casual style might include content-related hashtags. Additionally, a Medium length summary might be shorter than a Short summary.
  • Videos that have little content (such as very short videos) are typically not summarized to mitigate the potential model inaccuracies that can happen when the input is short.
  • The summary might occasionally include or reference internal instructions provided to it (referred to as “meta-prompt”). It could contain directives to exclude harmful content.
  • Longer videos might result in a higher-level, less detailed summary.
  • The generated summary might contain inaccuracies, such as incorrect identification of gender, age, and other personal characteristics.
  • If the original video contains inappropriate content, the video summarization output extract might be incomplete, contain disclaimers regarding the inappropriate content, and include the actual inappropriate quotes, which might be presented with or without a disclaimer.

Textual summarization with keyframes using VI enabled by Arc notes

Textual summarization with keyframes is based on keyframes selection with shots detection. Therefore, any limitation that applies to shots detection applies to textual summarization with keyframes.

Specifications

  • Average runtime on A100 was ~24% of the video duration. For short videos, the runtime can be as low as ~20%.

Known limitations and known issues

  • Keyframe selection is based on a proprietary AI model that might make mistakes.
  • Keyframe detection might not capture all the visual aspects of the video so they might be missed in the summary.
  • There's a varying limit to the number of frames that can be used for summarizing a section of a video, so frames in sections filtered by harmful content detection or other filters might be discarded. So, the summarization results might be incomplete or incorrect for some parts or sections of the video.

Audio effects detection

Audio effects detection notes

  • Avoid using short or low-quality audio. Audio effects detection provides probabilistic and partial data on detected nonspeech audio events. For accuracy, audio effects detection requires at least 2 seconds of clear nonspeech audio. Voice commands or singing aren't supported.
  • Avoid using audio with loud background music or music with repetitive and/or linearly scanned frequency. Audio effects detection is designed for nonspeech audio only and therefore can't classify events in loud music. Music with repetitive and/or linearly scanned frequency might be incorrectly classified as an alarm or siren.
  • To promote more accurate probabilistic data, keep the following in mind:
    • Audio effects can be detected in nonspeech segments only.
    • The duration of a nonspeech section should be at least 2 seconds.
    • Low quality audio might affect the detection results.
    • Events in loud background music aren't classified.
    • Music with repetitive and/or linearly scanned frequency might be incorrectly classified as an alarm or siren.
    • Knocking on a door or slamming a door might be labeled as a gunshot or explosion.
    • Prolonged shouting or sounds of physical human effort might be incorrectly classified.
    • A group of people laughing might be classified as both laughter and crowd.
    • Natural and nonsynthetic gunshot and explosion sounds are supported.

Audio effects detection components

During the audio effects detection procedure, audio in a media file is processed, as follows:

Component | Definition
Source file | The user uploads the source file for indexing.
Segmentation | The audio is analyzed, nonspeech audio is identified and then split into short overlapping intervals.
Classification | An AI process analyzes each segment and classifies its contents into event categories such as crowd reaction or laughter. A probability list is then created for each event category according to department-specific rules.
Confidence level | The estimated confidence level of each audio effect is calculated as a range of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.
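
The Segmentation and Confidence level steps can be sketched in Python as follows. The 1-second window and 0.5-second hop are illustrative assumptions (the actual interval lengths aren't stated here); the 2-second minimum reflects the nonspeech duration requirement given in the notes above. Both function names are hypothetical.

```python
def overlapping_windows(start: float, end: float,
                        window: float = 1.0, hop: float = 0.5,
                        min_duration: float = 2.0) -> list:
    """Split a nonspeech region [start, end] into fixed-length overlapping windows.

    Regions shorter than min_duration are skipped entirely.
    """
    if end - start < min_duration:
        return []
    # Number of windows that fit completely inside the region.
    count = int((end - start - window) // hop) + 1
    return [(start + i * hop, start + i * hop + window) for i in range(count)]

def confidence_to_percent(score: float) -> int:
    """Convert a 0-to-1 confidence score to a whole-number percentage."""
    return round(score * 100)

print(overlapping_windows(10.0, 12.0))  # [(10.0, 11.0), (10.5, 11.5), (11.0, 12.0)]
print(overlapping_windows(0.0, 1.5))    # [] (shorter than 2 seconds)
print(confidence_to_percent(0.82))      # 82
```

Because consecutive windows share half their audio, an event that straddles a window boundary still falls fully inside at least one window.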

Clapper board detection

Clapper board detection notes

  • The detection algorithm might not correctly identify the values.
  • The titles of the fields appearing on the clapper board are optimized to identify the most popular fields appearing on top of clapper boards.
  • The fields detection algorithm might not correctly identify handwritten text or digital digits.
  • The algorithm is optimized to identify fields' categories that appear horizontally.
  • The clapper board might not be detected if the frame is blurred or the text written on it can't be read by the human eye.
  • Empty field values might lead to wrong field categories.

Clapper board detection components

No components are defined.

Content moderation

See Cognitive Services Content Moderation.

Face detection and celebrity recognition

Face detection notes

Face detection is a tool for many industries when it's used responsibly and carefully. To respect the privacy and safety of others, and to comply with local and global regulations, we recommend that you follow these use guidelines:

  • Carefully consider the accuracy of the results. To promote more accurate detection, check the quality of the video. Low-quality video might affect the insights that are presented.
  • Carefully review results if you use face detection for law enforcement. People might not be detected if they're small, sitting, crouching, or obstructed by objects or other people. To ensure fair and high-quality decisions, combine face detection-based automation with human oversight.
  • Don't use face detection for decisions that might have serious, adverse impacts. Decisions that are based on incorrect output can have serious, adverse impacts. It's advisable to include human review of decisions that have the potential for serious impacts on individuals.

Face detection components

The following table describes how images in a media file are processed during the face detection procedure:

Component | Definition
Source file | The user uploads the source file for indexing.
Detection and aggregation | The face detector identifies the faces in each frame. The faces are then aggregated and grouped.
Recognition | The celebrities model processes the aggregated groups to recognize celebrities. If you've created your own people model, it also processes groups to recognize other people. If people aren't recognized, they're labeled Unknown1, Unknown2, and so on.
Confidence value | Where applicable for well-known faces or for faces that are identified in the customizable list, the estimated confidence level of each label is calculated as a range of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82 percent certainty is represented as a 0.82 score.

Keywords extraction

Keywords extraction notes

  • Always upload high-quality audio and video content.
  • The recommended maximum frame size is HD, and the recommended frame rate is 30 FPS.
  • A frame should contain no more than 10 people.
  • When outputting frames from videos to AI models, send only around 2 or 3 frames per second. Processing 10 or more frames might delay the AI result.
  • At least 1 minute of spontaneous conversational speech is required to perform analysis.
  • Audio effects are detected in nonspeech segments only. The minimal duration of a nonspeech section is 2 seconds.
  • Voice commands and singing aren't supported.
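
The guidance to send only around 2 or 3 frames per second can be sketched as a simple sampling step. The following Python example is an illustration under stated assumptions: the `sample_frame_indices` function and its parameters are hypothetical, and it only selects which frame indices to keep. It doesn't decode video or call any Azure service.

```python
def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float = 2.0) -> list[int]:
    """Pick evenly spaced frame indices so roughly target_fps frames per second are kept."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("Frame rates must be positive.")
    # Keep every Nth frame, where N is the ratio of source rate to target rate.
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))


# A 3-second clip at 30 FPS has 90 frames; sampling at 2 FPS keeps 6 of them.
indices = sample_frame_indices(total_frames=90, source_fps=30, target_fps=2)
print(indices)
```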

Keywords extraction components

During the Keywords procedure, audio and images in a media file are processed, as follows:

Component | Definition
Source language | The user uploads the source file for indexing.
Transcription API | The audio file is sent to Azure AI services, and the transcribed and translated output is returned. If a language has been specified, it's processed.
OCR of video | Images in a media file are processed using the Azure AI Vision Read API to extract text, its location, and other insights.
Keywords extraction | An extraction algorithm processes the transcribed audio. The results are then combined with the insights detected in the video during the OCR process. The keywords and where they appear in the media are then detected and identified.
Confidence level | The estimated confidence level of each keyword is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Labels identification

Labels identification notes

  • Carefully consider the accuracy of the results. To promote more accurate detection, check the quality of the video. Low-quality video might affect the detected insights.
  • Carefully consider the use of labels for law enforcement, because labels might not detect parts of the video. To ensure fair and high-quality decisions, combine labels with human oversight.
  • Don't use labels identification for decisions that might have serious adverse impacts. Machine learning models can produce undetected or incorrect classification output, and decisions based on incorrect output could have serious adverse impacts. It's advisable to include human review of decisions that have the potential for serious impacts on individuals.

Labels identification components

During the Labels procedure, objects in a media file are processed, as follows:

Component | Definition
Source | The user uploads the source file for indexing.
Tagging | Images are tagged and labeled. For example, door, chair, woman, headphones, jeans.
Filtering and aggregation | Tags are filtered according to their confidence level and aggregated according to their category.
Confidence level | The estimated confidence level of each label is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.
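
The filtering and aggregation step can be sketched as follows. This Python example is illustrative only: the tag structure, the category names, and the 0.5 cutoff are assumptions, because the source doesn't state the threshold or the categories that the service uses.

```python
from collections import defaultdict


def filter_and_aggregate(tags: list[dict], min_confidence: float = 0.5) -> dict[str, list[str]]:
    """Drop tags below min_confidence, then group the remaining tag names by category."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for tag in tags:
        if tag["confidence"] >= min_confidence:
            grouped[tag["category"]].append(tag["name"])
    return dict(grouped)


# Hypothetical tags; the categories are invented for this example.
tags = [
    {"name": "chair", "category": "furniture", "confidence": 0.82},
    {"name": "door", "category": "furniture", "confidence": 0.31},
    {"name": "jeans", "category": "clothing", "confidence": 0.77},
]
result = filter_and_aggregate(tags)
print(result)
```

Raising `min_confidence` trades recall for precision: fewer labels survive, but the ones that do are less likely to be incorrect.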

Named entities

Named entities notes

  • Carefully consider the accuracy of the results. To promote more accurate detection, check the quality of the audio and images. Low-quality audio and images might affect the detected insights.
  • Named entities only detects insights in audio and images. Logos in a brand name might not be detected.
  • Carefully consider the use of named entities for law enforcement, because named entities might not always detect parts of the audio. To ensure fair and high-quality decisions, always combine named entities with human oversight.
  • Don't use named entities for decisions that might have serious adverse impacts on individuals and groups. Machine learning models that extract text can produce undetected or incorrect text output, and decisions based on incorrect output could have serious adverse impacts that must be avoided. Always include human review of determinations that have the potential for serious impacts on individuals.

Named entities components

During the named entities extraction procedure, the media file is processed, as follows:

Component | Definition
Source file | The user uploads the source file for indexing.
Text extraction | The audio file is sent to the Speech Services API to extract the transcription, and sampled frames are sent to the Azure AI Vision API to extract OCR text.
Analytics | The insights are then sent to the Text Analytics API to extract the entities. For example, Microsoft, Paris, or a person's name like Paul or Sarah.
Processing and consolidation | The results are then processed. Where applicable, Wikipedia links are added and brands are identified via the Video Indexer built-in and customizable branding lists.
Confidence value | The estimated confidence level of each named entity is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Observed people detection and matched faces

Observed people detection and matched faces notes

  • People are generally not detected if they appear small (minimum person height is 100 pixels).
  • Maximum frame size is full high definition (FHD).
  • Low-quality video (for example, dark lighting conditions) might affect the detection results.
  • The recommended frame rate is at least 30 FPS.
  • Recommended video input should contain up to 10 people in a single frame. The feature could work with more people in a single frame, but the detection result retrieves only up to 10 people per frame, those with the highest detection confidence.
  • Similar clothing: People who wear similar clothes (for example, people in uniforms or players in sports games) could be detected as the same person with the same ID number.
  • Obstruction: There might be errors where there are obstructions (by the scene, by the person's own body, or by other people).
  • Pose: The tracks might be split due to different poses (back/front).
  • Because clothing detection depends on the visibility of the person's body, the accuracy is higher if a person is fully visible. There might be errors when a person is without clothing. In this scenario, or in others with poor visibility, results such as long pants and skirt or dress might be returned.
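
To make two of these limits concrete, the following Python sketch emulates them on a list of per-frame detections: it discards people shorter than 100 pixels and keeps at most the 10 detections with the highest confidence. The `Detection` structure and the `select_detections` function are hypothetical, and this is only an approximation of the documented behavior, not the service's implementation.

```python
from dataclasses import dataclass

MIN_PERSON_HEIGHT_PX = 100   # minimum person height stated in the notes
MAX_PEOPLE_PER_FRAME = 10    # maximum number of people returned per frame


@dataclass
class Detection:
    """One detected person in a single frame (hypothetical structure)."""
    person_id: int
    height_px: int
    confidence: float  # 0 to 1


def select_detections(detections: list[Detection]) -> list[Detection]:
    """Drop people who are too small, then keep the highest-confidence detections."""
    tall_enough = [d for d in detections if d.height_px >= MIN_PERSON_HEIGHT_PX]
    ranked = sorted(tall_enough, key=lambda d: d.confidence, reverse=True)
    return ranked[:MAX_PEOPLE_PER_FRAME]


# Twelve people of sufficient height, plus one small person with high confidence.
detections = [Detection(i, 150, round(0.40 + 0.04 * i, 2)) for i in range(12)]
detections.append(Detection(person_id=99, height_px=80, confidence=0.99))

selected = select_detections(detections)
print([d.person_id for d in selected])
```

In this example, the small person is excluded despite the high confidence, and the two lowest-confidence detections are dropped to stay within the per-frame limit.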

Observed people detection and matched faces components

Component | Definition
Source file | The user uploads the source file for indexing.
Detection | The media file is tracked to detect observed people and their clothing. For example, shirt with long sleeves, dress, or long pants. To be detected, the full upper body of the person must appear in the media.
Local grouping | The identified observed faces are filtered into local groups. If a person is detected more than once, additional observed face instances are created for this person.
Matching and classification | The observed people instances are matched to faces. If there's a known celebrity, the observed person is given their name. Any number of observed people instances can be matched to the same face.
Confidence value | The estimated confidence level of each observed person is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Optical character recognition (OCR)

OCR

OCR notes

  • Video Indexer has an OCR limit of 50,000 words per indexed video. Once the limit is reached, no additional OCR results are generated.
  • Carefully consider the accuracy of the results. To promote more accurate detection, check the quality of the image. Low-quality images might affect the detected insights.
  • Carefully consider the use of OCR for law enforcement. OCR might misread or not detect parts of the text. To ensure fair and high-quality VI determinations, combine OCR-based automation with human oversight.
  • When extracting handwritten text, avoid using the OCR results of signatures that are hard to read for both humans and machines. A better way to use OCR is to detect the presence of a signature for further analysis.
  • Don't use OCR for decisions that might have serious adverse impacts on individuals or groups. Machine learning models that extract text can produce undetected or incorrect text output, and decisions based on incorrect output could have serious adverse impacts that must be avoided. Always include human review of decisions that have the potential for serious impacts on individuals.
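
Because no additional OCR results are generated after 50,000 words, it can be useful to flag videos whose OCR output might be incomplete. The following Python sketch shows one way to do that. The `ocr_may_be_truncated` function is hypothetical, and counting words by splitting on whitespace is an assumption: the source doesn't state how the service counts words.

```python
OCR_WORD_LIMIT = 50_000  # per indexed video, as stated in the notes


def ocr_may_be_truncated(ocr_lines: list[str]) -> bool:
    """Return True when the extracted text has reached the documented word limit."""
    # Assumption: words are approximated by whitespace-separated tokens.
    word_count = sum(len(line.split()) for line in ocr_lines)
    return word_count >= OCR_WORD_LIMIT


short_video = ["Breaking news", "Weather at 6"]
long_video = ["word"] * OCR_WORD_LIMIT

print(ocr_may_be_truncated(short_video))
print(ocr_may_be_truncated(long_video))
```

A video that's flagged this way is a reasonable candidate for human review, because text that appears later in the video might be missing from the results.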

OCR components

During the OCR procedure, text images in a media file are processed, as follows:

Component | Definition
Source file | The user uploads the source file for indexing.
Read model | Images are detected in the media file, and text is then extracted and analyzed by Azure AI services.
Get read results model | The extracted text is output in a JSON file.
Confidence value | The estimated confidence level of each word is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Textual emotion detection

Text-based emotion detection notes

  • This model is designed to help detect emotions in the transcript of a video. However, it isn't suitable for making assessments about an individual's emotional state, their ability, or their overall performance.
  • This emotion detection model is intended to help determine the sentiment behind sentences in the video’s transcript. However, it only works on the text itself, and might not perform well for sarcastic input or in cases where input might be ambiguous or unclear.
  • To increase the accuracy of this model, it's recommended that input data be in a clear and unambiguous format. Users should also note that this model doesn't have context about input data, which can affect its accuracy.
  • This model can produce both false positives and false negatives. To reduce the likelihood of either, users are advised to follow best practices for input data and preprocessing, and to interpret outputs in the context of other relevant information. It's important to note that the system doesn't have any context of the input data.
  • The outputs of this model should NOT be used to make assessments about an individual's emotional state or other human characteristics. This model is supported in English and might not function properly with non-English inputs. Non-English inputs are translated to English before they enter the model, and therefore might produce less accurate results.
  • The model should never be used to evaluate employee performance or to monitor individuals.
  • The model should never be used for making assessments about a person, their emotional state, or their ability.
  • The results of the model can be inaccurate and should be treated with caution.
  • The confidence of the model in its prediction must also be taken into account.
  • Non-English videos produce less accurate results.

Text-based emotion detection components

During the emotions detection procedure, the transcript of the video is processed, as follows:

Component | Definition
Source language | The user uploads the source file for indexing.
Transcription API | The audio file is sent to Azure AI services, and the transcribed and translated output is returned. A language is processed if it's specified.
Emotions detection | Each sentence is sent to the emotions detection model. The model produces the confidence level of each emotion. If the confidence level exceeds a specific threshold, and there's no ambiguity between positive and negative emotions, the emotion is detected. In any other case, the sentence is labeled as neutral.
Confidence level | The estimated confidence level of the detected emotions is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.
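
The decision rule in the emotions detection step can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the emotion names, the 0.5 threshold, and the way ambiguity is measured (a margin of 0.1 between the best positive and best negative score). The source states only that a threshold and an ambiguity check exist, not their values.

```python
# Assumed emotion groupings, for illustration only.
POSITIVE = {"joy"}
NEGATIVE = {"anger", "fear", "sadness"}


def classify_sentence(
    scores: dict[str, float],
    threshold: float = 0.5,         # hypothetical value
    ambiguity_margin: float = 0.1,  # hypothetical value
) -> str:
    """Return the detected emotion, or 'neutral' when the result is weak or ambiguous."""
    if not scores:
        return "neutral"
    top_emotion, top_score = max(scores.items(), key=lambda item: item[1])
    if top_score <= threshold:
        return "neutral"
    best_positive = max((s for e, s in scores.items() if e in POSITIVE), default=0.0)
    best_negative = max((s for e, s in scores.items() if e in NEGATIVE), default=0.0)
    # Treat close positive and negative scores as ambiguous.
    if abs(best_positive - best_negative) < ambiguity_margin:
        return "neutral"
    return top_emotion


clear = classify_sentence({"joy": 0.82, "anger": 0.05, "fear": 0.03, "sadness": 0.04})
weak = classify_sentence({"joy": 0.30, "anger": 0.20})
ambiguous = classify_sentence({"joy": 0.62, "sadness": 0.60})
print(clear, weak, ambiguous)
```

Defaulting to neutral in every uncertain case is a conservative design choice: the model reports an emotion only when the evidence is both strong and one-sided.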

Topics inference

See topics inference

Topics inference notes

  • When uploading a file, always use high-quality video content. The recommended maximum frame size is HD, and the recommended frame rate is 30 FPS. A frame should contain no more than 10 people. When outputting frames from videos to AI models, send only around two or three frames per second. Processing 10 or more frames might delay the AI result.
  • When uploading a file, always use high-quality audio and video content. At least 1 minute of spontaneous conversational speech is required to perform analysis. Audio effects are detected in nonspeech segments only. The minimal duration of a nonspeech section is 2 seconds. Voice commands and singing aren't supported.
  • Typically, small people or objects under 200 pixels and people who are seated might not be detected. People wearing similar clothes or uniforms might be detected as being the same person and are given the same ID number. People or objects that are obstructed might not be detected. Tracks of people with front and back poses might be split into different instances.

Topics inference components

Component | Definition
Source language | The user uploads the source file for indexing.
Preprocessing | Transcription, OCR, and facial recognition AIs extract insights from the media file.
Insights processing | Topics AI analyzes the transcription, OCR, and facial recognition insights extracted during preprocessing. Each line of transcribed text is examined using ontology-based AI technologies. OCR and facial recognition insights are examined together using ontology-based AI technologies.
Post-processing | For transcribed text, insights are extracted and tied to a topic category together with the line number of the transcribed text. For example, Politics in line 7. For OCR and facial recognition, each insight is tied to a topic category together with the time of the topic's instance in the media file. For example, Freddie Mercury in the People and Music categories at 20.00.
Confidence value | The estimated confidence level of each topic is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Transcription, translation, and language identification

Transcription and captions

Transcription, translation, and language identification notes

When used responsibly and carefully, Azure AI Video Indexer is a valuable tool for many industries. You must always respect the privacy and safety of others, and comply with local and global regulations. We recommend:

  • Carefully consider the accuracy of the results. To promote more accurate data, check the quality of the audio. Low-quality audio might affect the detected insights.
  • Video Indexer doesn't perform speaker recognition, so speakers aren't assigned an identifier across multiple files. You're unable to search for an individual speaker in multiple files or transcripts.
  • Speaker identifiers are assigned randomly and can only be used to distinguish different speakers in a single file.
  • Cross-talk and overlapping speech: When multiple speakers talk simultaneously or interrupt each other, it becomes challenging for the model to accurately distinguish and assign the correct text to the corresponding speakers.
  • Speaker overlaps: Sometimes, speakers might have similar speech patterns, accents, or use similar vocabulary, making it difficult for the model to differentiate between them.
  • Noisy audio: Poor audio quality, background noise, or low-quality recordings can hinder the model's ability to correctly identify and transcribe speakers.
  • Emotional speech: Emotional variations in speech, such as shouting, crying, or extreme excitement, can affect the model's ability to accurately diarize speakers.
  • Speaker disguise or impersonation: If a speaker intentionally tries to imitate or disguise their voice, the model might misidentify the speaker.
  • Ambiguous speaker identification: Some segments of speech might not have enough unique characteristics for the model to confidently attribute to a specific speaker.
  • Audio that contains languages other than the ones you selected produces unexpected results.
  • The minimal segment length for detecting each language is 15 seconds.
  • The language detection offset is 3 seconds on average.
  • Speech is expected to be continuous. Frequent alternations between languages might affect the model's performance.
  • The speech of non-native speakers might affect the model's performance (for example, when speakers use their first language and they switch to another language).
  • The model is designed to recognize spontaneous conversational speech with reasonable audio acoustics (not voice commands, singing, etc.).
  • Project creation and editing aren't available for multi-language videos.
  • Custom language models aren't available when using multi-language detection.
  • Adding keywords isn't supported.
  • The language indication isn't included in the exported closed caption file.
  • The update transcript operation in the API doesn't support multi-language files.
  • The model is designed to recognize spontaneous conversational speech (not voice commands, singing, and so on).
  • If Azure AI Video Indexer can't identify the language with a high enough confidence (greater than 0.6), the fallback language is English.
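
The fallback rule for language identification can be expressed as a small helper. This Python sketch assumes a hypothetical `resolve_language` function and uses `en-US` as the code for the English fallback, which is an assumption: the source says only that the fallback language is English.

```python
FALLBACK_LANGUAGE = "en-US"  # assumed locale code for the English fallback
MIN_CONFIDENCE = 0.6         # confidence must be greater than this value


def resolve_language(detected_language: str, confidence: float) -> str:
    """Keep the detected language only when its confidence is greater than 0.6."""
    if confidence > MIN_CONFIDENCE:
        return detected_language
    return FALLBACK_LANGUAGE


print(resolve_language("fr-FR", 0.82))
print(resolve_language("fr-FR", 0.60))
```

A confidence of exactly 0.6 falls back to English in this sketch, because the notes describe the required confidence as greater than 0.6.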

Here's a list of supported languages.

Transcription, translation, and language identification components

During the transcription, translation and language identification procedure, speech in a media file is processed, as follows:

Component | Definition
Source language | The user uploads the source file for indexing, and either specifies the video source language, selects auto detect single language (LID) to identify the language of the file (the output is saved separately), or selects auto detect multi language (MLID) to identify multiple languages in the file (the output of each language is saved separately).
Transcription API | The audio file is sent to Azure AI services to get the transcribed and translated output. If a language is specified, it's processed accordingly. If no language is specified, the LID or MLID process runs to identify the language, after which the file is processed.
Output unification | The transcribed and translated files are unified into the same file. The output data includes the speaker ID of each extracted sentence together with its confidence level.
Confidence value | The estimated confidence level of each sentence is calculated on a scale of 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.

Next steps

Learn more about responsible AI:

Contact us

VI Support visupport@microsoft.com