Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Simulator Input and Output File Contents

Simulator takes as input an EMMA document and writes its output to an EMMA-formatted file. This topic describes the contents of both types of files.

Input File Contents

The input for Simulator is an EMMA document that contains one or more utterances. The EMMA document can contain:

  • A single utterance

  • An ordered sequence of utterances (similar to those in a telephone call)

  • An unordered group of utterances

  • A mixture of ordered and unordered groups of utterances

Grammar and audio inputs can be supplied as local file content or as URIs. If text is entered in the field for the location of audio content, Simulator processes that text through emulation rather than recognition. Transcripts contain text that represents the ideal expected recognition for the supplied audio, and may also contain tags that indicate noise or garbage audio. The EMMA document may contain additional information that Simulator does not need.

EMMA files used as input to Simulator can define sets of grammars, which can be assigned collectively to recognize an utterance. Each set of grammars must have a unique identifier. For each utterance, you can designate the grammars that are active for recognition by referencing the identifier for a set of grammars. A set of grammars designated to recognize an utterance constitutes the "grammar state" for that utterance. When using grammar states, each utterance should reference exactly one grammar state.

SimulatorResultsAnalyzer always creates a separate analysis section for each grammar set, which makes it easier to identify utterances that use a particular grammar state and to tune the associated grammars. Alternatively, you can specify grammars individually, using a unique identifier for each. However, SimulatorResultsAnalyzer does not create a separate analysis section for individually specified grammars; they are included only in the default analysis that covers all utterances.

The information in an input file that is relevant to Simulator is shown in the following list:

  • Grammars

  • Groups of grammars labeled with an identifier

  • Audio files (optional)

  • Audio files with audio overridden by text input (optional)

  • Serialized engine state (optional)

  • Engine parameter settings (optional)

  • Engine version information (optional)

  • Transcripts (optional)

  • Cookies containing information about individual utterances or the entire EMMA document (optional)

Example Use of Grammar States

The following example is an EMMA input file for Simulator. The example creates two grammar states (MyState1 and MyState2) by defining two sets of grammars and assigning each set an identifier. The first utterance specifies the identifier MyState1 to designate the grammars that will be active for recognition of the utterance. The second utterance designates a different set of grammars to use for recognition by specifying MyState2.

<?xml version="1.0" encoding="utf-8"?>
<emma:emma version="1.0" xmlns:ms="https://www.microsoft.com/xmlns/webreco" 
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <!-- 
  This first block allows us to define different grammar states, using the 
  ms namespace, and active grammar ids. 
  -->
  <emma:info>

    <!-- 
    The definition of the first 'state.' Each utterance references a set of one 
    or more active grammars, based on an id. In this way, a user can run offline 
    batch recognitions on a set of utterances that may have different 'states' 
    defined by active grammars and their rules. The analysis results will be 
    bucketed based on the active-grammars id. 
    -->
    <ms:active-grammar-set id="MyState1">
      <ms:grammar weight="1.1" emma:grammar-ref="grammar-0"/>
      <ms:grammar weight="1.2" emma:grammar-ref="grammar-1"/>
      <ms:grammar weight="1.3" emma:grammar-ref="grammar-2"/>
      <ms:grammar weight="1.4" emma:grammar-ref="grammar-3"/>
      <ms:grammar weight="1.5" emma:grammar-ref="grammar-4"/>
      <ms:grammar weight="1.6" emma:grammar-ref="grammar-5"/>
    </ms:active-grammar-set>

    <!-- The definition of the second 'state.' -->
    <ms:active-grammar-set id="MyState2">
      <ms:grammar weight="1.4" emma:grammar-ref="grammar-6"/>
      <ms:grammar weight="1.5" emma:grammar-ref="grammar-7"/>
      <ms:grammar weight="1.6" emma:grammar-ref="grammar-8"/>
    </ms:active-grammar-set>

  </emma:info>

  <emma:grammar id="grammar-0" ref="https://contoso.com/grammars/Grammar0.grxml" />
  <emma:grammar id="grammar-1" ref="https://contoso.com/grammars/Grammar1.grxml" />
  <emma:grammar id="grammar-2" ref="https://contoso.com/grammars/Grammar2.grxml" />
  <emma:grammar id="grammar-3" ref="https://contoso.com/grammars/Grammar3.grxml" />
  <emma:grammar id="grammar-4" ref="https://contoso.com/grammars/Grammar4.grxml" />
  <emma:grammar id="grammar-5" ref="https://contoso.com/grammars/Grammar5.grxml" />
  <emma:grammar id="grammar-6" ref="https://contoso.com/grammars/Grammar6.grxml" />
  <emma:grammar id="grammar-7" ref="https://contoso.com/grammars/Grammar7.grxml" />
  <emma:grammar id="grammar-8" ref="https://contoso.com/grammars/Grammar8.grxml" />

  <!-- 
  The following emma:group contains the set of utterances to be recognized. 
  In this example, there are two utterances, each to be recognized by a different 
  'state' of active grammars. 
  -->
  <emma:group id="SetofUtterances">

    <emma:group id="MyFirstUtterance">
      <emma:info>
        <ms:audio ref="http://capture.contoso.com/call/utt01.wav"/>
        <ms:active-grammar-set-ref ref="MyState1"/>
        <ms:transcript>
          <ms:original> sports </ms:original>
        </ms:transcript>
      </emma:info>
    </emma:group>

    <emma:group id="MySecondUtterance">
      <emma:info>
        <ms:audio ref="http://capture.contoso.com/call/utt02.wav" type="audio/x-wav" />
        <ms:active-grammar-set-ref ref="MyState2"/>
        <ms:transcript>
          <ms:original> weather please </ms:original>
        </ms:transcript>
      </emma:info>
    </emma:group>

  </emma:group>

</emma:emma>
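
The guidance that each utterance should reference exactly one grammar state can be checked before a batch run. The following Python script is a minimal sketch and not part of Simulator. It assumes that an utterance is any emma:group whose direct emma:info child contains an ms:audio element, and that the document uses grammar states throughout rather than individually specified grammars. The function name check_grammar_states and the trimmed SAMPLE document are hypothetical; the namespace URIs and element names are copied from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the example EMMA document.
NS = {
    "emma": "http://www.w3.org/2003/04/emma",
    "ms": "https://www.microsoft.com/xmlns/webreco",
}


def check_grammar_states(emma_xml: str) -> list[str]:
    """Return a list of problems found; the list is empty if none are found."""
    root = ET.fromstring(emma_xml)
    # Collect the ids of every defined grammar state.
    defined = {s.get("id") for s in root.iterfind(".//ms:active-grammar-set", NS)}
    problems = []
    for group in root.iter(f"{{{NS['emma']}}}group"):
        info = group.find("emma:info", NS)
        # Assumption: a group is an utterance only if its own emma:info has ms:audio.
        if info is None or info.find("ms:audio", NS) is None:
            continue
        refs = [r.get("ref") for r in info.findall("ms:active-grammar-set-ref", NS)]
        if len(refs) != 1:
            problems.append(
                f"{group.get('id')}: expected 1 grammar state, found {len(refs)}"
            )
        for ref in refs:
            if ref not in defined:
                problems.append(f"{group.get('id')}: undefined grammar state {ref!r}")
    return problems


# Hypothetical trimmed input: MyState2 is referenced but never defined.
SAMPLE = """<emma:emma version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns:ms="https://www.microsoft.com/xmlns/webreco">
  <emma:info>
    <ms:active-grammar-set id="MyState1">
      <ms:grammar weight="1.1" emma:grammar-ref="grammar-0"/>
    </ms:active-grammar-set>
  </emma:info>
  <emma:grammar id="grammar-0" ref="https://contoso.com/grammars/Grammar0.grxml"/>
  <emma:group id="SetofUtterances">
    <emma:group id="MyFirstUtterance">
      <emma:info>
        <ms:audio ref="http://capture.contoso.com/call/utt01.wav"/>
        <ms:active-grammar-set-ref ref="MyState1"/>
      </emma:info>
    </emma:group>
    <emma:group id="MySecondUtterance">
      <emma:info>
        <ms:audio ref="http://capture.contoso.com/call/utt02.wav"/>
        <ms:active-grammar-set-ref ref="MyState2"/>
      </emma:info>
    </emma:group>
  </emma:group>
</emma:emma>"""

if __name__ == "__main__":
    for problem in check_grammar_states(SAMPLE):
        print(problem)
```

The outer SetofUtterances group is skipped because it has no emma:info of its own, so only the two utterance groups are checked.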

Remarks

Whether Simulator reuses the serialized state depends on whether the utterances in the EMMA structure are organized within emma:sequence elements. The serialized state for a recognition, also known as the recognition context block, contains user-specific speech engine adaptations.

If you specify utterances using the emma:sequence element in the EMMA structure, all utterances contained in the first-level emma:group element beneath the emma:sequence element maintain serialized state, passing information learned about one utterance to the next utterance in the emma:sequence. If you do not use emma:sequence elements, the serialized state is not reused and the recognizer treats the utterances as a collection of unrelated calls.

When configured for local recognition, Simulator sends cookie information with the request for the grammar. You can add cookies for local or remote grammars that are referenced by an HTTP or HTTPS URL. Use the ms:cookie-jar element to add cookies to an emma:info element contained either within the global emma:emma element or within an utterance's emma:group element. An emma:info element can contain only one ms:cookie-jar element.

Three scopes can each contribute unique cookies to the final collection of cookies: the recognition engine configuration file, the global scope of the EMMA document, and the utterance scope of the EMMA document. If a new cookie has the same name, domain, and path as an older cookie, the new cookie replaces the older cookie.

For example, if the recognition engine configuration file specifies two unique cookies, and the global emma:info section of the EMMA document specifies two more unique cookies, and the emma:info section of the utterance also specifies two more unique cookies, then the collection of cookies will contain six cookies total. See Speech Recognition Engine Configuration File Settings for more information about cookies.
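
The replacement rule and the six-cookie example can be sketched in Python as follows. This is an illustration only, not Simulator's implementation. The Cookie type, the merge_cookies function, and the sample cookie names and values are hypothetical. The order in which scopes are applied (configuration file first, then global, then utterance) is also an assumption, because the text above does not state which scope counts as "newer".

```python
from typing import NamedTuple


class Cookie(NamedTuple):
    name: str
    domain: str
    path: str
    value: str


def merge_cookies(*scopes: list[Cookie]) -> list[Cookie]:
    """Merge scopes in the order given.

    A later cookie replaces an earlier one that has the same
    (name, domain, path) key; all other cookies are kept.
    """
    merged: dict[tuple[str, str, str], Cookie] = {}
    for scope in scopes:
        for cookie in scope:
            merged[(cookie.name, cookie.domain, cookie.path)] = cookie
    return list(merged.values())


# Two unique cookies from each of the three scopes (hypothetical values).
config_file = [
    Cookie("a", "contoso.com", "/", "1"),
    Cookie("b", "contoso.com", "/", "2"),
]
global_scope = [
    Cookie("c", "contoso.com", "/", "3"),
    Cookie("d", "contoso.com", "/", "4"),
]
utterance_scope = [
    Cookie("e", "contoso.com", "/", "5"),
    Cookie("f", "contoso.com", "/", "6"),
]

if __name__ == "__main__":
    cookies = merge_cookies(config_file, global_scope, utterance_scope)
    print(len(cookies))  # all six keys are distinct, so all six are kept
```

If the utterance scope instead supplied a cookie with the same name, domain, and path as one from the configuration file, the merged collection would hold five cookies, and the shared key would carry the utterance-scope value under the assumed ordering.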

Output File Contents

Simulator creates EMMA-formatted output files that may contain all the information from the input file, as well as the following information:

  • Detailed recognition results (that is, the actual recognition, confidence values, semantics, pronunciation, lattice, rule tree, and other recognition results)

  • A unique RequestId and SessionId for the recognition