

Lesson 9 -- Text and Audio in Prompts

We have been using the <prompt> element throughout this tutorial without looking at it in detail. In this lesson, we examine the <prompt> element in more depth. In particular, we look at:

  • Attributes of the <prompt> element

  • Speech markup

  • Omitting the <prompt>...</prompt> tags

  • Using audio files

  • A version of app-root.vxml that uses audio files

The <prompt> elements that contain the questions posed to callers can contain text or audio files. When the <prompt> element contains text alone, it is rendered by text-to-speech (TTS). In prior lessons we have exclusively used prompts that deliver TTS.

As an alternative to rendering text with TTS, the <prompt> element can contain an <audio> element that references an audio file. The application then plays the audio file rather than using TTS.

Attributes of the <prompt> element

The <prompt> element has seven attributes, all of which are optional:

  • bargein — Controls whether a user can interrupt a prompt. This was covered in Lesson 4.

  • bargeintype — The way in which the prompt is stopped if bargein is allowed. This was covered in Lesson 4.

  • cond — A JavaScript expression that must evaluate to true if the prompt is to be played.

  • count — The value of the prompt counter for this prompt. Using this counter, the application can play different prompts depending on how many times the caller has been prompted.

  • timeout — The number of seconds (s) or milliseconds (ms) that the platform waits for caller input after the prompt is finished playing before throwing a noinput event. If omitted, the value is determined by the timeout property (Tellme platform default is 7s).

  • xml:lang — The language identifier for the prompt. If omitted, it defaults to the value specified in the document's xml:lang attribute.

  • xml:base – Declares the base URI from which relative URIs in the prompt are resolved. This base declaration has precedence over the <vxml> base URI declaration. If a local declaration is omitted, the value is inherited down the document hierarchy.

Refer to http://www.w3.org/TR/2004/REC-voicexml20-20040316/#dml4.1 and https://msdn.microsoft.com/en-us/library/ff928996.aspx for more information about these attributes.
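
As a rough illustration of how these attributes can work together, the sketch below shows a field with a conditional greeting and two prompts selected by the prompt counter. The form and field names, the isReturningCaller variable, the city list, and the timeout values are hypothetical and are not part of the tutorial application.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1" xml:lang="en-US">

   <!-- Hypothetical flag, used only to demonstrate the cond attribute -->
   <var name="isReturningCaller" expr="false"/>

   <form id="ask_city">
      <field name="city">
         <!-- Played only when the cond expression evaluates to true -->
         <prompt cond="isReturningCaller">Welcome back.</prompt>

         <!-- First attempt: the caller may interrupt; wait 5 seconds for input -->
         <prompt count="1" bargein="true" timeout="5s">
            Please say your city.
         </prompt>

         <!-- Second and later attempts: longer prompt that cannot be interrupted -->
         <prompt count="2" bargein="false" timeout="10s">
            Please say the name of a city, such as Seattle or Boston.
         </prompt>

         <grammar mode="voice" version="1.0" root="top" xml:lang="en-US">
            <rule id="top">
               <one-of>
                  <item>seattle</item>
                  <item>boston</item>
               </one-of>
            </rule>
         </grammar>

         <!-- Replaying the prompts increments the prompt counter -->
         <noinput>
            <reprompt/>
         </noinput>
      </field>
   </form>
</vxml>
```

A prompt without a count attribute is treated as count="1", so in this sketch the conditional greeting would be played only on the first attempt.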

Speech markup

In Lesson 4 we introduced the <break> element that is used to put a pause in the TTS rendered by a prompt. This <break> element is one of 14 "speech markup" elements in VoiceXML. These elements can all appear in a <prompt> element. Valid markup elements include:

Markup Element    Description
<audio>           Specifies audio files to be played and text to be spoken.
<break>           Specifies a pause in the speech output.
<emphasis>        Specifies that the enclosed text should be spoken with emphasis.
<phoneme>         Specifies a phonetic pronunciation for the contained text.
<prosody>         Specifies rate and volume information for the enclosed text.
<say-as>          Specifies how to say the type of text contained within the element.
<voice>           Specifies voice characteristics for the spoken text.

The Tellme platform's implementation of the speech markup elements is detailed in the VoiceXML 2.x Element Catalog (https://msdn.microsoft.com/en-us/library/ff928995.aspx).

With the exception of <audio>, which we will discuss separately below, all of the speech markup elements in the table above are used to modify generated TTS. Here are a few examples:

  • When a word or phrase is enclosed in <emphasis>...</emphasis> tags, it is stressed relative to the other words in the text. The <emphasis> element has an optional attribute level, which can have the values strong, moderate (default), none, and reduced. You can use level="none" and level="reduced" to de-emphasize words that are too strong in the generated TTS.

  • You can enclose a word or phrase in <phoneme>...</phoneme> tags to force the TTS generator to pronounce the words phonetically. Using this element is complicated. See https://msdn.microsoft.com/en-us/library/ff929020.aspx for details.

  • You can control the speaking rate and volume of the speech output with the <prosody> element. <prosody> has two attributes: rate (can be slow, medium, or fast), and volume (can be silent, soft, medium, or loud).

  • The <say-as> element aids the TTS engine in pronunciation by resolving ambiguities regarding the meaning of the contained text. The type attribute can have values such as address, time:hm (a time in hours and minutes), date:mdy (a date formatted as mm/dd/yyyy), and a number of others. For example: <say-as type="currency">$20.45</say-as>. For more information on the <say-as> element, see https://msdn.microsoft.com/en-us/library/ff929005.aspx.

  • The <voice> element controls the vocal characteristics for the contained text when played back by the TTS engine. It has two attributes: gender and name. gender can be male or female, while name can be tom (male) or zira (female). There may be more named voices on the Tellme platform in the future—look for them at https://msdn.microsoft.com/en-us/library/ff929005.aspx.
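
The elements described above can be combined within a single prompt. The following is a minimal sketch that uses only attribute values mentioned in the list above; the prompt wording and the form name are made up for illustration, and how each element actually sounds depends on the platform's TTS engine.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1" xml:lang="en-US">
   <form id="markup_demo">
      <block>
         <prompt>
            <!-- Stress a single word relative to the rest of the sentence -->
            Your flight leaves at <emphasis level="strong">six</emphasis> a.m.
            <break/>

            <!-- Speak this sentence more slowly and more loudly -->
            <prosody rate="slow" volume="loud">
               Please arrive two hours early.
            </prosody>
            <break size="small"/>

            <!-- Tell the TTS engine to read the text as an amount of money -->
            The change fee is <say-as type="currency">$20.45</say-as>.

            <!-- Switch to the named female voice for the closing sentence -->
            <voice gender="female" name="zira">
               Thank you for calling Contoso Travel.
            </voice>
         </prompt>
      </block>
   </form>
</vxml>
```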

Omitting the <prompt>...</prompt> tags

We have noted several times previously in this tutorial that unenclosed text found in <field>, <block>, and several other elements will automatically be rendered as TTS. What this means, quite simply, is that you do not always need to enclose a prompt in <prompt>...</prompt> tags.

You can omit the <prompt> ... </prompt> tags if:

  1. There is no need to specify a prompt attribute (such as bargein). In such a case, all the <prompt> attributes have their default values.

  2. The prompt consists entirely of plain text (PCDATA) or consists of just an <audio> or <value> element.

  3. The prompt does not include speech markup elements such as <break>.

For example, these are all legal prompts:

  • Please say your city.

  • <audio src="say_your_city.wav"/>

  • I think I heard you say <value expr="myCity"/>

Note

Although it would be unusual, just <value expr="myCity"/> all by itself would also be a valid prompt.
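
To make the three conditions concrete, here is a small sketch that contrasts a prompt written without tags with two prompts that need them. The form name and the value assigned to myCity are assumptions made for this example only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1" xml:lang="en-US">

   <!-- Hypothetical value, set here only so the example is complete -->
   <var name="myCity" expr="'Seattle'"/>

   <form id="implicit_prompts">
      <block>
         <!-- Plain text and <value> only: the <prompt> tags can be omitted -->
         I think I heard you say <value expr="myCity"/>
      </block>

      <block>
         <!-- <break> is speech markup, so the <prompt> tags are required -->
         <prompt>
            Thank you.<break/>Goodbye.
         </prompt>
      </block>

      <block>
         <!-- An attribute (bargein) is specified, so the tags are required -->
         <prompt bargein="false">Please stay on the line.</prompt>
      </block>
   </form>
</vxml>
```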

Where can you use the <prompt> element

Legal parents for the <prompt> element are:

  • block

  • catch

  • error

  • field

  • filled

  • foreach

  • help

  • if

  • initial

  • menu

  • noinput

  • nomatch

  • object

  • prompt

  • record

  • subdialog

  • transfer

Note

In any of the above listed elements, plain text, when not enclosed in tags and not including any elements other than <value>, will automatically be rendered as prompts with TTS.

We saw several examples of this in Lesson 8, where we enclosed prompt text in <noinput>, <nomatch>, and <help> elements.
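
As a brief reminder of what that looks like, the sketch below places unenclosed text directly inside three event handlers. The field name, the prompt wording, and the inline yes/no grammar are hypothetical and are kept small so that the example stands on its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1" xml:lang="en-US">
   <form id="handler_demo">
      <field name="answer">
         <prompt>Do you want to continue? Say yes or no.</prompt>

         <grammar mode="voice" version="1.0" root="top" xml:lang="en-US">
            <rule id="top">
               <one-of>
                  <item>yes</item>
                  <item>no</item>
               </one-of>
            </rule>
         </grammar>

         <!-- Unenclosed text inside each handler is rendered as a TTS prompt -->
         <noinput>Sorry, I did not hear you.<reprompt/></noinput>
         <nomatch>Sorry, I did not understand.<reprompt/></nomatch>
         <help>Say yes to continue, or no to stop.</help>
      </field>
   </form>
</vxml>
```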

Using audio files

The <audio> element can be used to play music or voice .wav files. For example:

<audio src="company_theme_song.wav" />

<audio src="welcome_statement.wav"/>

<audio src="main_prompt.wav"/>

See https://msdn.microsoft.com/en-us/library/ff929030.aspx for details about the <audio> element and the file formats it supports.

We have been using TTS for all of the prompts in app-root.vxml and have had some success in using speech markup to make the prompts sound better. Even so, refining a complex TTS sentence so that it sounds like a person speaking can be quite complicated.

The best prompts are those that play professionally produced audio files.

In this lesson, we will replace most of our prompts with audio files that have been made with Tellme Studio's "record by phone" utility (see https://studio.tellme.com/help/whatis_recordbyphone.html). These files were recorded by the author of this tutorial and are most definitely not of professional quality. Even so, they are more natural sounding than the TTS prompts. They are introduced here to show you how to use recorded prompts.

The https://studio.tellme.com/vxml-tutorial directory, which is our xml:base path, contains the audio files that we will use in Lesson 9.

To play an audio file prompt, we can simply include <audio src="main_prompt.wav"/> within the <prompt>...</prompt> tags where we would have otherwise placed the text to be rendered as TTS.

The recommended approach, however, is to include both an audio file and text for TTS. Then, if the application cannot fetch the audio file for some reason, the text is rendered as TTS. If the audio file is fetched and played, the text is ignored. Here is a new version of our main_selection prompt, as an example:

 
   <prompt bargein="true" bargeintype="speech">
      <audio  src="main_prompt.wav">
         Welcome to Contoso Travel<break/>
         Say new reservation<break size="small"/> or press 1<break/>
         Say change reservation<break size="small"/> or press 2<break/>
         Say restaurant recommendation<break size="small"/> or press 3
      </audio>
   </prompt>

A version of app-root.vxml that uses audio files

In this version of app-root.vxml we have replaced much of the TTS with audio files. Note several things:

  1. We have added a new property (audiofetchhint) with the value prefetch, so that the audio files are loaded in advance of their being needed. Otherwise, there might be delays in playing the prompts, caused by waiting for them to be loaded.

  2. We wrapped the <value expr="caller_response"/> element in <voice gender="male" name="tom">...</voice> speech markup tags. Otherwise, the TTS rendering of <value expr="caller_response"/> would have used the default female voice. The male voice "tom" still sounds a bit different from the recorded audio voice, but using audio files for caller_response would have been too complicated for this tutorial.

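Here is the code. The changes from the previous version are the audiofetchhint property, the <audio> elements that wrap the prompt text, and the <voice> element in the confirmation prompt: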

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1" xml:lang="en-US"
      revision="4"
      xml:base="https://studio.tellme.com/vxml-tutorial/">
      
   <property name="audiofetchhint" value="prefetch"/>

   <script><![CDATA[

      var doNewPlaneRes = false;

      var doNewHotelRes = false;
      var doNewCarRes = false;
      var doChangePlaneRes = false;
      var doChangeHotelRes = false;
      var doChangeCarRes = false;
      var doRestaurantRec = false;

   ]]></script>

   <link event="help">

      <grammar mode="voice"
            root="root_rule"
            version="1.0"
            xml:lang="en-US">
         <rule id="root_rule">
               <item>help</item>
         </rule>
      </grammar>
   </link>

   <link next="#exit">
      <grammar mode="voice"
         root="root_rule"
         version="1.0"
         xml:lang="en-US">
         <rule id="root_rule">
            <one-of>
               <item>exit</item>
               <item>stop</item>
               <item>quit</item>
            </one-of>
         </rule>
      </grammar>
   </link>

   <form id="main">
      <field name="main_selection">
         <prompt bargein="true" bargeintype="speech">
            <audio  src="main_prompt.wav">
            Welcome to Contoso Travel<break/>
            Say new reservation<break size="small"/> or press 1<break/>
            Say change reservation<break size="small"/> or press 2<break/>
            Say restaurant recommendation<break size="small"/> or press 3
            </audio>
         </prompt>

         <grammar version="1.0" root="top" tag-format="semantics/1.0">
            <rule id="top">
               <item><ruleref uri="#nonsense"/></item>
               <one-of>
                  <item>new reservation
                  <tag>out="new reservation";</tag></item>

                  <item>change reservation
                  <tag>out="change reservation";</tag></item>

                  <item>restaurant recommendation
                  <tag>out="restaurant recommendation";</tag></item>
               </one-of>
               <item><ruleref uri="#nonsense"/></item>
            </rule>
            <rule id="nonsense">
               <one-of>
                  <item><ruleref special="GARBAGE"/></item>
                  <item><ruleref special="NULL"/></item>
               </one-of>
            </rule>
         </grammar>

         <grammar mode="dtmf" version="1.0" root="top">
            <rule id="top">
               <one-of>
                  <item>1</item>
                  <item>2</item>
                  <item>3</item>
               </one-of>
            </rule>
         </grammar>

         <help>
            <audio src="main_help.wav">
            You should say one of these three phrases exactly.
            New reservation, change reservation, or
            restaurant recommendation.
            </audio>
         </help>


         <catch event="noinput nomatch">
            <audio src="noinput_1.wav">
            Sorry. Didn't get that. Please try again.
            </audio>
            <reprompt/>
         </catch>

         <catch event="noinput nomatch" count="3">
            <audio src="noinput_3.wav">
               Sorry you're having trouble. Please call back later. Goodbye.
            </audio>
            <exit/>
         </catch>

         <filled>
            <var name="next_destination" expr="' '"/>
            <if cond="main_selection == 'new reservation' ||
                      main_selection == '1'">
               <assign name="next_destination"
                       expr="'#new_reservation'"/>
            <elseif cond="main_selection == 'change reservation' ||
                          main_selection == '2'"/>
               <assign name ="next_destination"
                       expr="'change-reservation.vxml'"/>
            <elseif cond="main_selection == 'restaurant recommendation' ||
                          main_selection == '3'"/>
               <assign name ="next_destination" expr="'restaurant.vxml'"/>
            <else/>
               <assign name ="next_destination" expr="'error.vxml'"/>
            </if>
            <goto expr="next_destination"/>
         </filled>
      </field>
   </form>

   <form id ="new_reservation">
      <field name="new_reservation_type">
         <prompt bargein="true" bargeintype="speech">
            <audio src="res_type_prompt.wav">
            What type of reservation do you want to make<break/>Say
            plane<break size="small"/>, hotel
            <break size="small"/>, or car<break/>
            You may choose more than one.
            </audio>
         </prompt>
         <grammar version="1.0" root="top" tag-format="semantics/1.0">
            <rule id="top">
               <one-of>
            <!--single choice -->
                  <item><ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane";</tag></item>

                  <item><ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>
                  <tag>out="hotel";</tag></item>

                  <item><ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>
                  <tag>out="car";</tag></item>

            <!--double choice -->
                  <item><ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel";</tag></item>

                  <item><ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel";</tag></item>

                  <item><ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>
                  <tag>out="hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>
                  <tag>out="hotel and car";</tag></item>

            <!--triple choice -->
                  <item><ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>

                  <item><ruleref uri="#nonsense"/>car
                  <ruleref uri="#nonsense"/>hotel
                  <ruleref uri="#nonsense"/>plane
                  <ruleref uri="#nonsense"/>
                  <tag>out="plane and hotel and car";</tag></item>
               </one-of>
            </rule>
            <rule id="nonsense">
               <one-of>
                  <item><ruleref special="GARBAGE"/></item>
                  <item><ruleref special="NULL"/></item>
               </one-of>
            </rule>
         </grammar>

         <help>
            <audio src="res_type_help.wav">
            Just say one word for each choice. Choices are plane,
            hotel, and car.
            </audio>
         </help>

         <catch event="noinput nomatch">
            <audio src="noinput_1.wav">
            Sorry. Didn't get that. Please try again.
            </audio>
            <reprompt/>
         </catch>

         <catch event="noinput nomatch" count="3">
            <audio src="noinput_3.wav">
               Sorry you're having trouble. Please call back later. Goodbye.
            </audio>
            <exit/>
         </catch>
      </field>

      <subdialog name="objAnswer" src="#confirmation">
         <param name="caller_response" expr="new_reservation_type"/>
         <filled>
            <if cond="objAnswer.confirmation == 'no'">
              <clear namelist="new_reservation_type objAnswer"/>

            <else/>
               <var name="next_destination" expr="' '"/>
               <if cond="new_reservation_type == 'plane'">
                  <assign name ="next_destination" expr="'new-plane.vxml'"/>
               <elseif cond="new_reservation_type == 'hotel'"/>
                  <assign name ="next_destination" expr="'new-hotel.vxml'"/>
               <elseif cond="new_reservation_type == 'car'"/>
                  <assign name ="next_destination" expr="'new-car.vxml'"/>
               <elseif cond="new_reservation_type == 'plane and hotel'"/>
                  <assign name ="next_destination" expr="'new-plane.vxml'"/>
                  <assign name ="doNewHotelRes" expr="true"/>
               <elseif cond="new_reservation_type == 'plane and car'"/>
                  <assign name ="next_destination" expr="'new-plane.vxml'"/>
                  <assign name ="doNewCarRes" expr="true"/>
               <elseif cond="new_reservation_type == 'hotel and car'"/>
                  <assign name ="next_destination" expr="'new-hotel.vxml'"/>
                  <assign name ="doNewCarRes" expr="true"/>
               <elseif cond="new_reservation_type == 'plane and hotel and car'"/>
                  <assign name ="next_destination" expr="'new-plane.vxml'"/>
                  <assign name ="doNewHotelRes" expr="true"/>
                  <assign name ="doNewCarRes" expr="true"/>
               <else/>
                  <assign name ="next_destination" expr="'error.vxml'"/>
               </if>

               <goto expr="next_destination"/>
            </if>
         </filled>
      </subdialog>
   </form>

   <form id="confirmation">
      <var name="caller_response"/>
      <field name="confirmation">
         <prompt>
            <audio src="i_heard.wav">
            I think I heard you say
            </audio>
            <voice gender="male" name="tom">
               <value expr="caller_response"/>
            </voice>
            <audio src="is_that_correct.wav">
            Is that correct?
            </audio>
         </prompt>
         <grammar src=
           "http://grammar.svc.tellme.com/yesno/mss/v2/confirm.grxml"/>

         <catch event="noinput nomatch">
            <audio src="noinput_1.wav">
            Sorry. Didn't get that. Please try again.
            </audio>
            <reprompt/>
         </catch>

         <catch event="noinput nomatch" count="3">
            <audio src="noinput_3.wav">
               Sorry you're having trouble. Please call back later. Goodbye.
            </audio>
            <exit/>
         </catch>

         <filled>
            <return namelist="confirmation"/>
            <clear namelist="confirmation"/>
         </filled>
      </field>
   </form>
   
    <form id="exit">
      <block>
         <exit/>
      </block>
   </form>

</vxml>

What's next?

In Lesson 9 we have started using audio files for prompts instead of always relying on TTS. These audio files were made using Tellme Studio's record by phone utility. They are amateurish, but even so, they sound more natural than the TTS. You can imagine what is possible with professionally produced audio files.

The changes to app-root.vxml at the end of the lesson are the last we will make in this tutorial.

Lesson 10, which is the last lesson in this tutorial, will cover the important subject of "mixed initiative forms." We will not, however, add a mixed initiative form to app-root.vxml.