ASP.NET Distributed Speech Application Scenarios

This document provides a high-level overview of how the components of the Microsoft Speech Application Platform interact with one another in common usage scenarios.

Required Components

Deploying a speech-enabled Web application using SALT markup requires three components.

  1. An ASP.NET server. The Web server generates Web pages containing HTML, SALT, and embedded script. The script controls the dialogue flow for voice-only interactions. For example, if there are several prompts on a page, the script defines the order in which the audio prompts play.
  2. A Speech Server. Speech Server recognizes speech, and plays audio prompts and responses.
  3. A client. The Speech Platform supports two types of clients: Telephony Application Services clients, and multimodal clients with a version of Internet Explorer running either Speech Add-in for Microsoft Internet Explorer or Speech Add-in for Microsoft Pocket Internet Explorer.
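
The prompt-ordering role of the embedded script can be sketched as follows. This is only an illustration written in Python; in a real application the logic is embedded script in the page, and the field names and prompt texts below are hypothetical. The sketch assumes a simple policy: play the prompt of the first field that has not yet been filled.

```python
# Hypothetical page model: each speech-enabled field has a prompt and,
# once the caller has answered, a recognized value.
PAGE_FIELDS = [
    {"name": "origin", "prompt": "Where are you flying from?", "value": None},
    {"name": "destination", "prompt": "Where are you flying to?", "value": None},
    {"name": "date", "prompt": "On what date?", "value": None},
]


def next_prompt(fields):
    """Return the prompt of the first field without a value, or None if all are filled."""
    for field in fields:
        if field["value"] is None:
            return field["prompt"]
    return None


def fill(fields, name, value):
    """Record a recognized value for the named field."""
    for field in fields:
        if field["name"] == name:
            field["value"] = value
            return
    raise KeyError(name)
```

Calling `next_prompt` after each recognition result walks the caller through the fields in page order, which is one way a script can define the order in which prompts play.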

The following diagram illustrates these elements and the types of information they process. It also illustrates the relationship of these elements to the Visual Studio .NET 2003 Speech Development Tools.

Common Usage Scenarios

This section describes three common deployment configurations that the Speech Platform supports.

Telephony Scenario

In this scenario, Telephony Application Services (TAS) is the client. A telephone acts as the terminal device, and connects to TAS through a standard telephony board. The telephony board provides the interface between the telephone and TAS. At run time, TAS relies on the Web server for application logic, and on Speech Server for audio signal processing.

When the user dials a phone number for a telephony service, the call connects to TAS. TAS associates the telephone call with a voice-only SALT interpreter. TAS then connects to the Web server and loads the default page of the application that provides the service the caller dialed. As the caller interacts with the application, TAS passes audio and dual tone multi-frequency (DTMF) input from the caller to Speech Server, which performs speech recognition (SR), text-to-speech (TTS), and DTMF processing.
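
The call sequence described above can be traced with a small model. This is a sketch in Python for illustration only, not the TAS implementation; the phone number, the page URL, and the function name are hypothetical.

```python
# Hypothetical mapping from dialed numbers to application default pages.
APPLICATIONS = {"555-0100": "http://webserver.example/flights/default.aspx"}


def handle_call(dialed_number, caller_inputs, applications=APPLICATIONS):
    """Return an ordered trace of the steps taken for one telephone call.

    caller_inputs is a list of (kind, payload) pairs, where kind is
    "audio" or "dtmf".
    """
    trace = ["TAS: call connected", "TAS: attach voice-only SALT interpreter"]
    page_url = applications[dialed_number]
    trace.append(f"TAS -> Web server: load {page_url}")
    for kind, _payload in caller_inputs:
        if kind == "audio":
            trace.append("TAS -> Speech Server: audio for speech recognition")
        elif kind == "dtmf":
            trace.append("TAS -> Speech Server: DTMF for DTMF processing")
        else:
            raise ValueError(f"unknown input kind: {kind}")
    return trace
```

For example, `handle_call("555-0100", [("audio", b"\x00"), ("dtmf", "5")])` yields a five-step trace: two setup steps, one page load, and one forwarding step per caller input.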

The SASDK includes a number of Dialog Speech Controls that support Computer-Supported Telecommunications Applications (CSTA) services. These include the AnswerCall, TransferCall, MakeCall, and DisconnectCall controls. Developers can use these controls to answer, transfer, initiate, and disconnect telephone calls, as well as gather call information, and send and receive CSTA events. The SASDK also includes a SmexMessage (Simple Messaging Extension) control that developers can use to send and receive raw CSTA messages.

Desktop Multimodal Scenario

In this scenario, the client is Microsoft Internet Explorer with Speech Add-in for Microsoft Internet Explorer installed. ASP.NET speech-enabled Web application pages reside on the Web server.

When the user enters a URL in Internet Explorer, the Web server opens the application's default page. The Web server sends HTML, SALT, and JScript to the Speech Add-in on the desktop. SALT markup in the pages that the Web server sends to the client triggers speech recognition and text-to-speech synthesis. To implement SALT functionality, the Speech Add-in instantiates a shared SAPI SR engine at run time. If necessary, the Speech Add-in also instantiates a TTS engine and a prompt engine on the client. These engines on the desktop client perform all prompting, speech recognition, and text-to-speech synthesis.

Note  Multimodal applications using a desktop client can be deployed using only the SASDK.

Windows Mobile-based Pocket PC 2003 (Pocket PC) Multimodal Scenario

In this scenario, the client is Pocket Internet Explorer with the Speech Add-in for Microsoft Pocket Internet Explorer installed. ASP.NET speech-enabled Web application pages reside on the Web server, along with the application grammars, and a configuration file containing the URL to the Speech Server that performs speech processing.

When the user enters a URL on Pocket PC, the Web server opens the application's default .aspx page. The Web server also sends the URL pointing to Speech Server. The page that the Web server sends contains HTML, SALT, and JScript. When the user taps a speech-enabled HTML element and talks, Pocket PC sends the audio to Speech Server. Along with the compressed audio, Pocket PC sends either an inline recognition grammar, or a pointer to the location of an externally-stored recognition grammar that is bound to that speech-enabled element. If the recognition grammar is an inline grammar, Speech Server loads the grammar and performs speech recognition. If the grammar is an externally-stored grammar, Speech Server first downloads a copy of the grammar, loads the grammar, and then performs speech recognition.
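
The two grammar paths can be sketched as follows. This is an illustrative Python model, not Speech Server code; the request keys `inline_grammar` and `grammar_url` and the injected `fetch` callable are assumptions made for the example.

```python
def resolve_grammar(request, fetch):
    """Return (grammar_text, steps) for one recognition request.

    request: dict holding either "inline_grammar" (the grammar itself) or
             "grammar_url" (a pointer to an externally stored grammar).
    fetch:   callable that downloads a grammar given its URL; it is injected
             so that the sketch runs without a network.
    """
    steps = []
    if "inline_grammar" in request:
        grammar = request["inline_grammar"]
    elif "grammar_url" in request:
        # External grammars are downloaded first, then loaded.
        grammar = fetch(request["grammar_url"])
        steps.append("download")
    else:
        raise ValueError("request carries no grammar")
    steps.extend(["load", "recognize"])
    return grammar, steps
```

An inline request produces the steps `["load", "recognize"]`, while an external request produces `["download", "load", "recognize"]`. The sketch does not model access control on the server.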

After the recognizer finishes, Speech Server sends Semantic Markup Language (SML) output to the Pocket PC, along with audio for prompts if the application dialogue flow requires the application to play a prompt. The Pocket PC client parses the SML output, populates the speech-enabled HTML element with the semantic value to which it is bound, and plays any prompts that Speech Server sends.
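
The parse-and-populate step can be sketched as follows. This is a Python illustration, not the client's implementation; the sample SML document, the element IDs, and the binding table are hypothetical, and real SML output depends on the grammar's semantic rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical SML result for the utterance "from seattle to boston".
SAMPLE_SML = (
    '<SML text="from seattle to boston" confidence="0.92">'
    '<origin confidence="0.95">Seattle</origin>'
    '<destination confidence="0.90">Boston</destination>'
    "</SML>"
)

# Hypothetical bindings: HTML element ID -> path of the SML node bound to it.
BINDINGS = {"txtOrigin": "origin", "txtDestination": "destination"}


def populate_fields(sml_text, bindings):
    """Parse SML and return {element_id: semantic value} for each bound node found."""
    root = ET.fromstring(sml_text)
    values = {}
    for element_id, path in bindings.items():
        node = root.find(path)
        if node is not None and node.text:
            values[element_id] = node.text.strip()
    return values
```

Bindings whose node is missing from the SML output are skipped, so a partial recognition result fills only the elements it has values for.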

Note  Due to the nature of Pocket PC, all MIME types that the Web server sends to Pocket PC are converted to ActiveX controls before being passed to the Speech Add-in client. The client actually processes ActiveX input.

Note  If the recognition grammar is an inline grammar, Speech Server loads the grammar and performs speech recognition only if the Pocket PC user account name is listed on the Access Control List (ACL) of the Speech Server.

Note  The Speech Add-in for Microsoft Pocket Internet Explorer is available on the Microsoft Speech Server CD.