Add real-time transcription into your application

Article
02/07/2025

Important

Functionality described in this article is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This guide helps you better understand the different ways you can use Azure Communication Services offering of real-time transcription through Call Automation SDKs.

Prerequisites

Azure account with an active subscription, for details see Create an account for free.
Azure Communication Services resource, see Create an Azure Communication Services resource
Create and connect Azure AI services to your Azure Communication Services resource.
Create a custom subdomain for your Azure AI services resource.
Create a new web service application using the Call Automation SDK.

Setup a WebSocket Server

Azure Communication Services requires your server application to set up a WebSocket server to stream transcription in real-time. WebSocket is a standardized protocol that provides a full-duplex communication channel over a single TCP connection. You can optionally use Azure services Azure WebApps that allows you to create an application to receive transcripts over a websocket connection. Follow this quickstart.

Establish a call

In this quickstart, we assume that you're already familiar with starting calls. If you need to learn more about starting and establishing calls, you can follow our quickstart. For the purposes of this quickstart, we're going through the process of starting transcription for both incoming calls and outbound calls.

When working with real-time transcription, you have a few of options on when and how to start transcription:

Option 1 - Starting at time of answering or creating a call

Option 2 - Starting transcription during an ongoing call

Option 3 - Starting transcription when connecting to an Azure Communication Services Rooms call

In this tutorial, we're demonstrating option 2 and 3, starting transcription during an ongoing call or when connecting to a Rooms call. By default the 'startTranscription' is set to false at time of answering or creating a call.

Create a call and provide the transcription details

Define the TranscriptionOptions for ACS to specify when to start the transcription, specify the locale for transcription, and the web socket connection for sending the transcript.

var createCallOptions = new CreateCallOptions(callInvite, callbackUri)
{
    CallIntelligenceOptions = new CallIntelligenceOptions() { CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) },
    TranscriptionOptions = new TranscriptionOptions(new Uri(""), "en-US", false, TranscriptionTransport.Websocket)
};
CreateCallResult createCallResult = await callAutomationClient.CreateCallAsync(createCallOptions);

Connect to a Rooms call and provide transcription details

If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:

var transcriptionOptions = new TranscriptionOptions(
    transportUri: new Uri(""),
    locale: "en-US", 
    startTranscription: false,
    transcriptionTransport: TranscriptionTransport.Websocket
);

var connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), callbackUri)
{
    CallIntelligenceOptions = new CallIntelligenceOptions() 
    { 
        CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) 
    },
    TranscriptionOptions = transcriptionOptions
};

var connectResult = await client.ConnectCallAsync(connectCallOptions);

Start Transcription

Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.

// Start transcription with options
StartTranscriptionOptions options = new StartTranscriptionOptions()
{
    OperationContext = "startMediaStreamingContext",
    //Locale = "en-US",
};

await callMedia.StartTranscriptionAsync(options);

// Alternative: Start transcription without options
// await callMedia.StartTranscriptionAsync();

Receiving Transcription Stream

When transcription starts, your websocket receives the transcription metadata payload as the first packet.

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

Receiving Transcription data

After the metadata, the next packets your web socket receives will be TranscriptionData for the transcribed audio.

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Handling transcription stream in the web socket server

using WebServerApi;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

var app = builder.Build();
app.UseWebSockets();
app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await HandleWebSocket.Echo(webSocket);
    }
    else
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
});

app.Run();

Updates to your code for the websocket handler

using Azure.Communication.CallAutomation;
using System.Net.WebSockets;
using System.Text;

namespace WebServerApi
{
    public class HandleWebSocket
    {
        public static async Task Echo(WebSocket webSocket)
        {
            var buffer = new byte[1024 * 4];
            var receiveResult = await webSocket.ReceiveAsync(
                new ArraySegment(buffer), CancellationToken.None);

            while (!receiveResult.CloseStatus.HasValue)
            {
                string msg = Encoding.UTF8.GetString(buffer, 0, receiveResult.Count);
                var response = StreamingDataParser.Parse(msg);

                if (response != null)
                {
                    if (response is AudioMetadata audioMetadata)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("MEDIA SUBSCRIPTION ID-->"+audioMetadata.MediaSubscriptionId);
                        Console.WriteLine("ENCODING-->"+audioMetadata.Encoding);
                        Console.WriteLine("SAMPLE RATE-->"+audioMetadata.SampleRate);
                        Console.WriteLine("CHANNELS-->"+audioMetadata.Channels);
                        Console.WriteLine("LENGTH-->"+audioMetadata.Length);
                        Console.WriteLine("***************************************************************************************");
                    }
                    if (response is AudioData audioData)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("DATA-->"+audioData.Data);
                        Console.WriteLine("TIMESTAMP-->"+audioData.Timestamp);
                        Console.WriteLine("IS SILENT-->"+audioData.IsSilent);
                        Console.WriteLine("***************************************************************************************");
                    }

                    if (response is TranscriptionMetadata transcriptionMetadata)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("TRANSCRIPTION SUBSCRIPTION ID-->"+transcriptionMetadata.TranscriptionSubscriptionId);
                        Console.WriteLine("LOCALE-->"+transcriptionMetadata.Locale);
                        Console.WriteLine("CALL CONNECTION ID--?"+transcriptionMetadata.CallConnectionId);
                        Console.WriteLine("CORRELATION ID-->"+transcriptionMetadata.CorrelationId);
                        Console.WriteLine("***************************************************************************************");
                    }
                    if (response is TranscriptionData transcriptionData)
                    {
                        Console.WriteLine("***************************************************************************************");
                        Console.WriteLine("TEXT-->"+transcriptionData.Text);
                        Console.WriteLine("FORMAT-->"+transcriptionData.Format);
                        Console.WriteLine("OFFSET-->"+transcriptionData.Offset);
                        Console.WriteLine("DURATION-->"+transcriptionData.Duration);
                        Console.WriteLine("PARTICIPANT-->"+transcriptionData.Participant.RawId);
                        Console.WriteLine("CONFIDENCE-->"+transcriptionData.Confidence);

                        foreach (var word in transcriptionData.Words)
                        {
                            Console.WriteLine("TEXT-->"+word.Text);
                            Console.WriteLine("OFFSET-->"+word.Offset);
                            Console.WriteLine("DURATION-->"+word.Duration);
                        }
                        Console.WriteLine("***************************************************************************************");
                    }
                }

                await webSocket.SendAsync(
                    new ArraySegment(buffer, 0, receiveResult.Count),
                    receiveResult.MessageType,
                    receiveResult.EndOfMessage,
                    CancellationToken.None);

                receiveResult = await webSocket.ReceiveAsync(
                    new ArraySegment(buffer), CancellationToken.None);
            }

            await webSocket.CloseAsync(
                receiveResult.CloseStatus.Value,
                receiveResult.CloseStatusDescription,
                CancellationToken.None);
        }
    }
}

Update Transcription

For situations where your application allows users to select their preferred language you may also want to capture the transcription in that language. To do this, Call Automation SDK allows you to update the transcription locale.

await callMedia.UpdateTranscriptionAsync("en-US-NancyNeural");

Stop Transcription

When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.

StopTranscriptionOptions stopOptions = new StopTranscriptionOptions()
{
    OperationContext = "stopTranscription"
};

await callMedia.StopTranscriptionAsync(stopOptions);

Create a call and provide the transcription details

Define the TranscriptionOptions for ACS to specify when to start the transcription, the locale for transcription, and the web socket connection for sending the transcript.

CallInvite callInvite = new CallInvite(target, caller); 

CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions()
    .setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint()); 

TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
    appConfig.getWebSocketUrl(), 
    TranscriptionTransport.WEBSOCKET, 
    "en-US", 
    false
); 

CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, appConfig.getCallBackUri());
createCallOptions.setCallIntelligenceOptions(callIntelligenceOptions); 
createCallOptions.setTranscriptionOptions(transcriptionOptions); 

Response result = client.createCallWithResponse(createCallOptions, Context.NONE); 
return result.getValue().getCallConnectionProperties().getCallConnectionId();

Connect to a Rooms call and provide transcription details

If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:

TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
    appConfig.getWebSocketUrl(), 
    TranscriptionTransport.WEBSOCKET, 
    "en-US", 
    false
);

ConnectCallOptions connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), appConfig.getCallBackUri())
    .setCallIntelligenceOptions(
        new CallIntelligenceOptions()
            .setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint())
    )
    .setTranscriptionOptions(transcriptionOptions);

ConnectCallResult connectCallResult = Objects.requireNonNull(client
    .connectCallWithResponse(connectCallOptions)
    .block())
    .getValue();

Start Transcription

Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.

//Option 1: Start transcription with options
StartTranscriptionOptions transcriptionOptions = new StartTranscriptionOptions()
    .setOperationContext("startMediaStreamingContext"); 

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .startTranscriptionWithResponse(transcriptionOptions, Context.NONE); 

// Alternative: Start transcription without options
// client.getCallConnection(callConnectionId)
//     .getCallMedia()
//     .startTranscription();

Receiving Transcription Stream

When transcription starts, your websocket receives the transcription metadata payload as the first packet.

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

Receiving Transcription data

After the metadata, the next packets your web socket receives will be TranscriptionData for the transcribed audio.

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Handling transcription stream in the web socket server

package com.example;

import org.glassfish.tyrus.server.Server;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class App {
    public static void main(String[] args) {
        Server server = new Server("localhost", 8081, "/ws", null, WebSocketServer.class);

        try {
            server.start();
            System.out.println("Web socket running on port 8081...");
            System.out.println("wss://localhost:8081/ws/server");
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            reader.readLine();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            server.stop();
        }
    }
}

Updates to your code for the websocket handler

package com.example;

import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

import com.azure.communication.callautomation.models.streaming.StreamingData;
import com.azure.communication.callautomation.models.streaming.StreamingDataParser;
import com.azure.communication.callautomation.models.streaming.media.AudioData;
import com.azure.communication.callautomation.models.streaming.media.AudioMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionData;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.Word;

@ServerEndpoint("/server")
public class WebSocketServer {
    @OnMessage
    public void onMessage(String message, Session session) {
        StreamingData data = StreamingDataParser.parse(message);

        if (data instanceof AudioMetadata) {
            AudioMetadata audioMetaData = (AudioMetadata) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("SUBSCRIPTION ID: --> " + audioMetaData.getMediaSubscriptionId());
            System.out.println("ENCODING: --> " + audioMetaData.getEncoding());
            System.out.println("SAMPLE RATE: --> " + audioMetaData.getSampleRate());
            System.out.println("CHANNELS: --> " + audioMetaData.getChannels());
            System.out.println("LENGTH: --> " + audioMetaData.getLength());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof AudioData) {
            AudioData audioData = (AudioData) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("DATA: --> " + audioData.getData());
            System.out.println("TIMESTAMP: --> " + audioData.getTimestamp());
            System.out.println("IS SILENT: --> " + audioData.isSilent());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof TranscriptionMetadata) {
            TranscriptionMetadata transcriptionMetadata = (TranscriptionMetadata) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("TRANSCRIPTION SUBSCRIPTION ID: --> " + transcriptionMetadata.getTranscriptionSubscriptionId());
            System.out.println("IS SILENT: --> " + transcriptionMetadata.getLocale());
            System.out.println("CALL CONNECTION ID: --> " + transcriptionMetadata.getCallConnectionId());
            System.out.println("CORRELATION ID: --> " + transcriptionMetadata.getCorrelationId());
            System.out.println("----------------------------------------------------------------");
        }

        if (data instanceof TranscriptionData) {
            TranscriptionData transcriptionData = (TranscriptionData) data;
            System.out.println("----------------------------------------------------------------");
            System.out.println("TEXT: --> " + transcriptionData.getText());
            System.out.println("FORMAT: --> " + transcriptionData.getFormat());
            System.out.println("CONFIDENCE: --> " + transcriptionData.getConfidence());
            System.out.println("OFFSET: --> " + transcriptionData.getOffset());
            System.out.println("DURATION: --> " + transcriptionData.getDuration());
            System.out.println("RESULT STATUS: --> " + transcriptionData.getResultStatus());
            for (Word word : transcriptionData.getWords()) {
                System.out.println("Text: --> " + word.getText());
                System.out.println("Offset: --> " + word.getOffset());
                System.out.println("Duration: --> " + word.getDuration());
            }
            System.out.println("----------------------------------------------------------------");
        }
    }
}

Update Transcription

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .updateTranscription("en-US-NancyNeural");

Stop Transcription

When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.

// Option 1: Stop transcription with options
StopTranscriptionOptions stopTranscriptionOptions = new StopTranscriptionOptions()
    .setOperationContext("stopTranscription");

client.getCallConnection(callConnectionId)
    .getCallMedia()
    .stopTranscriptionWithResponse(stopTranscriptionOptions, Context.NONE);

// Alternative: Stop transcription without options
// client.getCallConnection(callConnectionId)
//     .getCallMedia()
//     .stopTranscription();

Create a call and provide the transcription details

Define the TranscriptionOptions for ACS to specify when to start the transcription, the locale for transcription, and the web socket connection for sending the transcript.

const transcriptionOptions = {
    transportUrl: "",
    transportType: "websocket",
    locale: "en-US",
    startTranscription: false
};

const options = {
    callIntelligenceOptions: {
        cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
    },
    transcriptionOptions: transcriptionOptions
};

console.log("Placing outbound call...");
acsClient.createCall(callInvite, process.env.CALLBACK_URI + "/api/callbacks", options);

Connect to a Rooms call and provide transcription details

If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:

const transcriptionOptions = {
    transportUri: "",
    locale: "en-US",
    transcriptionTransport: "websocket",
    startTranscription: false
};

const callIntelligenceOptions = {
    cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
};

const connectCallOptions = {
    callIntelligenceOptions: callIntelligenceOptions,
    transcriptionOptions: transcriptionOptions
};

const callLocator = {
    id: roomId,
    kind: "roomCallLocator"
};

const connectResult = await client.connectCall(callLocator, callBackUri, connectCallOptions);

Start Transcription

Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.

const startTranscriptionOptions = {
    locale: "en-AU",
    operationContext: "startTranscriptionContext"
};

// Start transcription with options
await callMedia.startTranscription(startTranscriptionOptions);

// Alternative: Start transcription without options
// await callMedia.startTranscription();

Receiving Transcription Stream

When transcription starts, your websocket receives the transcription metadata payload as the first packet.

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

Receiving Transcription Data

After the metadata, the next packets your web socket receives will be TranscriptionData for the transcribed audio.

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Handling transcription stream in the web socket server

import WebSocket from 'ws';
import { streamingData } from '@azure/communication-call-automation/src/util/streamingDataParser';

const wss = new WebSocket.Server({ port: 8081 });

wss.on('connection', (ws) => {
    console.log('Client connected');

    ws.on('message', (packetData) => {
        const decoder = new TextDecoder();
        const stringJson = decoder.decode(packetData);
        console.log("STRING JSON => " + stringJson);

        const response = streamingData(packetData);
        
        if ('locale' in response) {
            console.log("Transcription Metadata");
            console.log(response.callConnectionId);
            console.log(response.correlationId);
            console.log(response.locale);
            console.log(response.subscriptionId);
        }
        
        if ('text' in response) {
            console.log("Transcription Data");
            console.log(response.text);
            console.log(response.format);
            console.log(response.confidence);
            console.log(response.offset);
            console.log(response.duration);
            console.log(response.resultStatus);

            if ('phoneNumber' in response.participant) {
                console.log(response.participant.phoneNumber);
            }

            response.words.forEach((word) => {
                console.log(word.text);
                console.log(word.duration);
                console.log(word.offset);
            });
        }
    });

    ws.on('close', () => {
        console.log('Client disconnected');
    });
});

console.log('WebSocket server running on port 8081');

Update Transcription

For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this task, the Call Automation SDK allows you to update the transcription locale.

await callMedia.updateTranscription("en-US-NancyNeural");

Stop Transcription

When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.

const stopTranscriptionOptions = {
    operationContext: "stopTranscriptionContext"
};

// Stop transcription with options
await callMedia.stopTranscription(stopTranscriptionOptions);

// Alternative: Stop transcription without options
// await callMedia.stopTranscription();

Create a call and provide the transcription details

Define the TranscriptionOptions for ACS to specify when to start the transcription, the locale for transcription, and the web socket connection for sending the transcript.

transcription_options = TranscriptionOptions(
    transport_url=" ",
    transport_type=TranscriptionTransportType.WEBSOCKET,
    locale="en-US",
    start_transcription=False
)

call_connection_properties = call_automation_client.create_call(
    target_participant,
    CALLBACK_EVENTS_URI,
    cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
    source_caller_id_number=source_caller,
    transcription=transcription_options
)

Connect to a Rooms call and provide transcription details

If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:

transcription_options = TranscriptionOptions(
    transport_url="",
    transport_type=TranscriptionTransportType.WEBSOCKET,
    locale="en-US",
    start_transcription=False
)

connect_result = client.connect_call(
    room_id="roomid",
    CALLBACK_EVENTS_URI,
    cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
    operation_context="connectCallContext",
    transcription=transcription_options
)

Start Transcription

Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.

# Start transcription without options
call_connection_client.start_transcription()

# Option 1: Start transcription with locale and operation context
# call_connection_client.start_transcription(locale="en-AU", operation_context="startTranscriptionContext")

# Option 2: Start transcription with operation context
# call_connection_client.start_transcription(operation_context="startTranscriptionContext")

Receiving Transcription Stream

When transcription starts, your websocket receives the transcription metadata payload as the first packet.

{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}

Receiving Transcription Data

After the metadata, the next packets your websocket receives will be TranscriptionData for the transcribed audio.

{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}

Handling transcription stream in the web socket server

import asyncio
import json
import websockets
from azure.communication.callautomation._shared.models import identifier_from_raw_id

async def handle_client(websocket, path):
    print("Client connected")
    try:
        async for message in websocket:
            json_object = json.loads(message)
            kind = json_object['kind']
            if kind == 'TranscriptionMetadata':
                print("Transcription metadata")
                print("-------------------------")
                print("Subscription ID:", json_object['transcriptionMetadata']['subscriptionId'])
                print("Locale:", json_object['transcriptionMetadata']['locale'])
                print("Call Connection ID:", json_object['transcriptionMetadata']['callConnectionId'])
                print("Correlation ID:", json_object['transcriptionMetadata']['correlationId'])
            if kind == 'TranscriptionData':
                participant = identifier_from_raw_id(json_object['transcriptionData']['participantRawID'])
                word_data_list = json_object['transcriptionData']['words']
                print("Transcription data")
                print("-------------------------")
                print("Text:", json_object['transcriptionData']['text'])
                print("Format:", json_object['transcriptionData']['format'])
                print("Confidence:", json_object['transcriptionData']['confidence'])
                print("Offset:", json_object['transcriptionData']['offset'])
                print("Duration:", json_object['transcriptionData']['duration'])
                print("Participant:", participant.raw_id)
                print("Result Status:", json_object['transcriptionData']['resultStatus'])
                for word in word_data_list:
                    print("Word:", word['text'])
                    print("Offset:", word['offset'])
                    print("Duration:", word['duration'])
            
    except websockets.exceptions.ConnectionClosedOK:
        print("Client disconnected")
    except websockets.exceptions.ConnectionClosedError as e:
        print("Connection closed with error: %s", e)
    except Exception as e:
        print("Unexpected error: %s", e)

start_server = websockets.serve(handle_client, "localhost", 8081)

print('WebSocket server running on port 8081')

asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()

Update Transcription

await call_connection_client.update_transcription(locale="en-US-NancyNeural")

Stop Transcription

When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.

# Stop transcription without options
call_connection_client.stop_transcription()

# Alternative: Stop transcription with operation context
# call_connection_client.stop_transcription(operation_context="stopTranscriptionContext")

Event codes

Event	code	subcode	Message
TranscriptionStarted	200	0	Action completed successfully.
TranscriptionStopped	200	0	Action completed successfully.
TranscriptionUpdated	200	0	Action completed successfully.
TranscriptionFailed	400	8581	Action failed, StreamUrl isn't valid.
TrasncriptionFailed	400	8565	Action failed due to a bad request to Cognitive Services. Check your input parameters.
TranscriptionFailed	400	8565	Action failed due to a request to Cognitive Services timing out. Try again later or check for any issues with the service.
TranscriptionFailed	400	8605	Custom speech recognition model for Transcription is not supported.
TranscriptionFailed	400	8523	Invalid Request, locale is missing.
TranscriptionFailed	400	8523	Invalid Request, only locale that contain region information are supported.
TranscriptionFailed	405	8520	Transcription functionality is not supported at this time.
TranscriptionFailed	405	8520	UpdateTranscription is not supported for connection created with Connect interface.
TranscriptionFailed	400	8528	Action is invalid, call already terminated.
TranscriptionFailed	405	8520	Update transcription functionality is not supported at this time.
TranscriptionFailed	405	8522	Request not allowed when Transcription url not set during call setup.
TranscriptionFailed	405	8522	Request not allowed when Cognitive Service Configuration not set during call setup.
TranscriptionFailed	400	8501	Action is invalid when call is not in Established state.
TranscriptionFailed	401	8565	Action failed due to a Cognitive Services authentication error. Check your authorization input and ensure it's correct.
TranscriptionFailed	403	8565	Action failed due to a forbidden request to Cognitive Services. Check your subscription status and ensure it's active.
TranscriptionFailed	429	8565	Action failed, requests exceeded the number of allowed concurrent requests for the cognitive services subscription.
TranscriptionFailed	500	8578	Action failed, not able to establish WebSocket connection.
TranscriptionFailed	500	8580	Action failed, transcription service was shut down.
TranscriptionFailed	500	8579	Action failed, transcription was canceled.
TranscriptionFailed	500	9999	Unknown internal server error.

Known issues

For 1:1 calls with ACS users using Client SDKs startTranscription = True isn't currently supported.

Share via

Add real-time transcription into your application

Prerequisites

Setup a WebSocket Server

Establish a call

Create a call and provide the transcription details

Connect to a Rooms call and provide transcription details

Start Transcription

Receiving Transcription Stream

Receiving Transcription data

Handling transcription stream in the web socket server

Updates to your code for the websocket handler

Update Transcription

Stop Transcription

Create a call and provide the transcription details

Connect to a Rooms call and provide transcription details

Start Transcription

Receiving Transcription Stream

Receiving Transcription data

Handling transcription stream in the web socket server

Updates to your code for the websocket handler

Update Transcription

Stop Transcription

Create a call and provide the transcription details

Connect to a Rooms call and provide transcription details

Start Transcription

Receiving Transcription Stream

Receiving Transcription Data

Handling transcription stream in the web socket server

Update Transcription

Stop Transcription

Create a call and provide the transcription details

Connect to a Rooms call and provide transcription details

Start Transcription

Receiving Transcription Stream

Receiving Transcription Data

Handling transcription stream in the web socket server

Update Transcription

Stop Transcription

Event codes

Known issues

Feedback

Additional resources