Use the fast transcription API with Azure AI Speech

[Article]
11/14/2024

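The fast transcription API transcribes audio files and returns the results synchronously, faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible and with predictable latency, in scenarios such as:

Quick transcription, captioning, and editing of audio or video.
Video translation.

Unlike the batch transcription API, the fast transcription API produces transcriptions only in display (not lexical) form. The display form is the more readable form of the transcription, with punctuation and capitalization.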

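Prerequisites

An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US, West US 2, and West US 3. For more information about the regions supported for other Speech service features, see "Speech service regions."
An audio file (less than 2 hours long and less than 200 MB in size) in one of the formats and codecs supported by the batch transcription API. For more information about supported audio formats, see the supported audio formats section.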
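
As a rough illustration of the size prerequisite, the following Python sketch checks a local file before it is uploaded. The helper name `is_within_size_limit` is hypothetical, and reading "200 MB" as 200 × 1024 × 1024 bytes is an assumption. The 2-hour duration limit isn't checked, because that would require decoding the audio.

```python
import os
import tempfile

# Assumption: "200 MB" is interpreted here as 200 * 1024 * 1024 bytes.
MAX_AUDIO_BYTES = 200 * 1024 * 1024


def is_within_size_limit(path, max_bytes=MAX_AUDIO_BYTES):
    """Return True if the file at `path` is smaller than `max_bytes`."""
    return os.path.getsize(path) < max_bytes


# Demonstrate with a small temporary file standing in for real audio.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
sample_ok = is_within_size_limit(tmp.name)
os.remove(tmp.name)
print(sample_ok)
```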

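Use the fast transcription API

Tip

Try out fast transcription in Azure AI Studio.

The following scenarios show how to use the fast transcription API (through the Transcriptions - Transcribe operation):

Known locale specified: Transcribe an audio file with a specified locale. If you know the locale of the audio file, specify it to improve transcription accuracy and minimize latency.
Language identification on: Transcribe an audio file with language identification turned on. If you don't know the locale of the audio file, turn on language identification so that the Speech service can identify the locale.
Diarization on: Transcribe an audio file with diarization turned on. Diarization distinguishes between the different speakers in a conversation. The Speech service provides information about which speaker was speaking a particular part of the transcribed speech.
Multi-channel on: Transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.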

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, specify it to improve transcription accuracy and minimize latency.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with the region of your Speech resource.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition={"locales":["en-US"]}'

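Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.

For more information about locales and the other properties of the fast transcription API, see the "Request configuration options" section later in this guide.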
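
For readers who call the API from code rather than from curl, here is a minimal Python sketch that assembles the same request pieces: the endpoint URL, the key header, and the `definition` form field. The function names `build_transcribe_url` and `build_definition` are hypothetical helpers, not part of any SDK, and nothing is sent over the network.

```python
import json

API_VERSION = "2024-11-15"  # Taken from the curl example in this section.


def build_transcribe_url(region, api_version=API_VERSION):
    """Return the transcriptions:transcribe endpoint URL for a region."""
    return (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version={api_version}"
    )


def build_definition(locales):
    """Serialize the `definition` form field as a compact JSON string."""
    return json.dumps({"locales": list(locales)}, separators=(",", ":"))


# Placeholders as in the curl example; replace them with real values.
url = build_transcribe_url("YourServiceRegion")
headers = {"Ocp-Apim-Subscription-Key": "YourSubscriptionKey"}
definition = build_definition(["en-US"])

print(url)
print(definition)
```

An HTTP client of your choice would then send `definition` and the audio file as the two parts of a multipart/form-data POST to `url`.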

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		},
		{
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		}
	]
}
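
To show how the timing fields fit together, the following Python sketch turns each phrase's offsetMilliseconds and durationMilliseconds into start and end timestamps, which is the kind of step a captioning workflow might need. The sample is a trimmed copy of the response (two phrases, with combinedPhrases shortened to match), and the helper names `format_timestamp` and `phrase_timeline` are hypothetical.

```python
import json

# A trimmed copy of the response: two phrases only.
response_text = """
{
  "durationMilliseconds": 182439,
  "combinedPhrases": [{"text": "Good afternoon. This is Sam."}],
  "phrases": [
    {"offsetMilliseconds": 960, "durationMilliseconds": 640,
     "text": "Good afternoon.", "locale": "en-US", "confidence": 0.93616915},
    {"offsetMilliseconds": 1600, "durationMilliseconds": 640,
     "text": "This is Sam.", "locale": "en-US", "confidence": 0.93616915}
  ]
}
"""


def format_timestamp(ms):
    """Format a millisecond count as MM:SS.mmm."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def phrase_timeline(response):
    """Return a (start, end, text) tuple for each phrase in the response."""
    timeline = []
    for phrase in response["phrases"]:
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        timeline.append((format_timestamp(start), format_timestamp(end), phrase["text"]))
    return timeline


timeline = phrase_timeline(json.loads(response_text))
for start, end, text in timeline:
    print(f"{start} - {end}  {text}")
```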

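Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with language identification turned on. If you aren't sure about the locale, you can specify multiple locales. If you don't specify any locale, or if the locales that you specify aren't in the audio file, the Speech service tries to identify the locale.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with the region of your Speech resource.
Replace YourAudioFile with the path to your audio file.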

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition={"locales":["en-US","ja-JP"]}'

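Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locales are set to en-US and ja-JP. The supported locales that you can specify are de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.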

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"text": "Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh If you can answer a few questions, we can get you signed up in a Jiffy. Okay. So what's your full name? uh So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. Got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yep, that'll be fine. Sure. So it's 234-554 and then 9312. Got it. So to confirm, it's 234-554-9312. Yep, that's right. Excellent. Let's get some additional information for your application. Do you have a job? Uh Yes, I am self-employed. Okay, so then you have a social security number as well? Uh Yes, I do. Okay, and what is your social security number, please? Uh Sure, so it's 412-253-4931. 6789. Sorry, was that a 25 or a 225? You cut out for a bit. It's double two, so 412, then another two, then five. Thank you so much. And could I have your e-mail address, please? Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. That sounds good. Thank you. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Actually, so I have one more question. Yes, of course. I'm curious, will I be getting a physical card as proof of coverage? So the default is a digital membership card, but we can send you a physical card if you prefer. Uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? Uh Yeah. uh So it's 2660 Unit A on Maple Avenue, Southeast Lansing, and then zip code is 48823. Absolutely. I've made a note on your file. Awesome. Thanks so much. You're very welcome. Thank you for calling Contoso and have a great day."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 720,
			"durationMilliseconds": 1600,
			"text": "Hello, thank you for calling Contoso.",
			"words": [
				{
					"text": "Hello,",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				},
				{
					"text": "thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 6120,
			"durationMilliseconds": 1800,
			"text": "I'm trying to enroll myself with Contoso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6120,
					"durationMilliseconds": 120
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 720
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 181520,
			"durationMilliseconds": 720,
			"text": "You're very welcome.",
			"words": [
				{
					"text": "You're",
					"offsetMilliseconds": 181520,
					"durationMilliseconds": 160
				},
				{
					"text": "very",
					"offsetMilliseconds": 181680,
					"durationMilliseconds": 200
				},
				{
					"text": "welcome.",
					"offsetMilliseconds": 181880,
					"durationMilliseconds": 360
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		},
		{
			"offsetMilliseconds": 182320,
			"durationMilliseconds": 1840,
			"text": "Thank you for calling Contoso and have a great day.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 182320,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 182520,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 182600,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 182720,
					"durationMilliseconds": 280
				},
				{
					"text": "Contoso",
					"offsetMilliseconds": 183000,
					"durationMilliseconds": 520
				},
				{
					"text": "and",
					"offsetMilliseconds": 183520,
					"durationMilliseconds": 160
				},
				{
					"text": "have",
					"offsetMilliseconds": 183680,
					"durationMilliseconds": 120
				},
				{
					"text": "a",
					"offsetMilliseconds": 183800,
					"durationMilliseconds": 40
				},
				{
					"text": "great",
					"offsetMilliseconds": 183840,
					"durationMilliseconds": 200
				},
				{
					"text": "day.",
					"offsetMilliseconds": 184040,
					"durationMilliseconds": 120
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		}
	]
}
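
Because each phrase in the response carries its own locale field, a client can summarize which locales were reported. The Python sketch below sums phrase durations per locale. The ja-JP phrase is invented purely to show a mixed result and does not come from the sample response; whether a single file actually yields more than one locale isn't stated in this guide. The helper name `speech_time_by_locale` is hypothetical.

```python
from collections import Counter

# Phrases shaped like the response; the ja-JP entry is invented for illustration.
phrases = [
    {"text": "Hello, thank you for calling Contoso.", "locale": "en-US", "durationMilliseconds": 1600},
    {"text": "Who am I speaking with today?", "locale": "en-US", "durationMilliseconds": 1120},
    {"text": "こんにちは。", "locale": "ja-JP", "durationMilliseconds": 800},
]


def speech_time_by_locale(phrases):
    """Sum phrase durations (in milliseconds) for each reported locale."""
    totals = Counter()
    for phrase in phrases:
        totals[phrase["locale"]] += phrase["durationMilliseconds"]
    return dict(totals)


totals = speech_time_by_locale(phrases)
print(totals)
```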

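Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with diarization turned on. Diarization distinguishes between the different speakers in a conversation. The Speech service provides information about which speaker was speaking a particular part of the transcribed speech.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with the region of your Speech resource.
Replace YourAudioFile with the path to your audio file.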

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition={"locales":["en-US"],"diarization":{"maxSpeakers":2,"enabled":true}}'

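Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the diarization property to recognize and separate multiple speakers in a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription then contains a speaker entry for each transcribed phrase.

For more information about locales, diarization, and the other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. In this example, diarization is enabled, so the response includes speaker information for each transcribed phrase. The combinedPhrases property contains the full transcription for all speakers in a single channel.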

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh. Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? Uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And. You're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		}
    ]
}
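
With diarization on, the speaker field makes it possible to rebuild the conversation as speaker turns. The Python sketch below merges consecutive phrases from the same speaker, using the first six phrases of the response reduced to the fields it needs. The helper name `speaker_turns` is hypothetical, and the speaker numbers are only labels assigned by the service, not identities.

```python
from itertools import groupby

# The first six phrases of the diarized response, reduced to the fields used here.
phrases = [
    {"speaker": 1, "offsetMilliseconds": 960, "text": "Good afternoon."},
    {"speaker": 1, "offsetMilliseconds": 1600, "text": "This is Sam."},
    {"speaker": 1, "offsetMilliseconds": 2240, "text": "Thank you for calling Contoso."},
    {"speaker": 1, "offsetMilliseconds": 3280, "text": "How can I help?"},
    {"speaker": 0, "offsetMilliseconds": 5040, "text": "Hi there."},
    {"speaker": 0, "offsetMilliseconds": 5440, "text": "My name is Mary."},
]


def speaker_turns(phrases):
    """Merge consecutive phrases by the same speaker into (speaker, text) turns."""
    ordered = sorted(phrases, key=lambda p: p["offsetMilliseconds"])
    turns = []
    for speaker, group in groupby(ordered, key=lambda p: p["speaker"]):
        turns.append((speaker, " ".join(p["text"] for p in group)))
    return turns


turns = speaker_turns(phrases)
for speaker, text in turns:
    print(f"Speaker {speaker}: {text}")
```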

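Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with the region of your Speech resource.
Replace YourAudioFile with the path to your audio file.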

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition={"locales":["en-US"],"channels":[0,1]}'

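Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the channels property to specify the zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. In this example, channels 0 and 1 are specified.

For more information about locales, channels, and the other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The channel property identifies the channel when the audio file contains multiple channels. The combinedPhrases property contains a separate full transcription for each audio channel. To find the full transcription for each channel, look for "channel": 0,"text" and "channel": 1,"text".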

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
		},
		{
			"channel": 1,
			"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"offsetMilliseconds": 720,
			"durationMilliseconds": 480,
			"text": "Hello.",
			"words": [
				{
					"text": "Hello.",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 1200,
			"durationMilliseconds": 1120,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 9520,
			"durationMilliseconds": 400,
			"text": "Hi, Mary.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 9520,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 9600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 1,
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		{
			"channel": 1,
			"offsetMilliseconds": 6080,
			"durationMilliseconds": 1920,
			"text": "I'm trying to enroll myself with Contuso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6080,
					"durationMilliseconds": 160
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contuso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 800
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		// More transcription results...
	    // Redacted for brevity
    ]
}
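
In the multi-channel response, phrases are listed channel by channel rather than in spoken order. The Python sketch below sorts the sample phrases by offsetMilliseconds to interleave the two channels into one dialogue. It assumes that offsets on every channel are measured from the same starting point (the beginning of the file), which the sample values suggest but this guide doesn't state outright. The helper name `interleave_channels` is hypothetical.

```python
# Phrases from the multi-channel response, reduced to the fields used here.
phrases = [
    {"channel": 0, "offsetMilliseconds": 720, "text": "Hello."},
    {"channel": 0, "offsetMilliseconds": 1200, "text": "Thank you for calling Contoso."},
    {"channel": 0, "offsetMilliseconds": 2320, "text": "Who am I speaking with today?"},
    {"channel": 0, "offsetMilliseconds": 9520, "text": "Hi, Mary."},
    {"channel": 1, "offsetMilliseconds": 4480, "text": "Hi, my name is Mary Rondo."},
    {"channel": 1, "offsetMilliseconds": 6080, "text": "I'm trying to enroll myself with Contuso."},
]


def interleave_channels(phrases):
    """Order phrases from all channels by start time to rebuild the dialogue."""
    ordered = sorted(phrases, key=lambda p: (p["offsetMilliseconds"], p["channel"]))
    return [(p["channel"], p["text"]) for p in ordered]


dialogue = interleave_channels(phrases)
for channel, text in dialogue:
    print(f"[channel {channel}] {text}")
```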

Request configuration options

Here are some property options for configuring a transcription when you call the Transcriptions - Transcribe operation.

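Property	Description	Required or optional
`channels`	The list of zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging. To transcribe the channels of a stereo audio file separately, you must specify `[0,1]`, `[0]`, or `[1]`. Otherwise, the stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, you can't set the `channels` property to `[0,1]`. The Speech service doesn't support diarization of multiple channels. For mono audio, the `channels` property is ignored, and the audio is always transcribed as a single channel.	Optional
`diarization`	The diarization configuration. Diarization is the process of recognizing and separating multiple speakers in a single audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription then contains a `speaker` entry (such as `"speaker": 0` or `"speaker": 1`) for each transcribed phrase.	Optional
`locales`	The list of locales that should match the expected locale of the audio data to transcribe. If you know the locale of the audio file, specify it to improve transcription accuracy and minimize latency. If a single locale is specified, that locale is used for transcription. If you aren't sure about the locale, you can specify multiple locales. Language identification might be more accurate with a more precise list of candidate locales. If you don't specify any locale, or if the locales that you specify aren't in the audio file, the Speech service still tries to identify the language. If the language can't be identified, an error is returned. The supported locales that you can specify are de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. You can get the latest supported languages from the Transcriptions - List Supported Locales REST API. For more information about locales, see the Speech service language support documentation.	Optional, but recommended if you know the expected locale.
`profanityFilterMode`	Specifies how to handle profanity in recognition results. Accepted values are `None` (disables profanity filtering), `Masked` (replaces profanity with asterisks), `Removed` (removes all profanity from the result), or `Tags` (adds profanity tags). The default value is `Masked`.	Optional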
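
The constraints in the table can be checked on the client before a request is sent. The following Python sketch is one possible pre-flight check, not the service's own validation, which may differ. The function name `validate_definition` and the wording of its messages are hypothetical; the rules it encodes are the ones listed in the table (allowed channel indices, no `[0,1]` together with diarization, and the four profanityFilterMode values).

```python
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def validate_definition(definition):
    """Return a list of problems found in a `definition` dict (empty if none)."""
    problems = []
    channels = definition.get("channels")
    diarization_on = definition.get("diarization", {}).get("enabled", False)

    if channels is not None:
        # Only channel indices 0 and 1 are described in the table.
        if not channels or any(c not in (0, 1) for c in channels):
            problems.append("channels must be [0], [1], or [0,1]")
        # Diarization of multiple channels isn't supported.
        if diarization_on and sorted(channels) == [0, 1]:
            problems.append("channels [0,1] can't be combined with diarization")

    # The default value is Masked when the property is omitted.
    mode = definition.get("profanityFilterMode", "Masked")
    if mode not in PROFANITY_MODES:
        problems.append(f"unknown profanityFilterMode: {mode}")
    return problems


ok = {"locales": ["en-US"], "diarization": {"maxSpeakers": 2, "enabled": True}}
bad = {"channels": [0, 1], "diarization": {"maxSpeakers": 2, "enabled": True}}

print(validate_definition(ok))
print(validate_definition(bad))
```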
