Use the fast transcription API with Azure AI Speech

Article
11/14/2024

The fast transcription API transcribes audio files synchronously and faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:

Quick audio or video transcription, subtitles, and editing.
Video translation

Unlike the batch transcription API, the fast transcription API produces transcriptions only in display form (not lexical form). The display form is a more human-readable form of the transcription that includes punctuation and capitalization.

Prerequisites

An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US, West US 2, West US 3. For more information about regions supported for other Speech service features, see Speech service regions.
An audio file (less than 2 hours long and less than 200 MB in size) in one of the formats and codecs supported by the batch transcription API. For more information about supported audio formats, see supported audio formats.
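
As a rough illustration of the size prerequisite, the following Python sketch checks a file before upload. The helper names are hypothetical, "200 MB" is interpreted as 200 * 1024 * 1024 bytes (an assumption), and the 2-hour duration limit is not checked because that would require decoding the audio.

```python
import os

# Size limit from the prerequisites. Reading "200 MB" as 200 * 1024 * 1024
# bytes is an assumption; the service may count megabytes differently.
MAX_AUDIO_BYTES = 200 * 1024 * 1024


def is_within_size_limit(size_in_bytes: int) -> bool:
    """Return True if a non-empty file of this size is below the limit."""
    return 0 < size_in_bytes < MAX_AUDIO_BYTES


def check_audio_file(path: str) -> bool:
    """Check a file on disk against the size limit (duration is not checked)."""
    return is_within_size_limit(os.path.getsize(path))
```

A check like this only catches oversized files early; the service remains the authority on what it accepts.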

Use the fast transcription API

Tip

Try out fast transcription in the Azure AI Foundry portal.

Learn how to use the fast transcription API (via Transcriptions - Transcribe) in the following scenarios:

Known locale specified: Transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.
Language identification on: Transcribe an audio file with language identification on. If you're not sure about the locale of the audio file, you can turn on language identification to let the Speech service identify the locale.
Diarization on: Transcribe an audio file with diarization on. Diarization distinguishes between different speakers in the conversation. The Speech service provides information about which speaker spoke a particular part of the transcribed speech.
Multi-channel on: Transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"]}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property that should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.

For more information about locales and other properties for the fast transcription API, see the Request configuration options section later in this guide.
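
The endpoint URL and the definition field from the curl example can also be assembled in code. The following Python sketch is one way to do that; the function names are hypothetical, the locale list is copied from the text above and may not be current, and the sketch only builds strings without sending a request.

```python
import json

# API version taken from the curl example above.
API_VERSION = "2024-11-15"

# Locales listed in this article; treat the set as a snapshot, not a guarantee.
SUPPORTED_LOCALES = {
    "de-DE", "en-IN", "en-US", "es-ES", "es-MX", "fr-FR",
    "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
}


def build_transcribe_url(region: str) -> str:
    """Build the transcribe endpoint URL for a Speech resource region."""
    return (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version={API_VERSION}"
    )


def build_definition(locales: list[str]) -> str:
    """Serialize the 'definition' form field for the given locales."""
    unsupported = [locale for locale in locales if locale not in SUPPORTED_LOCALES]
    if unsupported:
        raise ValueError(f"Unsupported locale(s): {unsupported}")
    return json.dumps({"locales": locales})


# Example: the values used in the curl command for a known locale.
# "eastus" is only an example region identifier.
url = build_transcribe_url("eastus")
definition = build_definition(["en-US"])
```

The resulting strings can be passed to any HTTP client that supports multipart/form-data, with the audio file sent as the audio field and the JSON string as the definition field.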

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcriptions for all speakers.

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? 
Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		},
		{
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		}
	]
}
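
To show how a response of this shape might be consumed, the following Python sketch extracts the full transcript and a per-phrase timeline in seconds. The function names are hypothetical, and the sample dictionary is a shortened copy of the response above, already parsed from JSON.

```python
def full_transcript(response: dict) -> str:
    """Join the text of all combinedPhrases entries into one string."""
    return " ".join(item["text"] for item in response.get("combinedPhrases", []))


def phrase_timeline(response: dict) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each phrase."""
    timeline = []
    for phrase in response.get("phrases", []):
        start_ms = phrase["offsetMilliseconds"]
        end_ms = start_ms + phrase["durationMilliseconds"]
        timeline.append((start_ms / 1000, end_ms / 1000, phrase["text"]))
    return timeline


# Shortened copy of the sample response above.
sample = {
    "durationMilliseconds": 182439,
    "combinedPhrases": [{"text": "Good afternoon. This is Sam."}],
    "phrases": [
        {"offsetMilliseconds": 960, "durationMilliseconds": 640, "text": "Good afternoon."},
        {"offsetMilliseconds": 1600, "durationMilliseconds": 640, "text": "This is Sam."},
    ],
}

for start, end, text in phrase_timeline(sample):
    print(f"{start:.2f}s to {end:.2f}s: {text}")
```

Offsets and durations are in milliseconds, so the end of a phrase is its offset plus its duration. The same pattern applies to the entries of the words array inside each phrase.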

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with language identification on. If you aren't sure about the language, you can specify multiple locales. If you don't specify any locale, or if the locales that you specify aren't present in the audio file, the Speech service tries to identify the locale.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US","ja-JP"]}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property that should match the expected locale of the audio data to transcribe. In this example, the locales are set to en-US and ja-JP. The supported locales that you can specify are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcriptions for all speakers.

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"text": "Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh If you can answer a few questions, we can get you signed up in a Jiffy. Okay. So what's your full name? uh So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. Got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yep, that'll be fine. Sure. So it's 234-554 and then 9312. Got it. So to confirm, it's 234-554-9312. Yep, that's right. Excellent. Let's get some additional information for your application. Do you have a job? Uh Yes, I am self-employed. Okay, so then you have a social security number as well? Uh Yes, I do. Okay, and what is your social security number, please? Uh Sure, so it's 412-253-4931. 6789. Sorry, was that a 25 or a 225? You cut out for a bit. It's double two, so 412, then another two, then five. Thank you so much. And could I have your e-mail address, please? Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. That sounds good. Thank you. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Actually, so I have one more question. Yes, of course. I'm curious, will I be getting a physical card as proof of coverage? So the default is a digital membership card, but we can send you a physical card if you prefer. Uh Yes. 
Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? Uh Yeah. uh So it's 2660 Unit A on Maple Avenue, Southeast Lansing, and then zip code is 48823. Absolutely. I've made a note on your file. Awesome. Thanks so much. You're very welcome. Thank you for calling Contoso and have a great day."
		}
	],
	"phrases": [
		{
			"offsetMilliseconds": 720,
			"durationMilliseconds": 1600,
			"text": "Hello, thank you for calling Contoso.",
			"words": [
				{
					"text": "Hello,",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				},
				{
					"text": "thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		{
			"offsetMilliseconds": 6120,
			"durationMilliseconds": 1800,
			"text": "I'm trying to enroll myself with Contoso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6120,
					"durationMilliseconds": 120
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 720
				}
			],
			"locale": "en-US",
			"confidence": 0.93265927
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"offsetMilliseconds": 181520,
			"durationMilliseconds": 720,
			"text": "You're very welcome.",
			"words": [
				{
					"text": "You're",
					"offsetMilliseconds": 181520,
					"durationMilliseconds": 160
				},
				{
					"text": "very",
					"offsetMilliseconds": 181680,
					"durationMilliseconds": 200
				},
				{
					"text": "welcome.",
					"offsetMilliseconds": 181880,
					"durationMilliseconds": 360
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		},
		{
			"offsetMilliseconds": 182320,
			"durationMilliseconds": 1840,
			"text": "Thank you for calling Contoso and have a great day.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 182320,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 182520,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 182600,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 182720,
					"durationMilliseconds": 280
				},
				{
					"text": "Contoso",
					"offsetMilliseconds": 183000,
					"durationMilliseconds": 520
				},
				{
					"text": "and",
					"offsetMilliseconds": 183520,
					"durationMilliseconds": 160
				},
				{
					"text": "have",
					"offsetMilliseconds": 183680,
					"durationMilliseconds": 120
				},
				{
					"text": "a",
					"offsetMilliseconds": 183800,
					"durationMilliseconds": 40
				},
				{
					"text": "great",
					"offsetMilliseconds": 183840,
					"durationMilliseconds": 200
				},
				{
					"text": "day.",
					"offsetMilliseconds": 184040,
					"durationMilliseconds": 120
				}
			],
			"locale": "en-US",
			"confidence": 0.90571773
		}
	]
}
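
When several candidate locales are sent, each phrase in the response carries the locale that was identified for it. The following Python sketch tallies those values; the function names are hypothetical, and the ja-JP phrase in the sample is invented for illustration (the sample response above contains only en-US phrases).

```python
from collections import Counter
from typing import Optional


def locale_counts(response: dict) -> Counter:
    """Count how many phrases were attributed to each locale."""
    return Counter(
        phrase["locale"]
        for phrase in response.get("phrases", [])
        if "locale" in phrase
    )


def dominant_locale(response: dict) -> Optional[str]:
    """Return the most frequent phrase locale, or None if there are no phrases."""
    counts = locale_counts(response)
    return counts.most_common(1)[0][0] if counts else None


# The first two phrases come from the sample response above; the ja-JP phrase
# is a made-up entry that shows a mixed-language result.
sample = {
    "phrases": [
        {"text": "Hello, thank you for calling Contoso.", "locale": "en-US"},
        {"text": "Who am I speaking with today?", "locale": "en-US"},
        {"text": "(hypothetical Japanese phrase)", "locale": "ja-JP"},
    ]
}

print(locale_counts(sample))
print(dominant_locale(sample))
```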

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with diarization enabled. Diarization distinguishes between different speakers in the conversation. The Speech service provides information about which speaker spoke a particular part of the transcribed speech.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"], 
    "diarization": {"maxSpeakers": 2,"enabled": true}}"'

Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property that should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The supported locales that you can specify are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the diarization property to recognize and separate multiple speakers in one audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription file then contains speaker entries for each transcribed phrase.

For more information about locales, diarization, and other properties for the fast transcription API, see the Request configuration options section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. In this example, diarization is enabled, so the response includes speaker information for each transcribed phrase. The combinedPhrases property contains the full transcriptions for all speakers in a single channel.

{
	"durationMilliseconds": 182439,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh. Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? Uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And. You're currently paying those loans off monthly, is that right? 
Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 960,
			"durationMilliseconds": 640,
			"text": "Good afternoon.",
			"words": [
				{
					"text": "Good",
					"offsetMilliseconds": 960,
					"durationMilliseconds": 240
				},
				{
					"text": "afternoon.",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 400
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 1600,
			"durationMilliseconds": 640,
			"text": "This is Sam.",
			"words": [
				{
					"text": "This",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "is",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 120
				},
				{
					"text": "Sam.",
					"offsetMilliseconds": 1960,
					"durationMilliseconds": 280
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 2240,
			"durationMilliseconds": 1040,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 2240,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 2440,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 2520,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 200
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 2840,
					"durationMilliseconds": 440
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 3280,
			"durationMilliseconds": 640,
			"text": "How can I help?",
			"words": [
				{
					"text": "How",
					"offsetMilliseconds": 3280,
					"durationMilliseconds": 120
				},
				{
					"text": "can",
					"offsetMilliseconds": 3440,
					"durationMilliseconds": 120
				},
				{
					"text": "I",
					"offsetMilliseconds": 3560,
					"durationMilliseconds": 40
				},
				{
					"text": "help?",
					"offsetMilliseconds": 3600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5040,
			"durationMilliseconds": 400,
			"text": "Hi there.",
			"words": [
				{
					"text": "Hi",
					"offsetMilliseconds": 5040,
					"durationMilliseconds": 240
				},
				{
					"text": "there.",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 160
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 5440,
			"durationMilliseconds": 800,
			"text": "My name is Mary.",
			"words": [
				{
					"text": "My",
					"offsetMilliseconds": 5440,
					"durationMilliseconds": 80
				},
				{
					"text": "name",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5640,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 5720,
					"durationMilliseconds": 520
				}
			],
			"locale": "en-US",
			"confidence": 0.93616915
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 0,
			"speaker": 0,
			"offsetMilliseconds": 180320,
			"durationMilliseconds": 680,
			"text": "Thank you for your help.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 180320,
					"durationMilliseconds": 160
				},
				{
					"text": "you",
					"offsetMilliseconds": 180480,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 180560,
					"durationMilliseconds": 120
				},
				{
					"text": "your",
					"offsetMilliseconds": 180680,
					"durationMilliseconds": 120
				},
				{
					"text": "help.",
					"offsetMilliseconds": 180800,
					"durationMilliseconds": 200
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		},
		{
			"channel": 0,
			"speaker": 1,
			"offsetMilliseconds": 181960,
			"durationMilliseconds": 280,
			"text": "Thank you.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 181960,
					"durationMilliseconds": 200
				},
				{
					"text": "you.",
					"offsetMilliseconds": 182160,
					"durationMilliseconds": 80
				}
			],
			"locale": "en-US",
			"confidence": 0.9314801
		}
    ]
}
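
The speaker values in a diarized response can be used to rebuild the conversation turn by turn. The following Python sketch merges consecutive phrases from the same speaker; the function name is hypothetical, and the sample is a shortened copy of the response above.

```python
def speaker_turns(response: dict) -> list[tuple[int, str]]:
    """Merge consecutive phrases by the same speaker into (speaker, text) turns."""
    turns: list[tuple[int, str]] = []
    for phrase in response.get("phrases", []):
        speaker = phrase["speaker"]
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous phrase: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + phrase["text"])
        else:
            # Speaker changed: start a new turn.
            turns.append((speaker, phrase["text"]))
    return turns


# Shortened copy of the diarized sample response above.
sample = {
    "phrases": [
        {"channel": 0, "speaker": 1, "text": "Good afternoon."},
        {"channel": 0, "speaker": 1, "text": "This is Sam."},
        {"channel": 0, "speaker": 0, "text": "Hi there."},
        {"channel": 0, "speaker": 0, "text": "My name is Mary."},
    ]
}

for speaker, text in speaker_turns(sample):
    print(f"Speaker {speaker}: {text}")
```

The speaker numbers are labels assigned per request, so they identify who spoke within one transcription rather than a known person.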

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"], 
    "channels": [0,1]}"'

Erstellen Sie die Formulardefinition gemäß den folgenden Anweisungen:

Legen Sie die optionale (aber empfohlene) Eigenschaft locales fest, die mit dem erwarteten Gebietsschema der zu transkribierenden Audiodaten übereinstimmen sollte. In diesem Beispiel ist das Gebietsschema auf en-US festgelegt. Die unterstützten Gebietsschemata, die Sie angeben können sind: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, und zh-CN.
Legen Sie die Eigenschaft channels fest, um die nullbasierten Indizes der Kanäle anzugeben, die separat transkribiert werden sollen. Es werden bis zu zwei Kanäle unterstützt, es sei denn, die Diarisierung ist aktiviert. In diesem Beispiel werden die Kanäle 0 und 1 angegeben.

For more information about locales, channels, and other properties for the fast transcription API, see the Request configuration options section later in this guide.
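
The instructions above can be sketched in Python as a small helper that builds the JSON string for the definition form field. The helper name build_definition and its client-side checks are assumptions made for this illustration; they aren't part of the Speech service or of any SDK, and the service itself remains the authority on which values it accepts. The sketch only prepares the JSON; sending it is left to whichever HTTP client you use.

```python
import json


def build_definition(locales=None, channels=None):
    """Return the JSON string for the `definition` form field.

    Hypothetical helper for illustration only; not part of any SDK.
    """
    definition = {}
    if locales:
        # Optional but recommended: the expected locale(s) of the audio.
        definition["locales"] = list(locales)
    if channels is not None:
        channels = list(channels)
        # Zero-based indexes; this guide describes support for up to two channels.
        if not channels or len(channels) > 2:
            raise ValueError("specify one or two channel indexes")
        if any(not isinstance(c, int) or c < 0 for c in channels):
            raise ValueError("channel indexes must be zero-based integers")
        if len(set(channels)) != len(channels):
            raise ValueError("channel indexes must be unique")
        definition["channels"] = channels
    return json.dumps(definition)


# Same values as in the curl example above.
print(build_definition(["en-US"], [0, 1]))
```

Rejecting an invalid channel list before the request is sent is a design choice of this sketch, so that obvious mistakes surface without a round trip to the service.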

The response includes durationMilliseconds, offsetMilliseconds, and more. The channel property identifies the channel when the audio file contains multiple channels. The combinedPhrases property contains the full transcriptions, separated by audio channel. Look for "channel": 0,"text" and "channel": 1,"text" to find the full transcription for each channel.

{
	"durationMilliseconds": 185079,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
		},
		{
			"channel": 1,
			"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"offsetMilliseconds": 720,
			"durationMilliseconds": 480,
			"text": "Hello.",
			"words": [
				{
					"text": "Hello.",
					"offsetMilliseconds": 720,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 1200,
			"durationMilliseconds": 1120,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offsetMilliseconds": 1200,
					"durationMilliseconds": 200
				},
				{
					"text": "you",
					"offsetMilliseconds": 1400,
					"durationMilliseconds": 80
				},
				{
					"text": "for",
					"offsetMilliseconds": 1480,
					"durationMilliseconds": 120
				},
				{
					"text": "calling",
					"offsetMilliseconds": 1600,
					"durationMilliseconds": 240
				},
				{
					"text": "Contoso.",
					"offsetMilliseconds": 1840,
					"durationMilliseconds": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 2320,
			"durationMilliseconds": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offsetMilliseconds": 2320,
					"durationMilliseconds": 160
				},
				{
					"text": "am",
					"offsetMilliseconds": 2480,
					"durationMilliseconds": 80
				},
				{
					"text": "I",
					"offsetMilliseconds": 2560,
					"durationMilliseconds": 80
				},
				{
					"text": "speaking",
					"offsetMilliseconds": 2640,
					"durationMilliseconds": 320
				},
				{
					"text": "with",
					"offsetMilliseconds": 2960,
					"durationMilliseconds": 160
				},
				{
					"text": "today?",
					"offsetMilliseconds": 3120,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offsetMilliseconds": 9520,
			"durationMilliseconds": 400,
			"text": "Hi, Mary.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 9520,
					"durationMilliseconds": 80
				},
				{
					"text": "Mary.",
					"offsetMilliseconds": 9600,
					"durationMilliseconds": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		// More transcription results...
	    // Redacted for brevity
		{
			"channel": 1,
			"offsetMilliseconds": 4480,
			"durationMilliseconds": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offsetMilliseconds": 4480,
					"durationMilliseconds": 400
				},
				{
					"text": "my",
					"offsetMilliseconds": 4880,
					"durationMilliseconds": 120
				},
				{
					"text": "name",
					"offsetMilliseconds": 5000,
					"durationMilliseconds": 120
				},
				{
					"text": "is",
					"offsetMilliseconds": 5120,
					"durationMilliseconds": 160
				},
				{
					"text": "Mary",
					"offsetMilliseconds": 5280,
					"durationMilliseconds": 240
				},
				{
					"text": "Rondo.",
					"offsetMilliseconds": 5520,
					"durationMilliseconds": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		{
			"channel": 1,
			"offsetMilliseconds": 6080,
			"durationMilliseconds": 1920,
			"text": "I'm trying to enroll myself with Contuso.",
			"words": [
				{
					"text": "I'm",
					"offsetMilliseconds": 6080,
					"durationMilliseconds": 160
				},
				{
					"text": "trying",
					"offsetMilliseconds": 6240,
					"durationMilliseconds": 200
				},
				{
					"text": "to",
					"offsetMilliseconds": 6440,
					"durationMilliseconds": 80
				},
				{
					"text": "enroll",
					"offsetMilliseconds": 6520,
					"durationMilliseconds": 200
				},
				{
					"text": "myself",
					"offsetMilliseconds": 6720,
					"durationMilliseconds": 360
				},
				{
					"text": "with",
					"offsetMilliseconds": 7080,
					"durationMilliseconds": 120
				},
				{
					"text": "Contuso.",
					"offsetMilliseconds": 7200,
					"durationMilliseconds": 800
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		// More transcription results...
	    // Redacted for brevity
    ]
}
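
As a rough sketch of how such a response might be consumed, the following Python code reads the full text per channel from combinedPhrases and orders the individual phrases from both channels by their start offset to approximate the flow of the conversation. The function names are hypothetical, and the sample is a trimmed-down dictionary with the same shape as the response above (the response as printed here is abbreviated with // comments, so it can't be parsed as JSON as is). Falling back to channel 0 when a phrase has no channel property is an assumption of this sketch.

```python
def full_text_by_channel(response):
    """Map each channel index to its full transcription from combinedPhrases."""
    # Assumption: entries without a "channel" property are treated as channel 0.
    return {item.get("channel", 0): item["text"] for item in response["combinedPhrases"]}


def conversation(response):
    """Return (offset_ms, channel, text) tuples for all phrases, ordered by start time."""
    turns = [
        (p["offsetMilliseconds"], p.get("channel", 0), p["text"])
        for p in response["phrases"]
    ]
    return sorted(turns)


# Trimmed-down sample with the same shape as the response above.
sample = {
    "durationMilliseconds": 185079,
    "combinedPhrases": [
        {"channel": 0, "text": "Hello. Thank you for calling Contoso."},
        {"channel": 1, "text": "Hi, my name is Mary Rondo."},
    ],
    "phrases": [
        {"channel": 0, "offsetMilliseconds": 720, "durationMilliseconds": 480,
         "text": "Hello."},
        {"channel": 0, "offsetMilliseconds": 1200, "durationMilliseconds": 1120,
         "text": "Thank you for calling Contoso."},
        {"channel": 0, "offsetMilliseconds": 9520, "durationMilliseconds": 400,
         "text": "Hi, Mary."},
        {"channel": 1, "offsetMilliseconds": 4480, "durationMilliseconds": 1600,
         "text": "Hi, my name is Mary Rondo."},
    ],
}

for offset_ms, channel, text in conversation(sample):
    print(f"{offset_ms / 1000:7.2f}s  channel {channel}: {text}")
```

Sorting by offsetMilliseconds interleaves the two channels, so the caller's phrase on channel 1 lands between the agent's phrases on channel 0.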

Request configuration options

Here are some property options for configuring a transcription when you call the Transcriptions - Transcribe operation.

Property	Description	Required or optional
`channels`	The list of zero-based indexes of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If that isn't what you want, the channels can be transcribed independently without being merged. To transcribe the channels of a stereo audio file separately, you need to specify `[0,1]`, `[0]`, or `[1]`. Otherwise, stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, you can't set the `channels` property to `[0,1]`. The Speech service doesn't support diarization across multiple channels. For mono audio, the `channels` property is ignored, and the audio is always transcribed as a single channel.	Optional
`diarization`	The diarization configuration. Diarization is the process of recognizing and separating multiple speakers in one audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription then contains `speaker` entries (such as `"speaker": 0` or `"speaker": 1`) for each transcribed phrase.	Optional
`locales`	The list of locales that should match the expected locale of the audio data that you want to transcribe. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency. If a single locale is specified, that locale is used for transcription. If you're not sure about the locale, you can specify multiple locales. Language identification might be more accurate with a more precise list of candidate locales. If you don't specify any locale, or the locales that you specify aren't in the audio file, the Speech service still tries to identify the language. If the language can't be identified, an error is returned. The supported locales that you can specify are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. You can get the latest supported languages via the Transcriptions - List Supported Locales REST API. For more information about locales, see the Language support for the Speech service documentation.	Optional, but recommended if you know the expected locale.
`profanityFilterMode`	Specifies how to handle profanity in recognition results. Accepted values are `None` (disables the profanity filter), `Masked` (replaces profanity with asterisks), `Removed` (removes profanity from the result), and `Tags` (adds profanity tags). The default value is `Masked`.	Optional
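
The constraints in the table can be illustrated with a small client-side check written in Python. This is only a sketch under stated assumptions: the function name check_definition is hypothetical, the rules are taken from the descriptions above, and the Speech service performs its own validation regardless of what a client checks beforehand.

```python
ALLOWED_PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def check_definition(definition):
    """Return a list of problems found in a definition dict (empty if none).

    Hypothetical helper for illustration only; not part of any SDK.
    """
    problems = []
    channels = definition.get("channels")
    diarization = definition.get("diarization") or {}

    # Up to two channels can be transcribed separately.
    if channels is not None and len(channels) > 2:
        problems.append("at most two channels can be transcribed separately")

    # Diarization across multiple channels isn't supported, so [0,1] is rejected.
    if diarization.get("enabled") and channels is not None and len(channels) > 1:
        problems.append("diarization can't be combined with more than one channel")

    # The table lists Masked as the default when the property is omitted.
    mode = definition.get("profanityFilterMode", "Masked")
    if mode not in ALLOWED_PROFANITY_MODES:
        problems.append(f"unknown profanityFilterMode: {mode!r}")

    return problems


ok = {"locales": ["en-US"], "diarization": {"maxSpeakers": 2, "enabled": True}}
bad = {"channels": [0, 1], "diarization": {"maxSpeakers": 2, "enabled": True}}
print(check_definition(ok))
print(check_definition(bad))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a definition at once.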
