Phoneme segmentation
In a recent comment, James Salsman wrote “SAPI 4.0a had phoneme segmentation” and asked that we put it back into our newer APIs. (You can see more about SAPI 4 here.)
It’s been a long time since we made an API with this functionality. I’m curious to know whether anybody else would like to see this, and just as importantly, what scenarios it would enable for you.
Comments
- Anonymous
February 22, 2005
By phoneme segmentation, do you mean I would say "hello" and it would return a string like "HH AH L OW"? That would be kind of like the InkDivider class in the Tablet API.
If so... how about for speech therapy or learning a new language? I could speak the word the way I thought it was pronounced, and SAPI could return the phonemes it heard. Then my app could compare them against what it expected and correct me. If it also returned which phoneme was stressed, along with confidence, then you could do cooler stuff, e.g. give the speaker some rating of how much they sound like a native speaker.
Or if I could get access to the phonemes as wav segments (or timings) for the actual wav, then maybe I could muck with those wav segments and write some sort of speaker verification biometric?
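To make the comparison idea above concrete, here is a minimal Python sketch of what an application might do with a hypothetical segmentation result. The PhonemeSegment shape, the phoneme labels, the timings, and the confidence values are all illustrative assumptions, not an existing SAPI structure.

```python
# A minimal sketch of the speech-therapy scenario described in the comment
# above. The PhonemeSegment shape, phoneme labels, timings, and confidence
# values are hypothetical illustrations, not an existing SAPI structure.
from dataclasses import dataclass

@dataclass
class PhonemeSegment:
    phoneme: str       # e.g. "HH", "AH", "L", "OW"
    start: float       # segment start within the wav, in seconds
    end: float         # segment end, in seconds
    confidence: float  # 0.0 .. 1.0

def pronunciation_feedback(expected, heard):
    """Compare the expected phoneme sequence against what was heard.
    Returns a 0..1 score plus the mismatches to show the learner.
    (A real app would align the sequences first, e.g. with edit distance,
    rather than comparing position by position.)"""
    mismatches = []
    matches = 0
    for i, exp in enumerate(expected):
        got = heard[i].phoneme if i < len(heard) else None
        if got == exp:
            matches += 1
        else:
            mismatches.append((i, exp, got))
    return matches / max(len(expected), 1), mismatches

# Hypothetical result for a learner saying "hello" with the wrong vowel:
heard = [
    PhonemeSegment("HH", 0.00, 0.08, 0.92),
    PhonemeSegment("AA", 0.08, 0.21, 0.61),  # expected "AH"
    PhonemeSegment("L",  0.21, 0.30, 0.88),
    PhonemeSegment("OW", 0.30, 0.47, 0.95),
]
score, errors = pronunciation_feedback(["HH", "AH", "L", "OW"], heard)
print(f"score={score:.2f}, mispronounced={errors}")
```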
- Anonymous
February 23, 2005
Casey mentioned the educational uses that I'm most interested in. Being able to teach beginning reading or a second language more effectively is very important. With phoneme segmentation, you can zoom in on pronunciation errors and compare likely alternatives to each phoneme in a word, pinpointing mispronunciations. My product (www.readsay.com/pro.htm) does this.
However, there are other, industrial applications. Suppose you have a vocabulary of 1,000 names, 10 pairs of which are similar enough that wrong numbers are unacceptably frequent. You can use phoneme segmentation to "zoom in" on the most discriminant portion of each of the 20 suspect names, using a second-pass recognition on that discriminant segment (and the neighboring phonemes; in my experience, trying to do recognition on a single phoneme segment hardly ever works) to make the final decision.
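A rough sketch of the "zoom in" idea, under the assumption that each name's pronunciation is available as a phoneme sequence. The names and phoneme strings below are made up; the point is locating the differing stretch, plus neighboring phonemes, for a second recognition pass.

```python
# A rough sketch of the two-pass idea: find where two confusable name
# pronunciations differ, widen by neighboring phonemes, and re-recognize
# only that stretch. The names and phoneme strings are made up.
import difflib

def discriminant_span(phones_a, phones_b, context=1):
    """Return the (start, end) phoneme index range in phones_a where the two
    pronunciations differ, widened by `context` phonemes on each side."""
    matcher = difflib.SequenceMatcher(None, phones_a, phones_b)
    diff_idx = [i for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag != "equal"
                for i in range(i1, max(i2, i1 + 1))]
    if not diff_idx:
        return None  # identical pronunciations; nothing to zoom in on
    start = max(0, min(diff_idx) - context)
    end = min(len(phones_a), max(diff_idx) + 1 + context)
    return start, end

# Two confusable (made-up) name pronunciations differing in one region:
a = ["JH", "AA", "N", "AE", "N", "D", "ER", "S", "AH", "N"]
b = ["JH", "AA", "N", "HH", "EH", "N", "D", "ER", "S", "AH", "N"]
print(discriminant_span(a, b))  # (2, 5): the differing phoneme plus neighbors
```

With phoneme timings for the caller's utterance, that index range would identify the slice of audio to hand to a second-pass recognizer whose grammar contains only the two confusable names.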
Another example is the speech scientists who use recognition to automate what would otherwise be manual transcription. It is much easier and more reliable to have two humans proofread the same imperfect machine transcription (and resolve any conflicts with a third) than to have humans do the entire transcription by hand.
For a completely unrelated example, phoneme segmentation can help animators automate mouth movements keyed to a voice recording. The dollar size of the animation industry is probably bigger than that of the speech-based educational software, IVR, and speech science industries combined.
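To illustrate the lip-sync use, here is a minimal sketch that turns phoneme segments with timings into viseme keyframes. The phoneme-to-viseme table is a small made-up sample, not a production mapping.

```python
# A minimal lip-sync sketch, assuming phoneme segments with timings are
# available. The phoneme-to-viseme table is a small made-up sample, not a
# production mapping.
VISEME_FOR_PHONEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip-teeth", "V": "lip-teeth",
    "AA": "open", "AH": "open", "AE": "open",
    "OW": "rounded", "UW": "rounded",
    "L": "tongue-up", "T": "tongue-up", "D": "tongue-up", "N": "tongue-up",
    "HH": "neutral", "S": "narrow", "Z": "narrow",
}

def viseme_track(segments):
    """Turn (phoneme, start, end) tuples into (viseme, start, end) keyframes
    that an animation tool could consume."""
    return [(VISEME_FOR_PHONEME.get(phoneme, "neutral"), start, end)
            for phoneme, start, end in segments]

# "hello" again, with the hypothetical timings from the earlier sketch:
print(viseme_track([("HH", 0.00, 0.08), ("AH", 0.08, 0.21),
                    ("L", 0.21, 0.30), ("OW", 0.30, 0.47)]))
```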