How do you pronounce the digit "1"?
This is an interesting, and challenging, problem for TTS systems. So, how should you pronounce "1"? Should an English TTS system simply say "one" whenever it encounters "1"? (The answer is "no", and here's why.)
Let's look at the following English sentences:
(a) I have 1 friend. ("i have one friend")
(b) Can you meet me on 1/3/05? ("can you meet me on january third two thousand and five?")
(c) My birthday is on March 1. ("my birthday is on march first.")
In sentence (a), we read the digit "1" as "one". But in sentence (b), the "1" is spoken as "January" because it is the month field of a date. And in sentence (c), the "1" takes an ordinal reading and is pronounced "first".
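To make the three readings concrete, here is a minimal sketch in Python of context rules that pick a word for the "1" in sentences (a), (b), and (c). The function name `expand_one` and the regular expressions are illustrative assumptions, not how any particular TTS system works, and the date rule assumes month/day/year order, as in the reading given for (b).

```python
import re

# Month names, used to spot "March 1"-style dates.
MONTHS = ("january|february|march|april|may|june|july|"
          "august|september|october|november|december")

def expand_one(sentence):
    """Pick a reading for the digit "1" from its immediate context.

    Hypothetical helper for illustration; it only covers the three
    patterns discussed in the text.
    """
    # Rule 1: "1" is the first field of a numeric date (assumed to be
    # month/day/year), so it names the month.
    if re.search(r"\b1/\d{1,2}/\d{2,4}\b", sentence):
        return "january"
    # Rule 2: "1" directly follows a month name, so it is an ordinal day.
    if re.search(r"\b(?:%s)\s+1\b" % MONTHS, sentence, re.IGNORECASE):
        return "first"
    # Rule 3: a bare "1" falls back to the cardinal reading.
    if re.search(r"\b1\b", sentence):
        return "one"
    return None  # no standalone "1" in the sentence

print(expand_one("I have 1 friend."))            # one
print(expand_one("Can you meet me on 1/3/05?"))  # january
print(expand_one("My birthday is on March 1."))  # first
```

The order of the rules matters: the more specific date patterns are tried before the cardinal fallback.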
This problem of properly disambiguating written text into correct word expansions is known as text normalization. For TTS systems, it is hard because the correct expansion depends on the context. Digits are the most frequently affected, but other kinds of patterns are also very problematic (I'll save those for another post).
So, the digit "1" has several pronunciations in English, as we just saw. But what about another language, such as Spanish?
Let's look at similar Spanish sentences:
(d) Yo tengo 1 amigo. ("i have one friend") - "1" pronounced as "un"
(e) Yo tengo 1 amiga. ("i have one friend") - "1" pronounced as "una"
(f) Yo tengo 1. ("i have one") - "1" pronounced as "uno"
In these examples, the "1" can take on three different pronunciations! It's not the semantic context (e.g., "date", "time", "fraction") that requires disambiguation; rather, what matters is the gender of the word that the "1" modifies, or the part of speech of the "1" itself. So, in sentence (d) the pronunciation is "un" because the following noun is masculine, but in sentence (e) the pronunciation is "una" because "amiga" is feminine. But in (f), the pronunciation is "uno" because the "1" is itself acting as a noun.
Can you see why proper identification of part of speech is so important for text normalization?
You might be thinking of how to write a rule to capture the distinctions in sentences (d), (e), and (f). For example: if a noun follows the digit "1", read the "1" as "un" or "una" according to the noun's gender; otherwise, read the "1" as "uno". But can your rule account for the following long-distance dependency?
(g) De todas las chicas en la clase, hay 1 que me gusta.
Does your TTS system get (g) correct? (Note: the "1" should be read out as "una", agreeing with "chicas" earlier in the sentence.)
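Here is a rough Python sketch of the simple rule described above (look at the word right after the "1"), to show where it breaks. The `GENDER` lexicon and the function `read_uno_naive` are made up for illustration; a real system would need a full lexicon and a part-of-speech tagger rather than a two-word dictionary.

```python
import re

# Hypothetical toy lexicon: noun -> grammatical gender ("m" or "f").
GENDER = {"amigo": "m", "amiga": "f"}

def read_uno_naive(sentence):
    """Naive rule: inspect only the word immediately after "1"."""
    tokens = re.findall(r"\w+", sentence.lower())
    if "1" not in tokens:
        return None
    i = tokens.index("1")
    following = tokens[i + 1] if i + 1 < len(tokens) else None
    gender = GENDER.get(following)
    if gender == "f":
        return "una"
    if gender == "m":
        return "un"
    # No known noun follows, so treat "1" as standing alone.
    return "uno"

print(read_uno_naive("Yo tengo 1 amigo."))  # un
print(read_uno_naive("Yo tengo 1 amiga."))  # una
print(read_uno_naive("Yo tengo 1."))        # uno
# Sentence (g): the rule answers "uno", but the correct reading is "una".
print(read_uno_naive("De todas las chicas en la clase, hay 1 que me gusta."))
```

The rule handles (d), (e), and (f), but in (g) the word after the "1" is "que", so it falls through to "uno". The noun that decides the gender, "chicas", sits several words before the digit, which a next-word rule never sees.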