Why can't we strip the diacritics?
We have some "best-fit" behavior which we generally consider to be "bad". Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything). Assuming you can't use Unicode, why is it so bad to just make everything ASCII-like? Maybe you have a publishing house or direct marketing firm that can't handle Unicode, so you'll just get rid of those annoying decorations.
In American English the diacritics are effectively quaint decorations. Many people naïvely assume that when Word auto-corrects naive to naïve, it's just a prettiness factor. When they resume spell checking their résumé, the diacritics become more important. In English it's fair to spell résumé as resume, but it seems cooler to add the accents. Since we stole (borrowed is more politically correct) the word from French, we have a French-like pronunciation of résumé, and aren't likely to confuse it with resume.
In most other languages diacritics aren't optional. You wouldn't exchange a z with an s in English just because they look similar. "A real singer" is a lot different from "a real zinger".
Recently I encountered the following example: a user wanted to get around those pesky diacritics by mapping to ASCII.
The suggested input was:
último año de carrera
The desired output was:
ultimo ano de carrera
My Spanish is nearly non-existent; however, Word's spell checker tells me these are all legitimate Spanish words, even without the accents. The meaning goes from something like "the last year of the race" to "I completed the anus of the race."
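To make the loss concrete, here is a minimal Python sketch of one common way people strip diacritics: decompose the text (Unicode normalization form NFD) and throw away the combining marks. This is only an illustration, not the "best-fit" mapping itself, and `strip_diacritics` is a hypothetical helper name invented for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks.

    This is lossy on purpose, to show what gets thrown away.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = "último año de carrera"
stripped = strip_diacritics(original)

print(stripped)              # ultimo ano de carrera
print(stripped == original)  # False: the text no longer says the same thing
```

Nothing in the output tells you that "ano" used to be "año", so the change can't be undone afterwards.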
Now imagine that you're trying to reach a new market and you do that to your customers' names, or potential customers' names. How long will they remain your customers?
- Shawn
Comments
Anonymous
August 06, 2007
I'm reading some data from my database and writing it to a CSV file using a StreamWriter. I have diacritics in some of the fields, e.g. Château, Viña. When I view the CSV file in Notepad, it shows the diacritics correctly, but when I view it in Excel it shows some funny characters, like Château, Viña. What is the reason for this, and how can I overcome this problem?
Anonymous
September 10, 2007
StreamWriter should be using UTF-8 as the default output (which is good, don't change that). Notepad recognizes that and shows you the UTF-8 data, but it sounds like Excel isn't doing that. I think that if you change "File Origin:" in Excel to "65001: Unicode (UTF-8)" (they sort alphabetically by encoding name), you'll solve your problem. I don't work on Office, so I'm not sure, nor do I know if this is the same in all versions.
Anonymous
June 26, 2008
When I receive e-mail that has Unicode (UTF-8) in the heading at the right, I can never download it.
Anonymous
June 27, 2008
Bummer, is your email client a Microsoft product? (I could pass along a bug.) Email unfortunately is a bit picky about encodings and code pages. UTF-8 isn't particularly special in this way. Some phones, for example, only support certain encodings. The good news is that the IETF has an EAI (Email Address Internationalization) working group - http://www.ietf.org/html.charters/eai-charter.html The EAI is working toward enabling UTF-8 email throughout the system, including the local part and body of the email. It'll take a while for the standard to get implemented, but at least it's progress.
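Going back to the CSV question from August 2007 above: the "funny characters" look like UTF-8 bytes being read as Windows-1252. The sketch below reproduces that in Python rather than .NET, then writes a small CSV with a UTF-8 byte order mark using Python's `utf-8-sig` codec. It assumes the original file was UTF-8 without a BOM and that the Excel version in use honors the BOM. Neither assumption is confirmed in the thread, and the file name is made up for the example.

```python
import codecs
import csv
import os
import tempfile

# Reproduce the symptom: UTF-8 bytes decoded as Windows-1252 (cp1252).
utf8_bytes = "Château, Viña".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # ChÃ¢teau, ViÃ±a

# Write the CSV with a leading UTF-8 BOM ("utf-8-sig") so a reader
# that checks for a BOM can tell the file is UTF-8.
path = os.path.join(tempfile.mkdtemp(), "wines.csv")  # hypothetical file name
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows([["Château", "Viña"]])

# Confirm that the file starts with the BOM bytes EF BB BF.
with open(path, "rb") as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))  # True
```

The data is still stored as UTF-8 either way, so nothing is lost. The BOM is only a hint to the program opening the file.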