Why can't we strip the diacritics?

We have some "best-fit" behavior which we generally consider to be "bad".  Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything).  Assuming you can't use Unicode, why is it so bad to just make everything ASCII-like?  Maybe you have a publishing house or direct marketing firm that can't handle Unicode, so you'll just get rid of those annoying decorations.

In American English the diacritics are effectively quaint decorations.  Many people naïvely assume that when Word auto-corrects naive to naïve, this is just a prettiness factor.  When they resume spell checking their résumé, the diacritics become more important.  In English it's fair to spell résumé as resume, but it seems cooler to add the accents.  Since we stole (borrowed is more politically correct) the word from French, we have a French-like pronunciation of résumé, and aren't likely to confuse it with resume.

In most other languages diacritics aren't optional.  You wouldn't exchange a z with an s in English just because they look similar.  "A real singer" is a lot different than "a real zinger".

Recently I encountered the following example: a user wanted to get around those pesky diacritics by mapping to ASCII.

The suggested input was:
    último año de carrera

The desired output was:
    ultimo ano de carrera

My Spanish is nearly non-existent; however, Word's spell checker tells me these are all legitimate Spanish words, even without the accents.  The meaning goes from something like "the last year of the race" to "I completed the anus of the race."
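
For the curious, here is a minimal Python sketch of one common way this kind of stripping gets done: decompose the text (NFD) and throw away the combining marks.  I don't know how the user actually planned to do the mapping, so treat this as an assumption, and note that strip_diacritics is just a hypothetical helper name for illustration.  The point is that the operation is lossy: two different Spanish words end up as the same string.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks.

    This is lossy by design; the original accents cannot be recovered.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = "último año de carrera"
stripped = strip_diacritics(original)
print(stripped)  # ultimo ano de carrera

# "año" (year) and "ano" (anus) are different words, but after stripping
# they are indistinguishable.
print(strip_diacritics("año") == strip_diacritics("ano"))  # True
```

Once the data is stored this way, nothing downstream can tell which word was meant.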

Now imagine that you're trying to reach a new market and you do that to your customers' names or potential customers' names.  How long will they remain your customers?

- Shawn

Comments

  • Anonymous
    August 06, 2007
    I'm reading some data from my database and writing it to a CSV file using a StreamWriter. I have diacritics in some of the fields, e.g. Château, Viña. When I view the CSV file in Notepad it shows the diacritics correctly, but when I view it in Excel it shows some funny characters: Château, Viña. What is the reason for this, and how can I overcome this problem?

  • Anonymous
    September 10, 2007
    StreamWriter should be using UTF-8 as the default output (which is good, don't change that).  Notepad recognizes that and shows you the UTF-8 data, but it sounds like Excel isn't doing that.  I think that if you change "File Origin:" in Excel to "65001: Unicode (UTF-8)" (they sort alphabetically by encoding name), you'll solve your problem.  I don't work for Office, so I'm not sure, nor do I know if this is the same in all versions.
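
    To see where those funny characters come from, here is a small Python sketch (Python rather than .NET, purely for illustration).  It assumes the reader is interpreting the UTF-8 bytes as Windows-1252, which matches the garbling shown in the question.  The last part writes the same text with a UTF-8 byte order mark; a BOM often helps spreadsheet tools detect UTF-8, but whether a given Excel version honors it is an assumption I haven't verified.

```python
import codecs

text = "Château, Viña"

# The writer emits UTF-8 bytes (correct, and lossless).
utf8_bytes = text.encode("utf-8")

# A reader that assumes Windows-1252 turns each two-byte UTF-8 sequence
# into two separate characters.
misread = utf8_bytes.decode("cp1252")
print(misread)  # Château, Viña

# Decoding with the right encoding recovers the text; nothing was lost.
print(utf8_bytes.decode("utf-8") == text)  # True

# "utf-8-sig" prefixes the output with a byte order mark that a reader
# can use to detect UTF-8.
with_bom = text.encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8))  # True
```

    The data in the file is fine either way; the problem is only which encoding the reader guesses.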

  • Anonymous
    June 26, 2008
    When I receive e-mail that has Unicode (UTF-8) at the right heading, I can never download it.

  • Anonymous
    June 27, 2008
    Bummer, is your email client a Microsoft product? (I could pass along a bug.)  Email unfortunately is a bit picky about encodings and code pages.  UTF-8 isn't particularly special in this way.  Some phones, for example, only support certain encodings.  The good news is that the IETF has an EAI (Email Address Internationalization) working group - http://www.ietf.org/html.charters/eai-charter.html  The EAI is working toward enabling UTF-8 email throughout the system, including the local part and body of the email.  It'll take a while for the standard to get implemented, but at least it's progress.