Weird F020-F0FF characters in Word’s RTF

People have been inquiring about Word RTF’s occasional use of the Unicode Private Use Area (PUA) characters in the range U+F020..U+F0FF. These codes are also used in WordprocessingML, which is defined by the ECMA-376 standard. This post explains what Word means by those characters. But first note a couple of things:

 

1) Unicode assigns no meaning to characters in the PUA, that is, those in the range U+E000..U+F8FF. So it’s up to a higher-level protocol to define the meaning. In general it’s a really bad idea to use the PUA if you’re interested in data interchange, because the program that reads such data may well display nothing or display completely different characters from the ones you intended. That’s why it’s called “private use”, something for you and your friends who are in cahoots with you.

 

2) The original syntax of an RTF control word defines the numeric parameter to be a signed 16-bit decimal number. For most control words that have a numeric parameter, Word does use a signed 16-bit decimal number. In particular, for the \uN Unicode control word, N has this format. If the high bit of a 16-bit number is 1, the number is negative and this is true for all codes in the range U+8000..U+FFFF. To get the RTF 16-bit signed decimal values, convert Unicode hex values to decimal and if greater than 32767, subtract 65536. Accordingly U+F020 is represented by \u-4064 and U+F0FF by \u-3841. It’s true that later on Word learned that 32-bit numbers exist and so some more recent RTF control words like \rsid (revision save IDs) have parameters much larger than 65536, let alone 32767 (the most positive 16-bit signed number). RichEdit even supports reading \uN with N being the decimal UTF-32 value corresponding to a surrogate pair (now isn’t that cool?!)
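The signed 16-bit arithmetic described above can be sketched in a few lines of Python. The function names here are made up for illustration and are not part of any RTF library; the sketch only covers code points that fit in 16 bits.

```python
def unicode_to_rtf_u(code_point: int) -> str:
    """Return the \\uN control word for a 16-bit code point, with N signed."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit code points fit in a signed 16-bit parameter")
    # Values above 32767 have the high bit set, so they wrap to negative numbers.
    n = code_point - 65536 if code_point > 32767 else code_point
    return f"\\u{n}"


def rtf_u_to_unicode(n: int) -> int:
    """Recover the code point from the N of a \\uN control word."""
    # Negative parameters are the wrapped form of U+8000..U+FFFF.
    return n + 65536 if n < 0 else n


print(unicode_to_rtf_u(0xF020))       # \u-4064
print(unicode_to_rtf_u(0xF0FF))       # \u-3841
print(hex(rtf_u_to_unicode(-4064)))   # 0xf020
```

A reader that accepts larger positive values of N, as RichEdit is said to do above, would need extra handling that this sketch leaves out.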

 

Given the strong recommendation not to use the PUA, why would Word nevertheless go ahead and use it? If the choice were made today, I seriously doubt that Word would, but back in 1995 when Word started switching to Unicode, it wasn’t so obvious. Furthermore, it solved a pesky problem with special non-Unicode fonts known as “symbol fonts”, or more precisely symbol-charset fonts. By their very definition, these fonts do not use Unicode code points. So while U+0041 stands for ‘A’ in a Unicode font, in a symbol-charset font like Wingdings it stands for whatever character has hex code 0041, namely the Wingdings glyph ✌ (a victory hand). You must agree that ✌ looks nothing like ‘A’, so the Word 97 folks decided to give it a distinct value, namely F000 + 41 = F041. This is also the value that Microsoft TrueType symbol-charset fonts use in the Unicode cmap (character-to-glyph mapping table). Often a symbol-charset character is defined by a SYMBOL field with a character code in the range 20 to FF.
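The F000 offset is simple enough to express directly. Here is a minimal Python sketch of the mapping in both directions; the helper names are hypothetical, and the range check just mirrors the 20 to FF range mentioned above.

```python
def symbol_code_to_pua(code: int) -> int:
    """Map a symbol-charset character code (0x20..0xFF) to U+F020..U+F0FF."""
    if not 0x20 <= code <= 0xFF:
        raise ValueError("symbol-charset codes are expected in 0x20..0xFF")
    return 0xF000 + code


def pua_to_symbol_code(code_point: int) -> int:
    """Undo the offset; values outside U+F020..U+F0FF pass through unchanged."""
    if 0xF020 <= code_point <= 0xF0FF:
        return code_point - 0xF000
    return code_point


print(hex(symbol_code_to_pua(0x41)))    # 0xf041
print(hex(pua_to_symbol_code(0xF041)))  # 0x41
```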

 

A key point here is that Word RTF may treat any symbol-charset character this way, so merely getting a character in the range U+F020..U+F0FF does not mean you know which symbol-charset font is involved. For that you need to find the last symbol-charset font control word \fN, look up font N in the font table and find its face name. The charset is specified by the \fcharsetN control word and the symbol-charset is N = 2. In contrast, RichEdit does not use U+F020..U+F0FF for characters in symbol-charset fonts; it uses the native values 0020 through 00FF, and both RichEdit and Word read the resulting RTF just fine. In many cases Word, too, uses the range 0020 through 00FF for symbol-charset font characters, so Word's use of F020 through F0FF isn't exclusive.
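To illustrate the font-table lookup, here is a rough Python sketch that picks out the symbol-charset fonts from a font table. It is deliberately simplified: it assumes flat font entries of the form {\fN ... \fcharsetN Name;} with no nested groups, which real Word output does not guarantee, and the function name is made up. A real reader would use a proper RTF parser and track the current \fN while walking the text.

```python
import re

# One flat font-table entry: {\fN <control words> Face Name;}
FONT_ENTRY = re.compile(r"\{\\f(\d+)([^{};]*);\}")
# A control word: backslash, letters, optional signed number, optional space.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")

SYMBOL_CHARSET = 2  # the value of \fcharsetN for symbol-charset fonts


def symbol_fonts(font_table: str) -> dict[int, str]:
    """Return {font number: face name} for entries that use \\fcharset2."""
    fonts = {}
    for entry in FONT_ENTRY.finditer(font_table):
        number, body = int(entry.group(1)), entry.group(2)
        charset = re.search(r"\\fcharset(\d+)", body)
        if charset and int(charset.group(1)) == SYMBOL_CHARSET:
            # Whatever is left after stripping the control words is the face name.
            fonts[number] = CONTROL_WORD.sub("", body).strip()
    return fonts


table = (r"{\fonttbl{\f0\froman\fcharset0 Times New Roman;}"
         r"{\f1\fnil\fcharset2 Wingdings;}{\f2\froman\fcharset2 Symbol;}}")
print(symbol_fonts(table))  # {1: 'Wingdings', 2: 'Symbol'}
```

With that table in hand, a character in U+F020..U+F0FF (or in 0020..00FF) can be tied to a face name only once you know which \fN was in effect when it appeared.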

 

For math probably the most relevant symbol-charset font is the Symbol font itself, since it has most of the Greek letters used in math along with some useful math operators and operator pieces. But since Unicode has nearly 100 times as many math characters and includes all 224 characters in the Symbol font, the Symbol font is basically useless for math at this point in time. Read: avoid it if you can :-)
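As a small illustration of moving away from the Symbol font, the Python sketch below maps a handful of Symbol character codes to their Unicode equivalents. The table is a tiny excerpt chosen for illustration, not the full 224-character mapping, and the function name is hypothetical. It accepts either the native code or the F000-offset form.

```python
from typing import Optional

# A few Symbol-font codes and their Unicode equivalents (excerpt only).
SYMBOL_TO_UNICODE = {
    0x61: "\u03B1",  # 'a' position -> Greek small letter alpha
    0x62: "\u03B2",  # 'b' position -> Greek small letter beta
    0x64: "\u03B4",  # 'd' position -> Greek small letter delta
    0x67: "\u03B3",  # 'g' position -> Greek small letter gamma
    0x70: "\u03C0",  # 'p' position -> Greek small letter pi
}


def symbol_to_unicode(code: int) -> Optional[str]:
    """Return the Unicode character for a Symbol-font code, or None if unmapped."""
    if 0xF020 <= code <= 0xF0FF:
        code -= 0xF000  # strip the PUA offset to get the native code
    return SYMBOL_TO_UNICODE.get(code)


print(symbol_to_unicode(0xF061))  # α
print(symbol_to_unicode(0x70))    # π
```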

Comments

  • Anonymous
    January 23, 2008
    Perhaps Windows/Office should hide Symbol and like fonts by default in the future? I've seen many users who use Symbol font, unaware that the same glyphs they seek are available in the Unicode typefaces they normally type in. On a related note, the font list in the next version of Office could use some thinning. In 2007 + Vista, half of the fonts seem to be obsolete carryovers from earlier versions, non-script fonts, or intended for languages that the user does not use (e.g. symbol-charset fonts like Symbol, Marlett; raster fonts from Windows 3.0, such as Courier and Modern; legacy printer fonts, such as Univers; and scores of Arabic, Hebrew, etc. fonts that only have very rudimentary Roman letters.) This makes the list unwieldy and contributes to bad design, as users select faces that are not optimal for their script.

  • Anonymous
    January 23, 2008
    I agree that we need to do something to encourage users to use modern fonts. The Symbol font itself is an anachronism and should not be used any more. All the common Western fonts like Times New Roman and Arial have all the relevant characters, not to mention Cambria Math which goes light years beyond the Symbol font. Ditto for the other ancient fonts you mention. Thanks Murray

  • Anonymous
    January 24, 2008
    You might already know this, but Word actually has a built-in command to switch to the Symbol font (press CTRL+SHIFT+Q.) Depending on its use, it seems like a candidate for removal or replacement (with a "switch Roman to Greek letter in same font" command.)

  • Anonymous
    January 24, 2008
    While it makes sense not to use the Symbol font for Greek or even math characters, there will always be characters that are not in Unicode and, therefore, symbol fonts (lower-case 's' on purpose) are still needed. Fonts are not only a repository for characters used in human languages but an OS level mechanism for displaying interesting glyphs in text strings.
