Subscript and Superscript Bases

For proper math typography, it’s important to know the base of a subscript or superscript expression. For example, in Einstein’s equation E = mc², the superscript expression c² appears and c is the base, not mc. Knowing the base allows proper kerning of the base relative to the script (superscript or subscript) and provides more accurate semantics when interoperating with mathematical calculation engines.

This post describes the subscript/superscript base rules used by Word 2007 and RichEdit 6 in building up math text from the linear format. The rules are good, but not infallible, and users can overrule them either directly in the linear format or after they are built up into the Professional format.

Unicode math alphabetics: Ordinarily when a user types an ASCII letter or a Greek lower case letter α..ω (along with some variants), the letter is automatically converted to the corresponding Unicode math italic letter. These special mathematical letters, along with the basic set of Latin letters in Fraktur, script, and open-face math styles, are reserved for mathematical variables. Accordingly, if a subscript or superscript follows such a letter, that letter is considered to be the base. In linear format, if you type E=mc^2<space>, you get E = mc², where the letters are given by math italic characters (not used here in this blog post). In particular, c would be given by the math italic c, U+1D450, rather than by the ASCII c, U+0063. This single math italic c is the base of the superscript expression c². For more information on the math alphabetics, please see Section 2.1 of the Unicode Technical Report #25.
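The conversion described above can be sketched in a few lines of Python. This is only an illustration of the code point arithmetic, not the code Word or RichEdit uses; the function names `to_math_italic` and `italicize` are made up for this example, and the "variants" of the Greek letters are not handled.

```python
def to_math_italic(ch: str) -> str:
    """Map one ASCII letter or Greek lowercase letter to its math italic form.

    Hypothetical helper for illustration; any other character is returned unchanged.
    """
    if ch == "h":
        # Math italic small h is the Planck constant U+210E;
        # the slot U+1D455 in the math alphanumerics block is reserved.
        return "\u210e"
    if "a" <= ch <= "z":
        return chr(0x1D44E + ord(ch) - ord("a"))
    if "A" <= ch <= "Z":
        return chr(0x1D434 + ord(ch) - ord("A"))
    if "α" <= ch <= "ω":
        return chr(0x1D6FC + ord(ch) - ord("α"))
    return ch


def italicize(text: str) -> str:
    """Apply to_math_italic to every character of a string."""
    return "".join(to_math_italic(c) for c in text)


# The c in E=mc^2 becomes U+1D450; the digit and operators are left alone.
converted = italicize("E=mc^2")
```

Here `converted[3]` is the math italic c, U+1D450, while the `=`, `^`, and `2` keep their ASCII code points.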

Numbers: A consecutive string of ASCII digits is treated as a base. So in the expression 100², the 100 is the base of the superscript expression and has the mathematical meaning of “one hundred squared”. This quantity is typed in as 100^2.

ASCII letter strings: Since mathematical variables are almost always represented by math alphabetics, a consecutive string of ASCII letters is treated as a base. So in the superscript expression sin⁻¹, the base is “sin”. Actually this case is usually handled by the function name mechanism described next. You can enter an ASCII letter string by turning off the italic button before you type or by selecting the corresponding math italic letters and then turning off the italic button. Be sure to turn the italic button back on if you want to enter math italic variables.
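The three rules so far (a single math alphabetic, a run of ASCII digits, a run of ASCII letters) can be sketched as a function that looks backward from the script operator. This is a simplified illustration, not the actual build-up code: `find_base` and `is_math_alphabetic` are hypothetical names, the math alphabetic test covers only the letter range of the Mathematical Alphanumeric Symbols block plus U+210E, and any other trailing character is simply returned as is.

```python
import string


def is_math_alphabetic(ch: str) -> bool:
    """Rough test: letters of the Mathematical Alphanumeric Symbols block, plus U+210E."""
    return 0x1D400 <= ord(ch) <= 0x1D7CB or ch == "\u210e"


def find_base(text: str) -> str:
    """Return the part of `text` that would be the base of a following ^ or _."""
    if not text:
        return ""
    last = text[-1]
    if is_math_alphabetic(last):
        return last  # a single math variable is the base
    if last in string.digits:
        kind = string.digits  # consecutive ASCII digits form one base
    elif last in string.ascii_letters:
        kind = string.ascii_letters  # consecutive ASCII letters form one base
    else:
        return last  # other cases are not modeled in this sketch
    start = len(text)
    while start > 0 and text[start - 1] in kind:
        start -= 1
    return text[start:]


# With math italic m and c, only the c is the base; with ASCII digits, the whole run is.
base_of_mc = find_base("E=\U0001D45A\U0001D450")
base_of_100 = find_base("100")
```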

Function names: When a consecutive string of English alphabetics is typed followed by a space or bracket of some kind, the resulting math italic string is “folded” down to the corresponding ASCII letter string and compared to entries in a mathematical function dictionary. If found, the folded version of the string is used followed by the function-apply operator U+2061. The dictionary includes trigonometric functions like sin, cos, tan, etc., along with many other famous math function names. Users can modify this dictionary. If the function-apply operator is then followed by a subscript or superscript, that script is transferred to the function name, and the function name becomes the base of the script expression. This is handy for typing in expressions like sin⁻¹x.
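The folding and lookup step might look roughly like the following Python sketch. It assumes that Unicode NFKC normalization is an acceptable stand-in for the folding (it maps math italic letters back to their ASCII counterparts); the tiny `FUNCTION_NAMES` set and the name `build_function_name` are hypothetical, and the real dictionary is larger and user-editable.

```python
import unicodedata

FUNCTION_APPLY = "\u2061"  # invisible function-apply operator

# Hypothetical stand-in for the function dictionary.
FUNCTION_NAMES = {"sin", "cos", "tan"}


def build_function_name(typed: str) -> str:
    """Fold a math italic string to ASCII and, if it names a known function,
    return the folded name followed by the function-apply operator.
    Otherwise return the string unchanged."""
    folded = unicodedata.normalize("NFKC", typed)
    if folded in FUNCTION_NAMES:
        return folded + FUNCTION_APPLY
    return typed


# Math italic s, i, n (U+1D460, U+1D456, U+1D45B) fold to the ASCII name "sin".
italic_sin = "\U0001D460\U0001D456\U0001D45B"
result = build_function_name(italic_sin)
```

A string that is not in the dictionary, such as a run of math italic variables, comes back untouched and keeps its math italic letters.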

Embellished operators: If an operator character precedes a subscript or superscript, the operator is the base. For example, when a + is followed by a script 2 (typed as +^2 for a superscript), the + is the base.

Built-up math objects: If a built-up math object such as a stacked fraction precedes a subscript or superscript, that object is the base.

Superscripting a subscript object: Exceptions to the built-up object rule occur when superscripting a subscript object and when subscripting a superscript object. In both of these cases, the combination is turned into a subsup object, which has special typography, typically placing the superscript over the subscript.
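As a rough model of this exception, the build-up step can be pictured as merging the two scripts into a single node instead of nesting one script object inside another. The classes and function names below are hypothetical and only illustrate the idea; they are not the object model used by Word or RichEdit.

```python
from dataclasses import dataclass


@dataclass
class Sub:
    base: object
    sub: object


@dataclass
class Sup:
    base: object
    sup: object


@dataclass
class SubSup:
    base: object
    sub: object
    sup: object


def add_superscript(obj, script):
    """Superscripting a subscript object merges into a SubSup; anything else is a plain base."""
    if isinstance(obj, Sub):
        return SubSup(obj.base, obj.sub, script)
    return Sup(obj, script)


def add_subscript(obj, script):
    """Subscripting a superscript object merges into a SubSup; anything else is a plain base."""
    if isinstance(obj, Sup):
        return SubSup(obj.base, script, obj.sup)
    return Sub(obj, script)


# x_1 followed by ^2 becomes one subsup object with base x, not (x_1)^2.
merged = add_superscript(Sub("x", "1"), "2")
```

Either order of entry gives the same result, so `add_subscript(Sup("x", "2"), "1")` produces the same `SubSup` node.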

Opaque strings: Opaque strings are whatever is inside a \begin \end expression. Such strings are bases if followed by a subscript or superscript. This is the catch-all method of letting most any mathematical text be a subscript/superscript base. The user is cautioned to use reasonable choices so that the result is understandable to readers.

Complex script characters: In Indic scripts like Devanagari, a number of Unicode characters may be combined to form a character “cluster”. If such a cluster is followed by a subscript or superscript, the cluster becomes the base. However, this doesn’t occur for Arabic ligatures, for which only the last character is treated as the base. One can force the whole ligature to be the base by putting it inside a \begin \end expression, i.e., by making it an opaque string.

Ordinary text: Expressions entered in the linear format inside ASCII double quotes, such as "rate", are called ordinary text and are useful when you want to spell out the names of variables. Such ordinary text strings are treated as bases.
