UnicodeMath

09/07/2016

In writing the post Nemeth Braille—the first math linear format, I became increasingly aware that the Unicode Nearly Plain Text Encoding of Mathematics needed a better name than “linear format”. In addition to the Nemeth braille linear format, there are other math linear formats, some of which are described in the post Linear Format Notations for Mathematics. One of these, AsciiMath, inspired the name UnicodeMath, which is much more specific than “linear format”. UnicodeMath uses the Unicode math symbol set (see Math property in DerivedCoreProperties.txt), resembles real mathematical notation the most closely of all math linear formats, and handles almost every mathematical notation. Since Unicode characters are global by nature, UnicodeMath doesn’t need localization. [La]TeX and AsciiMath define characters using ASCII control words that are Western-centric, and perhaps need to be localized in non-Latin-based locales.

Advantages

In addition to being the most readable linear format, UnicodeMath is the most concise. It represents the simple fraction, one half, by the 3 characters “1/2”, whereas typical MathML takes 62 characters (consisting of the <mml:mfrac> element). This conciseness makes UnicodeMath an attractive format for storing mathematical expressions and equations, as well as for ease of keyboard entry. Another comparison is in the math structures for the Equation Tools tab in the Office ribbon. In Word, the structures are defined in OMML (Office MathML) and built up by Word, while for the other apps, the structures are defined in UnicodeMath and built up by RichEdit. The latter are much faster to build up and the equation data is much smaller. A dramatic example is the stacked fraction template (empty numerator over empty denominator). In UnicodeMath, this is given by the single character ‘/’. In OMML, it’s 109 characters! LaTeX is considerably shorter at 9 characters “\frac{}{}”, but is still 9 times as long as UnicodeMath. AsciiMath represents fractions the same way as UnicodeMath, so simple cases are identical. If Greek letters or other characters that require names in AsciiMath are used, UnicodeMath is shorter and more readable.
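The character counts above are easy to check with a few lines of Python. This is only a sketch: the UnicodeMath and LaTeX strings come from the text, but the MathML string is a guess at a typical serialization, so its length need not match the 62 characters quoted above.

```python
# Compare the lengths of a few fraction encodings.
# The UnicodeMath and LaTeX strings are taken from the post.
unicode_math_template = "/"        # empty stacked fraction
latex_template = r"\frac{}{}"      # empty stacked fraction
unicode_math_half = "1/2"
latex_half = r"\frac{1}{2}"

# Assumption: one plausible MathML serialization of one half. The post
# does not show the exact markup it counted, so this is illustrative only.
mathml_half = "<mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac>"


def length_ratio(longer: str, shorter: str) -> float:
    """Return how many times as long `longer` is compared with `shorter`."""
    return len(longer) / len(shorter)


for label, text in [
    ("UnicodeMath template", unicode_math_template),
    ("LaTeX template", latex_template),
    ("UnicodeMath one half", unicode_math_half),
    ("LaTeX one half", latex_half),
    ("MathML one half (assumed)", mathml_half),
]:
    print(f"{label:27} {len(text):3d} characters")
```

Calling length_ratio(latex_template, unicode_math_template) gives the factor of 9 mentioned above.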

Another advantage of UnicodeMath over MathML and OMML is that UnicodeMath can be stored anywhere Unicode text is stored. When adding math capabilities to a program, XML formats require redefining the program’s file format and potentially destabilizing backward compatibility, while UnicodeMath does not. If a program is aware of UnicodeMath math zones (see Section 3.20 of UnicodeMath), it can recover the built-up mathematics by passing those zones through the RichEdit UnicodeMath MathBuildUp function. In fact, you can roundtrip RichEdit documents containing math zones through the plain-text editor Notepad: the math zones are preserved!
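As a rough illustration of what being aware of math zones could mean for a program that only stores plain text, here is a minimal Python sketch that pulls the math zones out of a string. It assumes the zones are delimited by U+2045 (⁅) and U+2046 (⁆); check Section 3.20 for the actual conventions. The function name find_math_zones is made up for this example, and the sketch stops short of building anything up: the extracted strings are simply what a program would hand to a build-up routine.

```python
from __future__ import annotations

import re

# Assumption: math zones in plain text are bracketed by these two characters.
MATH_ZONE_START = "\u2045"  # LEFT SQUARE BRACKET WITH QUILL
MATH_ZONE_END = "\u2046"    # RIGHT SQUARE BRACKET WITH QUILL

# Non-greedy match so that each zone ends at the first closing delimiter.
_ZONE_PATTERN = re.compile(
    re.escape(MATH_ZONE_START) + "(.*?)" + re.escape(MATH_ZONE_END),
    re.DOTALL,
)


def find_math_zones(text: str) -> list[str]:
    """Return the UnicodeMath content of each delimited math zone, in order."""
    return _ZONE_PATTERN.findall(text)


sample = (
    f"One half is {MATH_ZONE_START}1/2{MATH_ZONE_END} and the root is "
    f"{MATH_ZONE_START}\u221a(a^2\u2212b^2){MATH_ZONE_END}."
)
print(find_math_zones(sample))
```

Because the zones are ordinary Unicode text, nothing in the surrounding file format has to change to carry them.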

AsciiMath

As its name implies, AsciiMath uses only ASCII characters, although it converts to MathML with access to a much larger character set. AsciiMath is relatively simple to parse and can handle many mathematical constructs. AsciiMath shares some methodology with UnicodeMath, such as eliminating the outer parentheses in fractions like (a+b)/c when converting to built-up format. AsciiMath is designed to work with a MathML renderer, such as MathJax. In Microsoft Office apps, UnicodeMath builds up to the LineServices math internal format, which is represented externally by OMML.
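The parenthesis elimination can be sketched in a few lines of Python. This is not the AsciiMath or RichEdit algorithm, just a toy under stated assumptions: it handles only the first '/' in the string, treats an operand as either a parenthesized group or a run of letters and digits, and emits LaTeX \frac as a stand-in for the built-up form. The helper names are invented for the example.

```python
def _group_start(s: str, close: int) -> int:
    """Return the index of the '(' that matches the ')' at s[close]."""
    depth = 0
    for i in range(close, -1, -1):
        if s[i] == ")":
            depth += 1
        elif s[i] == "(":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


def _group_end(s: str, open_: int) -> int:
    """Return the index of the ')' that matches the '(' at s[open_]."""
    depth = 0
    for i in range(open_, len(s)):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


def build_up_fraction(linear: str) -> str:
    """Build up the first linear fraction, dropping outer parentheses."""
    slash = linear.find("/")
    if slash <= 0 or slash == len(linear) - 1:
        return linear  # no fraction with both operands present

    # Numerator: a parenthesized group or an alphanumeric run before '/'.
    if linear[slash - 1] == ")":
        start = _group_start(linear, slash - 1)
        numerator = linear[start + 1:slash - 1]  # outer parentheses dropped
    else:
        start = slash
        while start > 0 and linear[start - 1].isalnum():
            start -= 1
        numerator = linear[start:slash]

    # Denominator: a parenthesized group or an alphanumeric run after '/'.
    if linear[slash + 1] == "(":
        close = _group_end(linear, slash + 1)
        denominator = linear[slash + 2:close]  # outer parentheses dropped
        end = close + 1
    else:
        end = slash + 1
        while end < len(linear) and linear[end].isalnum():
            end += 1
        denominator = linear[slash + 1:end]

    return f"{linear[:start]}\\frac{{{numerator}}}{{{denominator}}}{linear[end:]}"


print(build_up_fraction("(a+b)/c"))
print(build_up_fraction("a+b/c"))
```

The second call shows why the parentheses are needed in the linear form: without them only b is taken as the numerator, so they carry grouping information that the built-up fraction bar makes redundant.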

Math autocorrect

By default, the Office math autocorrect facility contains most [La]TeX math symbol control word definitions such as \beta for β. AsciiMath has a subset of such control words but omits the leading backslash. The user can modify such control words in the Office math autocorrect list or add them explicitly, but it’d probably be worth adding an option to make the leading backslash optional. That would speed up keyboard entry of UnicodeMath via math autocorrect. The RichEdit dll includes the UnicodeMath build up/down facility as well as converters for other math formats, such as MathML and OMML. It would be straightforward to add an option to the RichEdit UnicodeMath facility to accept AsciiMath input in general. Such an option would be handy for people that know AsciiMath.
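Here is a minimal Python sketch of what an optional-backslash lookup might look like. The table is a tiny illustrative subset, not the Office autocorrect list, and the function name and the backslash_optional flag are hypothetical.

```python
# A tiny, illustrative subset of control-word definitions.
AUTOCORRECT = {
    "\\alpha": "\u03b1",  # α
    "\\beta": "\u03b2",   # β
    "\\theta": "\u03b8",  # θ
    "\\pi": "\u03c0",     # π
    "\\sqrt": "\u221a",   # √
}


def autocorrect(word: str, backslash_optional: bool = False) -> str:
    """Return the symbol for a control word, or the word itself if unknown.

    With backslash_optional=True, a word typed without the leading
    backslash (AsciiMath style) is looked up as if it had one.
    """
    if word in AUTOCORRECT:
        return AUTOCORRECT[word]
    if backslash_optional and not word.startswith("\\"):
        return AUTOCORRECT.get("\\" + word, word)
    return word


print(autocorrect("\\beta"))                         # LaTeX-style entry
print(autocorrect("beta"))                           # unchanged by default
print(autocorrect("beta", backslash_optional=True))  # AsciiMath-style entry
```

Keeping the option off by default avoids surprising users who type ordinary words such as pi or beta inside a math zone.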

One C++-oriented autocorrect choice in AsciiMath is that typing != enters ≠. Although I program in C++ almost every day, I think /= is a better choice for entering ≠. For one thing, using != for ≠ complicates typing in an equation like n! = n(n-1)(n-2)…1, which is the main reason we didn’t implement it. But in Office apps this equation can also be entered by typing ! = instead of !=, since math spacing rules insert space between ! and = and the RichEdit UnicodeMath facility automatically deletes a user’s space if typed there (see User Spaces in Math Zones). So, that’s an easy workaround for entering an n! equation if one wants to support != for ≠. The RichEdit UnicodeMath facility supports most Unicode negated operators by sequences of / followed by the corresponding unnegated operator as described in the post Negated Operators.
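The "/ followed by the unnegated operator" rule can be imitated in Python with Unicode normalization. This is not how RichEdit implements it; the sketch leans on the fact that many negated operators canonically decompose into the base operator plus U+0338 (combining long solidus overlay), so NFC composition finds the precomposed character. The function names and the default operator list are choices made for this example.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def negate(op: str) -> str:
    """Return the precomposed negated form of a single operator character."""
    composed = unicodedata.normalize("NFC", op + COMBINING_LONG_SOLIDUS_OVERLAY)
    if len(composed) != 1:
        raise ValueError(f"no precomposed negation for {op!r}")
    return composed


def replace_slash_negations(text: str, operators: str = "=<>\u2264\u2265\u2261") -> str:
    """Replace each '/' + operator pair with the negated operator."""
    for op in operators:
        text = text.replace("/" + op, negate(op))
    return text


print(negate("="))                      # not equal to
print(replace_slash_negations("a/=b"))
```

A real implementation also has to decide when a '/' is a fraction bar instead, which this sketch ignores.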

<gripe> Meanwhile the C++ language should recognize ≠, ≤, ≥, and ≡ as aliases for !=, <=, >=, and ==. It seems primitive that C++ doesn’t do so in this Unicode age of computing. At least the C++ editing/debugging environments should have an option to display !=, <=, >=, and == as ≠, ≤, ≥, and ≡. </gripe>

Comparison

Here’s a table with various formats for the integral

[Image: the integral in built-up form]

Format	Representation
UnicodeMath	1/2𝜋 ∫_0^2𝜋▒ⅆ𝜃/(𝑎+𝑏 sin⁡𝜃 )=1/√(𝑎^2−𝑏^2 )
AsciiMath	1/(2pi) int_0^(2pi) (d theta)/(a+bsin theta)=1/sqrt(a^2-b^2)
LaTeX	\frac{1}{2\pi}\int_{0}^{2\pi}\frac{d\theta}{a+b\sin {\theta}}=\frac{1}{\sqrt{a^2-b^2}}

Note that UnicodeMath binds the integrand to the integral, whereas AsciiMath and LaTeX don’t delimit the integrand. The Presentation MathML and OMML for this integral are too long to put into this post.

Observations

There is a unicode-math conversion package for Unicode-enabled XeTeX and LuaLaTeX. The name UnicodeMath seems sufficiently different from unicode-math that there shouldn’t be any confusion between the two. The unicode-math package supports a variety of math fonts including Cambria Math, Minion Math, Latin Modern Math, TeX Gyre Pagella Math, Asana-Math, Neo-Euler, STIX, and XITS Math. Did you know there are so many math fonts?

Enjoy the new name UnicodeMath. I am, and it already appears near the end of my previous blog post, Nemeth Braille Alphanumerics and Unicode Math Alphanumerics. If you’re interested in the origin of UnicodeMath, read the post How I got into technical WP. The forerunner of UnicodeMath originated back in the early microcomputer days and had only 512 characters consisting of upright ASCII, italics, script, Greek and various mathematical symbols used in theoretical physics. Unicode 1.0 didn’t arrive until 10 years later.

Comments

Anonymous
September 07, 2016
I’m less sanguine about the ability to distinguish “UnicodeMath” from “unicode-math”, particularly as search engines will not so distinguish. (And if the LaTeX package is ever enhanced to incorporate Unicode Linear Math, the confusion will never cease.)
Anonymous
September 07, 2016
Great post. Will we ever get LaTeX-based math in Office? Maybe LaTeX objects? Thank you.
- Anonymous
  April 18, 2017
  Next month's blog post :-)
Anonymous
September 26, 2016
Can't find rich edit 8 in win 10, RICHEDIT60W in win 10 failed, and can't use BuildUpMathSucceed with office 2013 RICHED20.DLL. Any way to use BuildUpMath in win7 and higher OS without office?
Anonymous
December 15, 2017
> UnicodeMath is the most concise. It represents the simple fraction, one half, by the 3 characters “1/2”, whereas typical MathML takes 62 characters (consisting of the entity)
Conciseness is irrelevant. You need some consistency in your format design, no wonder OOXML is such a mess. A completely blank .docx document is 11.710 bytes of zipped, XML-encoded boilerplate, so I don't think anyone cares about embedding 62 extra bytes. And you can't exactly type "√" or "▒" on a standard keyboard either.
- Anonymous
  September 28, 2018
  Please see http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1.pdf for a description of how to enter “√” or “▒” with a keyboard (\sqrt, \of, etc.). I've found UnicodeMath to be considerably easier to use for unit tests. MathML/RTF/OMML are really hard to read in comparison. LaTeX is better and Unicode LaTeX still better, but UnicodeMath saves me, at least, lots of time.
