Recognizing LaTeX Input in UnicodeMath Input Mode

In offering a LaTeX math input mode, we’ve run into the problem that a user might type some LaTeX while the UnicodeMath input mode is active and get something unintended and confusing. This post describes ways in which the build-up engine can recognize this situation and cue the user to switch to the LaTeX input mode. Furthermore, some purely LaTeX constructs, like \frac{}{}, could be handled correctly in UnicodeMath input mode. Doing so seems more user-friendly than building up to an undesired result. UnicodeMath and LaTeX are compared a bit here.

Symbol Control Words

Control words for symbols work in either input mode, for example, \alpha inserts α in both modes. So, there’s no need to change the input mode for symbol control words. Similarly, Unicode symbols like ∬ work in both input modes. The build-up engine supports Unicode LaTeX since the Office math facility was based on Unicode from the start. Note that UnicodeMath is defined in terms of Unicode symbols, not ASCII-letter control words, but the latter are supported by the input engine for ease of entry on standard keyboards. On-screen keyboards may offer more direct ways of entering Unicode symbols.

TeX Math Zone Delimiters

If a math zone begins with a $, the input must be TeX or LaTeX, since $ has no special significance in UnicodeMath and Office apps use the math-zone character format effect to define math zones. But the user might not start with a $, so it’s worth handling other ways that distinguish the formats. The LaTeX math-zone start delimiters \[ and \( have useful meanings in UnicodeMath, namely to treat the [ and ( literally instead of treating them as autosizing build-up delimiters.

Structure and Environment Control Words

Some structure control words such as \frac and \binom are only defined in LaTeX, while others like \matrix and \pmatrix are defined in both modes. The pain for the user comes from typing something like \frac{a}{b} in UnicodeMath mode. The {…} get built up as curly-braced expressions and the \frac remains unchanged. No fraction results and the user may wonder what went wrong.

When the user types LaTeX-only structure control words like \frac or \binom in UnicodeMath input mode, it’s clear that LaTeX is intended and the user can be asked whether the input mode should switch to LaTeX. Similarly, structure control words valid in both input modes become unambiguous when the user types the argument start delimiter. For LaTeX the start delimiter is {, while for UnicodeMath it’s (. So, \matrix( must be UnicodeMath, while \matrix{ must be LaTeX. Note that LaTeX by design supports the original TeX control-word sequences like \matrix{…} as well as the LaTeX environments like \begin{matrix}…\end{matrix}. In UnicodeMath autobuildup mode, no build up occurs when the user types \matrix{, so it’s possible at that point to switch to LaTeX input without need for retyping.

Both input modes have \begin and \end, but in LaTeX these are environment control words followed by {, whereas in UnicodeMath they represent generic start/end delimiters for which curly braces would be superfluous. So as soon as the user types { following \begin or \end, a cue recommending a switch to LaTeX input mode can be displayed.
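The delimiter rules above can be sketched in a few lines of Python. This is only an illustration, not the Office build-up engine: the word lists are a small hypothetical subset, and the function name classify_structure is made up for this example.

```python
import re

# Hypothetical word lists for illustration; a real engine's tables are larger.
LATEX_ONLY = {"frac", "binom"}
SHARED = {"matrix", "pmatrix"}
ENVIRONMENT = {"begin", "end"}

# A control word, optionally followed by its first argument delimiter.
_CONTROL_WORD = re.compile(r"\\([A-Za-z]+)([({])?")

def classify_structure(text):
    """Return 'latex', 'unicodemath' or 'ambiguous' for the first structure
    control word in text, or None if no structure control word is present."""
    for match in _CONTROL_WORD.finditer(text):
        word, delim = match.group(1), match.group(2)
        if word in LATEX_ONLY:
            return "latex"
        if word in SHARED:
            if delim == "{":
                return "latex"
            if delim == "(":
                return "unicodemath"
            return "ambiguous"
        if word in ENVIRONMENT:
            # Only a following { marks a LaTeX environment.
            return "latex" if delim == "{" else "ambiguous"
    return None
```

With these assumptions, \matrix( classifies as UnicodeMath, \matrix{ and \begin{ as LaTeX, and a bare \matrix stays ambiguous until the next character arrives. Symbol control words such as \alpha are ignored, since they work in both modes.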

Math Functions

Math functions are also treated differently in LaTeX and in UnicodeMath. To enter the sine function in LaTeX, one types \sin, whereas in UnicodeMath, one just types sin. So, if a math function name is entered preceded by \, a cue recommending a switch to LaTeX input mode can be displayed. The Office math display engine needs to know the argument of a math function as well as the function name in order to insert the correct math spacing. LaTeX doesn’t have a formal way of defining the argument, although enclosing it in curly braces is a good idea. UnicodeMath has precise ways of defining the argument. This is also true for integrands of integrals and n-aryands of n-ary operators in general.
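As a rough sketch of the backslash cue for math functions, the following Python looks for function names typed in the LaTeX style. The set of function names is an illustrative assumption, not the list the Office engine uses, and latex_style_functions is a hypothetical helper name.

```python
import re

# Illustrative subset of function names; not the engine's actual list.
MATH_FUNCTIONS = {"sin", "cos", "tan", "log", "ln", "exp", "lim"}

def latex_style_functions(text):
    """Return the math function names that were typed with a leading
    backslash, in the order they appear."""
    return [name for name in re.findall(r"\\([A-Za-z]+)", text)
            if name in MATH_FUNCTIONS]
```

A nonempty result while UnicodeMath input mode is active would be the point at which to show the cue. Plain sin, and symbol control words like \alpha, produce no cue.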

Superscripts and Subscripts

The input a^2+b^2=c^2 represents the same equation in either input mode, but a^10+b^10=c^10 represents a¹⁰ + b¹⁰ = c¹⁰ in UnicodeMath and a¹0 + b¹0 = c¹0 in LaTeX, because LaTeX raises only the single token that follows the ^. It doesn’t seem possible to distinguish the user intent for such cases, but it’d be worth asking the user who types a^{ or a_{ whether to switch to LaTeX, since superscripts and subscripts enclosed in curly braces aren’t common in mathematical expressions. Expressions involving e^{…} do occur, but it’s better typography to use exp{…} instead of raising e to a braced power.
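The difference in how the two modes read a^10 can be illustrated with a deliberately simplified Python sketch. It handles only a digit run or a single character after the first ^, and both function names are hypothetical; real parsing of either format involves much more.

```python
import re

def superscript_argument(text, mode):
    """Return the text that becomes the superscript after the first '^'.

    Simplified: ignores braces, parentheses and control words."""
    _, caret, rest = text.partition("^")
    if not caret:
        return ""
    if mode == "unicodemath":
        # A run of digits is taken as a single operand.
        digits = re.match(r"\d+", rest)
        return digits.group(0) if digits else rest[:1]
    if mode == "latex":
        # Without braces, only the next single token is raised.
        return rest[:1]
    raise ValueError(f"unknown mode: {mode}")

def braced_script_cue(text):
    """True if a superscript or subscript operator is followed by '{'."""
    return bool(re.search(r"[\^_]\{", text))
```

For a^10+b^10=c^10 the first function returns 10 in UnicodeMath mode and 1 in LaTeX mode, which is the ambiguity described above. The second function flags a^{ and a_{, the cases where asking the user seems worthwhile.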

Miscellaneous Control Sequences

Font control words like \mathbf{ are distinctly LaTeX. The TeX binomial-coefficient construct {n\choose k} doesn’t make sense in UnicodeMath (one would type n\choose k without the curly braces). But {n\atop might be used in UnicodeMath since {n\atop k} would build up as n over k (without a fraction bar) enclosed in {}. Admittedly this construct is unlikely since binomial coefficients appear in parentheses, not in curly braces.

Conclusions

We see that there are quite a few [La]TeX constructs that don’t make sense in UnicodeMath and can be used to query the user about switching from UnicodeMath input mode to LaTeX input mode. In addition, such LaTeX-oriented control sequences could be handled directly in UnicodeMath mode. The math build-up engine in Microsoft Office uses the same operator and string stacks for both modes, so it’s fairly straightforward to treat constructs like \frac{…}{…}, \matrix{…}, \begin{matrix}…\end{matrix} directly in UnicodeMath mode. This might make math input more user friendly for people familiar with LaTeX. And it might facilitate migrating to the speedier, more mathematical UnicodeMath input mode. But it does compromise using the build-up engine as a UnicodeMath validator. To that end, if the build-up engine is modified to handle these LaTeX control sequences in UnicodeMath mode, it might be worth having a “strict” mode that would fail input with invalid UnicodeMath. In any event, build-down results are all in one format or the other, not in a mixture of the two.
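One way to picture the “strict” mode is a check that reports LaTeX-oriented constructs and, when strict, rejects the input instead of handling them. The Python below is a sketch under stated assumptions: the pattern list covers only the constructs named in this section, and check_unicodemath and InvalidUnicodeMath are hypothetical names, not part of any Office API.

```python
import re

# Hypothetical cue patterns, limited to constructs named in this post.
LATEX_CUES = [
    re.compile(r"\\(?:frac|binom)\b"),                  # LaTeX-only structure words
    re.compile(r"\\(?:matrix|pmatrix|begin|end)\{"),    # brace-delimited arguments
]

class InvalidUnicodeMath(ValueError):
    """Raised in strict mode when LaTeX-oriented input is found."""

def check_unicodemath(text, strict=False):
    """Return the LaTeX cues found in text. In strict mode, raise
    InvalidUnicodeMath instead of returning a nonempty list."""
    cues = [m.group(0) for pattern in LATEX_CUES
            for m in pattern.finditer(text)]
    if cues and strict:
        raise InvalidUnicodeMath(f"LaTeX constructs in UnicodeMath input: {cues}")
    return cues
```

In the lenient case the caller could use the returned cues to decide whether to build up the LaTeX construct directly or to prompt the user. In the strict case the same input fails, which preserves the engine’s use as a UnicodeMath validator.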