Plain Text Math in Bidirectional Contexts

In plain text, bidirectional text (text including some Arabic or Hebrew) is generally displayed according to the Unicode Bidi Algorithm (UBA). Since we’re interested in math, it’s pertinent to study a bit how simple mathematical expressions appear according to the UBA. Ideally math is displayed using higher-order protocols that overrule the UBA appropriately, but in this post let’s consider only very simple overrides that fix balanced parentheses, brackets and braces. An earlier post presents such an algorithm, which makes some mathematical expressions look reasonable in contrast to how they look with the UBA. The present post compares results using that algorithm with one that’s slightly different and yields better results. Mathematical text should be formatted according to rules for math zones. But plain text is still very common, so it’s useful to know how to make it do a better job at displaying simple formulae.

First some background Bidi stuff (skip this and the next paragraph if you already know the UBA). Bidi text runs are assigned a Bidi level, 0, 1, 2, … The even levels are displayed left-to-right (LTR) and the odd levels are displayed right-to-left (RTL). The Arabic and Hebrew scripts are displayed RTL and almost all other scripts are displayed LTR. The one major exception is that numbers are always displayed LTR, even if they use Arabic-Indic digits like ١٢٣ (123). A consequence of this exception is that even pure Hebrew text and pure Arabic text are bidirectional (have both LTR and RTL text runs) if they contain numbers.
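To make the character categories a little more concrete, here is a minimal Python sketch that looks up the Bidi class of a few characters with the standard library's unicodedata module and maps a level to its display direction. The helper name direction_of_level is made up for illustration; the class codes (L, R, AL, EN, AN, ON, ES) are the ones the Unicode character database reports.

```python
import unicodedata

def direction_of_level(level: int) -> str:
    """Even Bidi levels display left-to-right, odd levels right-to-left."""
    return "LTR" if level % 2 == 0 else "RTL"

# A Latin letter, a Hebrew letter, an Arabic letter, a European digit,
# an Arabic-Indic digit, a parenthesis, and a plus sign.
samples = ["a", "\u05D0", "\u0641", "1", "\u0661", "(", "+"]

for ch in samples:
    bidi_class = unicodedata.bidirectional(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class}")

for level in range(4):
    print(level, direction_of_level(level))
```

The letters come back as strong types (L, R, AL), the digits as number types (EN, AN), and the parenthesis and plus sign as the weak or neutral types whose handling the next paragraph is about.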

The complexity of the UBA enters in the handling of so-called neutral characters of various kinds, e.g., punctuation including parentheses, brackets, and braces, along with the overall placement of numbers. In a pure math zone, all such neutrals (except for period, comma, and space) have the directionality of the math zone itself. But outside math zones, the rules are given by the UBA and are considerably more complicated. The reason this statement applies to pure math zones is that “impure” math zones contain one or more runs of embedded normal text that typically follow the rules of the UBA.

Okay, so here’s the first UBA surprise: an LTR paragraph consisting of “a(b)” displays this text as such, but alone in an RTL paragraph it displays as “(a(b”. This is because in the RTL paragraph (base level 1) the opening parenthesis is between two strong LTR characters, and accordingly all three are assigned level 2 to force them to be displayed LTR. But the closing parenthesis is between the “b” and the end-of-paragraph mark, so the UBA assigns it the paragraph’s level (1), which causes it to be mirrored and placed to the left. Clearly this isn’t what you want.
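The level assignment just described can be reproduced with a small Python model. This is only a sketch under simplifying assumptions, not a UBA implementation: every character that isn't a strong L, R, or AL is treated as a neutral (so numbers, explicit embeddings, and any bracket-pair handling are left out), a neutral run takes the direction of its neighbors when they agree and the paragraph direction otherwise, and the function names are made up for illustration.

```python
import unicodedata

MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}

def strong_type(ch):
    """Collapse a Unicode Bidi class to 'L', 'R', or 'N' (treated as neutral)."""
    cls = unicodedata.bidirectional(ch)
    return "L" if cls == "L" else "R" if cls in ("R", "AL") else "N"

def resolve_levels(text, para_level):
    """Simplified level assignment: strong characters, then runs of neutrals."""
    para_dir = "R" if para_level % 2 else "L"
    dirs = [strong_type(ch) for ch in text]
    i = 0
    while i < len(dirs):
        if dirs[i] != "N":
            i += 1
            continue
        j = i
        while j < len(dirs) and dirs[j] == "N":
            j += 1
        # Paragraph boundaries count as having the paragraph direction.
        before = dirs[i - 1] if i > 0 else para_dir
        after = dirs[j] if j < len(dirs) else para_dir
        dirs[i:j] = [before if before == after else para_dir] * (j - i)
        i = j
    # Opposite-direction characters sit one level above the paragraph level.
    return [para_level if d == para_dir else para_level + 1 for d in dirs]

def visual_order(text, levels):
    """Reverse runs from the highest level down to 1; mirror odd-level brackets."""
    chars = [MIRROR.get(c, c) if lvl % 2 else c for c, lvl in zip(text, levels)]
    order = list(range(len(text)))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] < level:
                i += 1
                continue
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return "".join(chars[k] for k in order)

def display(text, para_level):
    return visual_order(text, resolve_levels(text, para_level))

print(resolve_levels("a(b)", 1))  # [2, 2, 2, 1]
print(display("a(b)", 0))         # a(b)
print(display("a(b)", 1))         # (a(b
```

The returned string is in visual order, left to right. The closing parenthesis is the only character left at level 1, which is why it alone gets mirrored and ends up at the left edge.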

What does the parentheses algorithm do? It displays the text as “(b)a” when alone in an RTL paragraph. This is actually exactly how it would be displayed in an RTL math zone. But more involved expressions don’t share the same fortune. Consider the LTR “f(x)+g(y)”. In an RTL math zone, this should be displayed as “(y)g+(x)f”, i.e., right to left. The UBA displays it as “(f(x)+g(y” and the parentheses algorithm displays it as “(y)f(x)+g”. Similarly alone in an RTL paragraph the LTR expression “f(x)g(y)” displays as “(f(x)g(y” according to the UBA and “(y)f(x)g” according to the parentheses algorithm. The UBA is clearly suboptimal in all cases, but although the parentheses algorithm does display parentheses as you would expect, it doesn’t succeed in a reasonable ordering of the text units.

To see if we can improve this situation, let’s modify the parentheses algorithm in a way proposed by Windows developer Robert Steen. The original algorithm sets the parentheses pair level equal to the smaller of the two parenthesis levels and then increments that level if the level doesn’t have the same directionality as the paragraph (or more generally, the current embedding). Instead, Robert chooses the larger of the two parenthesis levels for the pair. In both choices, if characters inside the parentheses have a level less than that of the pair, the levels of those characters are incremented by 2 to keep the characters inside the parentheses and to keep their same directionality.
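Here is a hedged Python sketch of the two variants, useful for experimenting with the examples in this post. It rests on several assumptions of mine: the starting levels come from a much-simplified model of the UBA (strong L/R/AL characters plus neutrals, nothing else), only the paragraph level is considered rather than nested embeddings, the increment step is kept unchanged in the modified variant, and all function names are hypothetical. The only difference between the variants is the pick argument: min for the original algorithm, max for the modified one.

```python
import unicodedata

OPEN = {"(": ")", "[": "]", "{": "}"}
MIRROR = {**OPEN, **{close: open_ for open_, close in OPEN.items()}}

def strong_type(ch):
    cls = unicodedata.bidirectional(ch)
    return "L" if cls == "L" else "R" if cls in ("R", "AL") else "N"

def resolve_levels(text, para):
    """Simplified UBA-style levels: neutrals follow agreeing neighbors."""
    para_dir = "R" if para % 2 else "L"
    dirs = [strong_type(ch) for ch in text]
    i = 0
    while i < len(dirs):
        if dirs[i] != "N":
            i += 1
            continue
        j = i
        while j < len(dirs) and dirs[j] == "N":
            j += 1
        before = dirs[i - 1] if i > 0 else para_dir
        after = dirs[j] if j < len(dirs) else para_dir
        dirs[i:j] = [before if before == after else para_dir] * (j - i)
        i = j
    return [para if d == para_dir else para + 1 for d in dirs]

def fix_pairs(text, levels, para, pick=min):
    """Give each balanced pair one level: pick=min (original), pick=max (modified)."""
    levels = list(levels)
    stack = []
    for i, ch in enumerate(text):
        if ch in OPEN:
            stack.append(i)
        elif stack and OPEN[text[stack[-1]]] == ch:
            o = stack.pop()
            pair = pick(levels[o], levels[i])
            if pair % 2 != para % 2:      # direction differs from the paragraph
                pair += 1
            levels[o] = levels[i] = pair
            for k in range(o + 1, i):     # keep contents inside, same direction
                if levels[k] < pair:
                    levels[k] += 2
    return levels

def visual_order(text, levels):
    """Reverse runs from the highest level down to 1; mirror odd-level brackets."""
    chars = [MIRROR.get(c, c) if lvl % 2 else c for c, lvl in zip(text, levels)]
    order = list(range(len(text)))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] < level:
                i += 1
                continue
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return "".join(chars[k] for k in order)

def display(text, para, pick=None):
    levels = resolve_levels(text, para)
    if pick is not None:
        levels = fix_pairs(text, levels, para, pick)
    return visual_order(text, levels)

expr = "f(x)+g(y)"
print(display(expr, 1))        # simplified UBA:     (f(x)+g(y
print(display(expr, 1, min))   # original algorithm: (y)f(x)+g
print(display(expr, 1, max))   # modified algorithm: f(x)+g(y)
```

The strings returned are in visual order, so results for Arabic or Hebrew input are best inspected character by character (for example with repr or ord), since a Bidi-aware terminal or editor will reorder them again when it displays them.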

According to this modified algorithm, the LTR “f(x)+g(y)” displays unchanged, as “f(x)+g(y)”, when alone in an RTL paragraph, the same way it displays in an LTR paragraph. Okay, this isn’t the way it should display in an RTL math zone, but at least it’s readable and understandable. Ditto for “f(x)g(y)”. Let’s check it out using Arabic variables.

Alone in an LTR paragraph the logically ordered “ف(ق)+ك(ل)” displays according to the UBA as “ل)ك+(ق)ف)”. Alone in an RTL paragraph it displays as “(ل)ك+(ق)ف”, which is how it displays in an RTL math zone.

According to the parentheses algorithm alone in an LTR paragraph, “ف(ق)+ك(ل)” displays as “ك+(ق)ف(ل)”, which isn’t what you want. Alone in an RTL paragraph it displays as “(ل)ك+(ق)ف”, which is also the way it displays in an RTL math zone.

According to the modified parentheses algorithm alone in an LTR or RTL paragraph, “ف(ق)+ك(ل)” displays as “(ل)ك+(ق)ف”, the way it displays in an RTL math zone. So with this algorithm, simple math expressions using either all LTR variables or all RTL variables display with the directionality of the variables in paragraphs of either directionality. This seems to be a better plain-text compromise for mathematics.