XeTeX text layout strangely sensitive to spaces

Question

I have a document in a script that requires complex text layout which I believe is supposed to work in XeTeX. But I get surprising results:

\documentclass{article}
\usepackage{fontspec}

\tracinglostchars=2 % https://tex.stackexchange.com/a/41235/48
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}

\begin{document}

\fontspec{Arial Unicode MS} \testtext

\fontspec{Noto Sans Kannada} \testtext

\fontspec{Noto Serif Kannada} \testtext

\fontspec{Kedage} \testtext

\end{document}

When compiled with xelatex this gives:

For those who cannot read the script, the thing on the left (when the input has R ಶ್ರೀವತ್ಸ with a space after the R) is correct, while the thing on the right (the input has the same text but without the space after the R) is not.

I understand the “boxes” in the output: they are because the Kannada fonts selected don't have the R character in them. (A message to this effect is printed in the terminal thanks to \tracinglostchars=2.)

Question: Why is the output wrong when the space is omitted? And how can I make things work properly even without the space?

As I understand it, in XeTeX the text layout (aka text rendering, aka text shaping) is provided by the HarfBuzz library, which is used by many other applications and should be able to handle this text fine. LuaTeX, by contrast, tries to avoid system dependencies and to implement everything itself (in Lua code), which probably underestimates the complexity of text layout. In any case, LuaTeX currently has no support for any Indic scripts other than Devanagari and Malayalam. So this is what lualatex produces for the above file:

(At least it's consistently wrong which I understand!)

Edit: Thanks to @cfr's answer below, I know what I should do to resolve the actual problem: specify the script when loading the font (e.g. \fontspec{Noto Sans Kannada}[Script=Kannada] or the better way in her answer). So it's possible to resolve the issue; the only remaining question is: What's going on?

And for what it's worth, here's a minimal plain-XeTeX file that reproduces the issue (compile with xetex rather than xelatex):

\font\notosansnone="Noto Sans Kannada"
% \font\notosanskndt="Noto Sans Kannada:script=knd2"
\font\notosansknda="Noto Sans Kannada:script=knda"

\def\testtext{R ಶ್ರೀ Rಶ್ರೀ}

{\notosansnone \testtext} (No script)

% {\notosanskndt \testtext} (knd2)

{\notosansknda \testtext} (knda)

\bye

My guess: XeTeX detects “runs” of text and asks HarfBuzz to do the text layout for those runs, and is probably determining the script of each “run” based on the first character in it. But this is just a guess, and in any case I need to know how to fix the problem. — ShreevatsaR, Jul 11 '17 at 18:29
@UlrikeFischer But (1) I don't want to change the input: I have a huge input file that contains text like Rಶ್ರೀವತ್ಸ and I want it to work as is. I think it should be possible. (Some context: the input file is exported from .odt (that my mother typed in) using writer2latex, and she's going to continue to edit it, and I'd rather not have to preprocess the input file each time.) And (2) I want to understand why this is happening here. — ShreevatsaR, Jul 11 '17 at 19:39
@cfr Good question, actually Polyglossia does support this script and would surely help with hyphenation in the whole document (as in this answer), but in this case adding it doesn't seem to affect this particular problem. — ShreevatsaR, Jul 11 '17 at 19:53
@ShreevatsaR Polyglossia probably isn't itself that helpful. I expect it is just the font configuration which matters. — cfr, Jul 11 '17 at 20:15
What happens in other applications? E.g. LibreOffice or Word? I'm guessing this behaviour is not TeX specific at all. I think it is coded into the lookup tables. However, this is just a guess looking at the font in FontForge. (If I knew something about the script or more about this kind of font, I might be more confident about my guess than I am.) — cfr, Jul 12 '17 at 04:02
What I don't understand really is why it works without the space when you do specify the script. It seems to me that it really ought not do so because it needs the space. — cfr, Jul 12 '17 at 04:10
@cfr In other applications it seems to work fine without the space. But then again they try to do a lot of complex things like font fallback (and they don't even have a way of specifying the script) so they may be doing something behind the scenes. — ShreevatsaR, Jul 12 '17 at 04:23

score 5 · Accepted Answer · answered Jul 11 '17 at 20:14


I don't have the first or last font. However, Polyglossia works correctly for me. (I assume it would probably also work with just the correct font configuration, but I did it this way as this is presumably what you want in the end.)

\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{kannada}
\setotherlanguage[variant=british]{english}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\kannadafontsf{Noto Sans Kannada}[Script=Kannada]
\tracinglostchars=2 % https://tex.stackexchange.com/a/41235/48
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}

\begin{document}

% \fontspec{Arial Unicode MS} \testtext

\testtext

\sffamily \testtext

% \fontspec{Kedage} \testtext

\end{document}

answered by cfr

Ooh very helpful, thank you! I tried minimizing the difference between the two files, and found that the main thing is adding [Script=Kannada] when naming/defining the font. With it, things work, and without it they don't. (E.g. adding [Script=Kannada] to my file makes things work, or removing it from yours brings back the problem.) I think we're much closer now to understanding why… – ShreevatsaR Jul 11 '17 at 20:26

@ShreevatsaR Yes. That's why I said I thought it was the font configuration which mattered rather than Polyglossia per se. I copied from Polyglossia's manual, but you can give the same configuration with just fontspec, as you say. – cfr Jul 11 '17 at 21:24

ShreevatsaR · Answer 2 · 2017-07-13T02:54:00.163

(Sharing what I understood as a result of all this.)

Solutions

Firstly, the solutions to the problem:

As @cfr's answer pointed out, I should have used [Script=Kannada] for this font, as documented in the fontspec and polyglossia manuals. And when it's used, everything works as expected: with the space or without, the whole text is rendered as appropriate for the Kannada script.
Additionally, we actually don't want the non-Kannada characters like the R rendered in the Kannada script: the different-script characters like R must be marked as being in a different language or at least a different font (see below for how to do this).
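
A minimal sketch of the first point, using fontspec alone (no polyglossia). The family names \kannadaserif and \kannadasans are made up for this illustration, and it assumes both Noto fonts are installed. The R should still come out as a missing-glyph box, since these fonts do not contain it.

```latex
% Compile with xelatex.
% Assumes Noto Serif Kannada and Noto Sans Kannada are installed.
\documentclass{article}
\usepackage{fontspec}

% Declare the script once, at the point where each font family is defined.
% (\kannadaserif and \kannadasans are illustrative names.)
\newfontfamily\kannadaserif{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\kannadasans{Noto Sans Kannada}[Script=Kannada]

\tracinglostchars=2 % report the missing R in the terminal
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}

\begin{document}

{\kannadaserif \testtext}

{\kannadasans \testtext}

\end{document}
```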

So is this a bug, either in XeTeX or some library it uses? No, I'd say it's a user error. Still, the fact that everything works fine when there are spaces between words (without having to specify the script) perhaps makes this user error more likely.

Explanation

What explains this discrepancy in behaviour depending on the space (just what is going on)? And can this behaviour be changed in XeTeX? What I found is the following.

The library used by XeTeX for text layout, namely HarfBuzz (which is used in Firefox, Chrome, LibreOffice, etc., see What is Harfbuzz?), comes with a command-line program called hb-view which can be invoked with a font and a string of text. I ran it on the following four inputs, each one both as shown and with --script=knda added:

hb-view NotoSansKannada-Regular.ttf "ಶ್ರೀ"
hb-view NotoSansKannada-Regular.ttf " ಶ್ರೀ"
hb-view NotoSansKannada-Regular.ttf "Rಶ್ರೀ"
hb-view NotoSansKannada-Regular.ttf "R ಶ್ರೀ"

What this shows is that the output is correct if either the first non-space character is from the right script, or the script is specified explicitly.

So the behaviour seen in XeTeX (the difference between "Rಶ್ರೀ" and "R ಶ್ರೀ") is explained by what @Ulrike Fischer pointed out in The XeTeX companion:

XeTeX’s approach is the following:

the typesetting process collects runs of characters (words) whose widths are obtained via the API to the system libraries […] to determine the widths,

a XeTeX paragraph is a sequence of word nodes separated by glue.

Thus XeTeX’s typesetting engine places words rather than glyphs, the latter being drawn by the font rendering engine.

(The “system libraries” and “font rendering engine” above are HarfBuzz now (thanks to Khaled Hosny); they used to be ICU earlier.) So

with “Rಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz to render that whole string as one unit, which fails (as seen in the hb-view experiments above) because it neither starts with a character from the desired script nor did we specify the script correctly, while
with “R ಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz separately for each of the two words, and in this case the second word is correctly rendered (even if we didn't specify the script) because it starts with a character from the correct script.

Still it seems best not to rely on such guessing, and specify the script explicitly.
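
The word-run explanation above suggests a quick experiment, sketched here for plain XeTeX. A zero-width kern puts a non-character node between the R and the Kannada text, so the two should be handed to HarfBuzz as separate runs even though no space is visible. This is an assumption that follows from the explanation, not something I have checked in the XeTeX sources. If it holds, the third line should be shaped like the first.

```latex
% Compile with xetex (plain XeTeX), not xelatex.
% Assumes Noto Sans Kannada is installed.
\font\notosansnone="Noto Sans Kannada" % no script specified, as in the question

{\notosansnone R ಶ್ರೀ} (space: two runs)

{\notosansnone Rಶ್ರೀ} (no space: one run)

% The space after 0pt only terminates the dimension; it is not typeset.
{\notosansnone R\kern0pt ಶ್ರೀ} (zero-width kern between R and the rest)

\bye
```

This only illustrates the mechanism. It still relies on the script being guessed.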

Working with both scripts

To have both scripts work smoothly, we ought to specify that the characters like R are in a different language. We could do this by writing \textenglish{R}ಶ್ರೀವತ್ಸ instead of Rಶ್ರೀವತ್ಸ. If we don't want to change the input though, there is a way to do this using the ucharclasses package.
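
For comparison, here is a sketch of the manual-markup route, where the input is changed and each R is wrapped in \textenglish. It reuses the fonts from the example in this answer and assumes that Noto Serif Kannada and Georgia are installed.

```latex
% Compile with xelatex.
% Assumes Noto Serif Kannada and Georgia are installed.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}

\setdefaultlanguage{kannada}
\setotherlanguage{english}

% polyglossia picks up \kannadafont and \englishfont for the two languages.
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}

\begin{document}

% Each R is explicitly marked as English; the rest stays in the default language.
\textenglish{R} ಶ್ರೀವತ್ಸ \quad \textenglish{R}ಶ್ರೀವತ್ಸ

\end{document}
```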

I wasn't able to get it to work for some reason, so I just did it manually (referring to the example in texdoc xetex and a post from the author of ucharclasses, and with 255 changed to 4095 as mentioned in for example this answer):

\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}

\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}
\setdefaultlanguage{kannada}
\setotherlanguage{english}

\XeTeXinterchartokenstate = 1   % Enable the character classes functionality

\newXeTeXintercharclass \CharEnglish
\XeTeXcharclass `R = \CharEnglish

\XeTeXinterchartoks 0 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks 4095 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks \CharEnglish 0 = {\selectlanguage{kannada}}
\XeTeXinterchartoks \CharEnglish 4095 = {\selectlanguage{kannada}}

\begin{document}

R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ

\end{document}

This changes the language every time we move between an English character (only R above) and either a word boundary (4095) or a regular (not specified to be English) character (0).

For my original document, to deal with all the English characters, I wrote a loop to do the equivalent of

\XeTeXcharclass `R = \CharEnglish

for every uppercase and lowercase letter of the alphabet:

\newcount\tmpchar
\tmpchar = `A
\loop
  \ifnum \tmpchar < `[          % [ comes just after Z
    \XeTeXcharclass \tmpchar = \CharEnglish
    \XeTeXcharclass \lccode \tmpchar = \CharEnglish
    \advance \tmpchar by 1
\repeat
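
For reference, a sketch of how the loop fits into the full document. The only ordering constraint I am relying on is that \CharEnglish must be allocated with \newXeTeXintercharclass before the loop assigns characters to it. As before, it assumes that Noto Serif Kannada and Georgia are installed.

```latex
% Compile with xelatex.
% Assumes Noto Serif Kannada and Georgia are installed.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}

\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}
\setdefaultlanguage{kannada}
\setotherlanguage{english}

\XeTeXinterchartokenstate = 1   % Enable the character classes functionality
\newXeTeXintercharclass \CharEnglish

% Put A-Z and a-z into the English class (replaces the single line for R).
\newcount\tmpchar
\tmpchar = `A
\loop
  \ifnum \tmpchar < `[          % [ comes just after Z
    \XeTeXcharclass \tmpchar = \CharEnglish
    \XeTeXcharclass \lccode \tmpchar = \CharEnglish
    \advance \tmpchar by 1
\repeat

% Switch language at every transition into or out of the English class.
\XeTeXinterchartoks 0 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks 4095 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks \CharEnglish 0 = {\selectlanguage{kannada}}
\XeTeXinterchartoks \CharEnglish 4095 = {\selectlanguage{kannada}}

\begin{document}

R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ

\end{document}
```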

I don't know the details but I do know that xetex works on "words". See e.g. here around page 31 (the document is outdated, it knows e.g. nothing about harfbuzz!) http://xml.web.cern.ch/XML/lgc2/xetexmain.pdf . That's why I suggested to separate the R from the rest. — Ulrike Fischer, Jul 12 '17 at 09:23
@UlrikeFischer Wow that's fantastic, thank you so much! The document looks really excellent. Very clear. And the page you quoted is exactly helpful, e.g. it says “XeTeX’s approach is the following: • the typesetting process collects runs of characters (words) whose widths are obtained via the API to the system libraries (e.g., ICU) to determine the widths, • a XeTeX paragraph is a sequence of word nodes separated by glue” Thanks again, it makes so much sense. — ShreevatsaR, Jul 12 '17 at 09:30
Note that TeX never typesets a space in the sense of a character. It is always interword glue, whatever the engine. Also, note that the way the font is constructed, it is relying heavily on 'chained' substitutions. My initial guess was that these might be working similarly to start of word ligatures in standard TeX fonts. However, that would suggest it should show the same behaviour in, say, LibreOffice, which doesn't appear to be the case. However, the way these substitutions work may nonetheless be part of the story here. Take a look at the substitutions and opentype features in FontForge. — cfr, Jul 12 '17 at 15:43
Very useful info re. HarfBuzz, by the way. Thanks for hunting. Typically, substitutions may pick up the context e.g. substitute if c is at the start of a word, substitute s if it is in the middle of a word, substitute z if at the end of a word, substitute c if at the start of the word and followed by h ... and so on .... But, as I understand it, with TTF/OTF, these would be features of the font rather than of XeTeX per se. — cfr, Jul 12 '17 at 15:47
@cfr You're right. As I understand it, in OpenType fonts like this, the substitutions are indeed features specified in the font, and HarfBuzz is the library used by XeTeX that understands these features of the font and “acts” on them. So for example the font may say “when char c1 is followed by char c2, substitute glyph g3” (which I think is the case here), and HarfBuzz is responsible for making that happen and passing the correct glyphs to XeTeX. I’ll try to look into the font as you suggested. — ShreevatsaR, Jul 12 '17 at 16:46
Something different goes on in e.g. LibreOffice because I get the R even if I have everything set to Noto Serif Kannada, which doesn't have the glyph. Also, it substitutes a sans glyph. So it is clearly doing some kind of substitution. Also, if I just paste the examples, it picks up that the document includes multiple languages automatically. So it is doing a lot more automatic recognition stuff than XeTeX. (This is to be expected, I think. XeTeX doesn't try to do this recognition automatically, although I believe Polyglossia can do some auto-recognition to some degree in some cases.) — cfr, Jul 12 '17 at 22:50
But really, the R needs to be marked up as a different language. (Or the rest, but I'm assuming the R is going to be the non-default.) — cfr, Jul 12 '17 at 22:51
@cfr Thanks for your interest in this. You're right, the R needs to be marked up as a different language… I added how I ended up doing that. — ShreevatsaR, Jul 13 '17 at 00:39
Interesting. Very. Curious that beginning and ending word boundary gets the same code. In a traditional font, you have to handle these cases separately and beginning of word boundaries need an additional slot (if I'm remembering things the right way around). Your usage doesn't seem to be fully documented in the XeTeX manual .... — cfr, Jul 13 '17 at 01:08
@cfr I based the usage on the example from texdoc xetex and some explanation from the author of ucharclasses here, but with 255 changed to 4095 (as I found e.g. in this answer) — it indeed all seems somewhat obscure and poorly documented. :-) — ShreevatsaR, Jul 13 '17 at 01:27
You should add the references to your answer - comments get deleted. — cfr, Jul 13 '17 at 02:37
Your explanation is pretty close to what is actually happening. In short, if no script is specified, XeTeX will try to guess the script from the text, and it uses a pretty simple method that basically takes the script of the first character (similar to what hb-view does). Older versions of XeTeX (IIRC) just hard-coded Latin as the default if no script was specified. XeTeX does not do any of the clever script segmentation that other modern applications typically do. Modern applications also do font fallback, so you always get a glyph for the R even if the font lacks it. — خالد حسني, Jul 17 '17 at 10:40
@KhaledHosny Thanks for confirming! I wasn't sure whether the “guess the script using the first character” was being done by hb-view and by XeTeX, or by HarfBuzz itself. As you mentioned font fallback: do you know if it would be possible to make XeTeX support it? See this question. — ShreevatsaR, Jul 17 '17 at 16:47
I’m not involved with XeTeX development anymore so I don’t know much about its ongoing development. — خالد حسني, Jul 17 '17 at 18:42
