Is it possible to typeset Unicode input in multiple language scripts automatically without inserting special commands?

Question

I'm trying to create a glossary in some LaTeX variant, and I want to insert etymologies like this:

دَفْتَر • (daftar) m (plural دَفَاتِر‎ (dafātir)): From Middle Persian dptl (daftar), from Aramaic דפתרא‎ / ܕܦܬܪܐ‎, from Ancient Greek διφθέρα (diphthéra).

The raw text looks fine in my browser, in Gedit, and in my terminal. The fonts are all there. However, when I try to follow various instructions for typesetting languages in LaTeX, I run into problems. I should use XeTeX, which is OK, but then I have to run a bunch of commands like \newfontfamily and \setotherlanguage and then wrap "language-switching" commands like \textgreek and \texthebrew and \textarabic around every language transition. I can't even find a \textaramaic command; does the script have a different name?

I'm not very concerned about appearances, I just want something that prints a formatted glossary where the various words in each language are legible. Is there no "plug-and-play" method for handling multilingual Unicode text in TeX-derived typesetting languages?

Maybe the best solution would be to output HTML and print from the browser? I guess the other alternative is to use Perl to insert TeX markup around each detected language, using a bunch of regular expressions: qr/\p{Arabic}/. But that seems cumbersome...

It is possible to get an approximation to automatic direction detection, so if you have a font that covers all the scripts you need, you can avoid any markup at the language switches, provided you do not need language-specific hyphenation. — David Carlisle, Sep 06 '18 at 20:30
Although I'd have thought you wanted markup here to give more flexibility in layout choices, e.g. \newcommand\fromgreek[1]{from Ancient Greek \textgreek{#1}} would allow markup like \fromgreek{διφθέρα} in your entries. — David Carlisle, Sep 06 '18 at 20:34
How do I "get an approximation to automatic direction detection"? That is what is missing from Davislor's answer. I tried ucharclasses and it works, but although each Arabic word is typeset RTL, consecutive words are laid out LTR. This is something that Firefox and Gedit get right without any help (e.g. they don't need Unicode bidi markers to correctly render sequences of RTL words). — Metamorphic, Sep 07 '18 at 19:22
Actually, the approximation for XeTeX may not be that close; see for example https://tex.stackexchange.com/questions/403075/bidi-algorithm-and-xelatex. Browsers are of course a far more modern architecture, and using (typically) an off-the-shelf C++ bidi library fits them well. For TeX, even XeTeX, it's harder to fit into the model, and it has never been that high a priority, as you normally need to switch other things (fonts, hyphenation, ...) as well as direction. So "just" referencing a bidi algorithm would be hard to do but probably not that useful in many cases. — David Carlisle, Sep 07 '18 at 20:18

Davislor · Answer 1 · 2020-07-11T02:33:43.727

Updated Answer

As of TeX Live 2020, babel can select the correct language automatically, based on the script used in the source. This requires LuaLaTeX (onchar= does not work with XeLaTeX). Use a command such as

\babelprovide[import, onchar=ids fonts]{greek}

If you then select a \babelfont for Greek, it will also select that font with an appropriate Script= and Language= option, e.g.

\babelfont{rm}[Ligatures=Common]{CMU Serif}
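
For context, here is a minimal sketch of a complete document built around those two commands. It assumes LuaLaTeX (per the comments below, onchar= does not work under XeLaTeX). The bidi=basic option, the extra \babelprovide lines for Arabic and Hebrew, and the FreeSerif font are assumptions added for illustration; substitute any installed font that covers the scripts you need.

```latex
% Compile with lualatex; onchar= is not available under XeLaTeX.
\documentclass{article}

% bidi=basic (an assumption here) enables babel's LuaTeX bidi handling
% so that right-to-left runs are laid out without explicit markup.
\usepackage[english, bidi=basic]{babel}

% One \babelprovide per extra language; onchar=ids fonts asks babel to
% switch language and font whenever it sees characters of that script.
\babelprovide[import, onchar=ids fonts]{greek}
\babelprovide[import, onchar=ids fonts]{arabic}
\babelprovide[import, onchar=ids fonts]{hebrew}

% FreeSerif is only an example of a font covering several scripts.
\babelfont{rm}{FreeSerif}

\begin{document}
دَفْتَر (daftar): from Aramaic דפתרא, from Ancient Greek διφθέρα (diphthéra).
\end{document}
```

If one font does not cover every script, a per-language font can be given with the optional first argument, for example \babelfont[greek]{rm}{CMU Serif}.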

Unfortunately, ucharclasses is broken in current versions of XeLaTeX and never worked with other engines at all.

Original Answer

The package ucharclasses gives a way to do something like that. If you set it up with commands such as

\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchUppercase, Ligatures=TeX}
\newfontfamily{\defaultfont}{Latin Modern Roman}[Scale=1.0, Ligatures={Common, TeX}]
\newfontfamily{\malayalamfont}{Arial Unicode MS}
\usepackage[Malayalam]{ucharclasses}
\setDefaultTransitions{\defaultfont}{}
\setTransitionTo{Malayalam}{\malayalamfont}

you can then simply start typing in Malayalam. (If your font contains OpenType Script and Language support, you probably want to turn that on as well.) See the ucharclasses documentation for details.

This can’t work in absolutely every case because the same Unicode codepoints might mean something different in different languages, such as Bulgarian and Russian or Japanese and traditional Chinese. You don’t get the same hyphenation support you would from Polyglossia or Babel, and so on. I personally prefer the semantic markup. But, you can come as close as most documents will need.
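
For comparison, the semantic-markup approach could look like the following minimal sketch, which builds on the \fromgreek macro David Carlisle suggests in the comments on the question. It assumes polyglossia under XeLaTeX or LuaLaTeX; the variant=ancient option and the FreeSerif font are assumptions for illustration, not part of the original suggestion.

```latex
% Compile with xelatex or lualatex.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{english}
\setotherlanguage[variant=ancient]{greek}

% polyglossia uses \greekfont for text marked as Greek;
% FreeSerif is only an example font with Greek coverage.
\newfontfamily\greekfont{FreeSerif}[Script=Greek]

% Macro from the comment thread: one command per source language keeps
% the wording and the language switch in a single place.
\newcommand\fromgreek[1]{from Ancient Greek \textgreek{#1}}

\begin{document}
Arabic daftar, ultimately \fromgreek{διφθέρα} (diphthéra).
\end{document}
```

The markup costs one command per foreign word, but it keeps hyphenation and font selection correct and lets you change the layout of every etymology by redefining a single macro.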

As of January 2021, the new answer doesn't work for me. If I do \babelprovide[import, onchar=ids fonts]{greek} and compile with xelatex, I get Undefined control sequence complaining about \directlua. If I try with lualatex, it seems to be incompatible with the navigator package. — , Jan 18 '21 at 19:01
@BenCrowell onchar= only works with LuaLaTeX. ucharclasses should work in XeLaTeX, or you could look for an alternative to your XeTeX-only package. — Davislor, Jan 19 '21 at 02:38
@BenCrowell Although that's actually a really bad example of ucharclasses, and I ought to fix it. Instead of changing to normalfont as a default transition, you should \begin{malayalam} and \end{malayalam} as your transitions. Also, you should load the Malayalam font with the correct Script= and Language=. — Davislor, Jan 19 '21 at 02:42
Should the updated answer be edited to make clear that the new TeXLive 2020 solution only works in LuaLaTeX? (If that is the case, as I gather from the comments.) Also, @Davislor, is the ucharclasses "broken" with XeLaTeX as you say in your updated answer, or is it the solution of choice for XeLaTeX, as you suggest in the comments? (Maybe something changed between July 2020 and now?) — Alex Roberts, Feb 02 '21 at 13:42
