
When you open a UTF-8 file in almost any text editor, you see text in which every character is rendered properly, independently of the characters around it. That is what the Unicode table exists for, and fonts provide the glyphs. Why is LaTeX so much more complicated than that?

I have seen many ways to print multilingual texts: some work with pdflatex, others with xelatex or luatex... then there are many packages for font encoding, the various babels, polyglossia and others... And YET, I still need to tell LaTeX, whenever I want to use another language, "Hey, the next character is a Russian character", even though the very encoding of the file (UTF-8) says exactly that!

Why isn't there a simple solution, like one package for defining the encoding (for example UTF-8 rather than UTF-16) and one for providing the fonts? I do NOT intend to write funny combinations like 'Zh' when I want Ж. If I want Ж, I will write Ж, just like in any text editor. Without these funny tricks, interpretation is pretty much trivial: it just follows the Unicode standard.

donaastor
  • Welcome to TeX.SX! Actually, with a modern TeX installation it is exactly the way you describe. The only restriction is the font. The default font LaTeX uses has a very limited range of glyphs. But as soon as you load a font that covers a larger range of characters, you are free to type in directly whatever character this font provides. Consider the following code (to be compiled with XeLaTeX or LuaLaTeX): \documentclass{article} \usepackage{fontspec} \setmainfont{Arial} \begin{document} Hello Привет \end{document}. – Jasper Habicht Nov 04 '22 at 20:52
  • @JasperHabicht Ok, you're right. It worked just like you said. I can't believe I couldn't find that solution earlier. Thank you! – donaastor Nov 04 '22 at 21:06
  • If you don't have (or don't want to use) a font that has all the characters then you can also set fallback fonts or define a different font for specific character ranges, which allows you to type any combination of characters once everything is set up. See for example https://tex.stackexchange.com/questions/613999/switching-between-three-scripts-using-ucharclasses-does-not-seem-to-work/614081 or https://tex.stackexchange.com/questions/619555/how-correctly-use-luaotfload-fallback for some approaches. – Marijn Nov 04 '22 at 21:30
  • Note that LaTeX language switches do a lot more than just render the characters; even for languages that use the same script, such as French and English, you need to set up hyphenation and other details – David Carlisle Nov 04 '22 at 21:37
  • Side note: the fact that the LaTeX tutorial recommends writing things like \Zh etc. simply reflects the preference of the LaTeX maintainers/tutorial writers rather than being "complicated", see https://tex.stackexchange.com/a/618018/250119 and https://tex.stackexchange.com/a/87268/250119. There are ways to make it work by entering Unicode characters directly (see the comments above). – user202729 Nov 05 '22 at 09:22

1 Answer


There are reasons for LaTeX not allowing this by default, the main ones being that pdflatex only uses 8-bit fonts and that even OpenType/TrueType fonts used in XeLaTeX or LuaLaTeX don't necessarily cover all of Unicode.

For instance, the default Latin Modern fonts in (Xe|Lua)LaTeX cover neither Cyrillic nor Greek.
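As a comparison, here is a minimal sketch of the Unicode-engine route mentioned in the comments; it assumes a font with Cyrillic coverage is installed (DejaVu Serif is just an example, any such font will do):

\documentclass{article}
\usepackage{fontspec}      % compile with XeLaTeX or LuaLaTeX
\setmainfont{DejaVu Serif} % assumption: any installed font covering Cyrillic
\begin{document}
Ж is not Zh, and Привет needs no markup at all.
\end{document}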

This said, you can enable inputting Cyrillic characters in pdflatex by setting a default encoding for the letters. Don't blame me if you get awkward hyphenation.

\documentclass{article}
\usepackage[T2A,T1]{fontenc}

\ExplSyntaxOn
\tl_gclear:N \g_tmpa_tl
\group_begin:
% disable a few commands
\newcommand{\xxxignoredtc}{}
\newcommand{\ignoredtc}[1][0]{\renewcommand{\xxxignoredtc}[#1]}
\renewcommand\DeclareTextCommand[2]{\ignoredtc}
\cs_set_eq:NN \DeclareTextAccent \use_none:nnn
\cs_set_eq:NN \DeclareTextComposite \use_none:nnnn
\cs_set:Npn \DeclareTextSymbol #1 #2 #3
 {
  \str_if_eq:eeT { \__donaastor_three:n { #1 } } { cyr }
   {
    \tl_gput_right:Nn \g_tmpa_tl { \DeclareTextSymbolDefault{#1}{T2A} }
   }
 }
\cs_set:Nn \__donaastor_three:n
 {
  \str_foldcase:e { \str_range:enn { \token_to_str:N #1 } { 2 } { 4 } }
 }
\cs_generate_variant:Nn \str_foldcase:n { e }
\cs_generate_variant:Nn \str_range:nnn { e }
\file_input:n { t2aenc.def }
\group_end:
\tl_use:N \g_tmpa_tl
\ExplSyntaxOff

\begin{document}

Ж is not Zh

The Russian word Спасибо means ``thanks''.

\end{document}

The idea is to use t2aenc.def in a different way, but some commands first have to be disabled and \DeclareTextSymbol must be converted to produce, say,

\DeclareTextSymbolDefault{\CYRA}{T2A}
\DeclareTextSymbolDefault{\cyra}{T2A}

from

\DeclareTextSymbol{\CYRA}{\LastDeclaredEncoding}{192}
\DeclareTextSymbol{\cyra}{\LastDeclaredEncoding}{224}

However, lines such as

\DeclareTextSymbol{\guillemotleft}{\LastDeclaredEncoding}{190}

should be ignored. So the code above first redefines the other commands appearing in t2aenc.def to do nothing. It then redefines \DeclareTextSymbol to examine its first argument, which is “stringified” and case folded; the characters from the second to the fourth are then compared with cyr. If the test succeeds, we add the suitable \DeclareTextSymbolDefault to a global scratch token list.
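
To see that test in isolation, here is a tiny illustrative sketch (the helper name \ShowPrefix is made up for the illustration and is not part of the code above):

\documentclass{article}
\ExplSyntaxOn
\cs_generate_variant:Nn \str_foldcase:n { e }
\cs_generate_variant:Nn \str_range:nnn { e }
% stringify the token, keep characters 2 to 4, fold case
\cs_new:Npn \ShowPrefix #1
 {
  \str_foldcase:e { \str_range:enn { \token_to_str:N #1 } { 2 } { 4 } }
 }
\ExplSyntaxOff
\begin{document}
\ShowPrefix{\CYRA} (yields ``cyr'', so the symbol is kept)

\ShowPrefix{\guillemotleft} (yields ``gui'', so the symbol is ignored)
\end{document}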

Everything is done in a group, so the meaning of the redefined commands will be restored at the end of the group; but the global scratch token list survives the end of the group and we can execute its contents.
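
If the group/global interplay seems mysterious, this purely illustrative sketch shows the same pattern on its own: the local assignment disappears at \group_end:, while the globally filled token list does not:

\documentclass{article}
\ExplSyntaxOn
\tl_gclear:N \g_tmpa_tl
\group_begin:
\tl_set:Nn \l_tmpa_tl { lost~when~the~group~ends }
\tl_gput_right:Nn \g_tmpa_tl { kept~after~the~group~ends }
\group_end:
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\tl_use:N \g_tmpa_tl % typesets: kept after the group ends
\tl_use:N \l_tmpa_tl % typesets nothing: the local assignment was undone
\ExplSyntaxOff
\end{document}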

(Screenshot of the compiled output, with the Cyrillic text typeset correctly.)

For short Cyrillic inserts this might be enough. For longer passages, load babel in the appropriate way and mark up the language change.
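
For the longer-passage case, a possible babel setup with pdflatex might look like this (only a sketch; adapt the language options to your document):

\documentclass{article}
\usepackage[T2A,T1]{fontenc}
\usepackage[russian,english]{babel} % last option = main language
\begin{document}
A short inline insert: \foreignlanguage{russian}{Спасибо}.

\begin{otherlanguage}{russian}
Привет! Спасибо! % the whole block uses Russian hyphenation and typographic rules
\end{otherlanguage}
\end{document}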

egreg