Using XeTeX for automatic transliteration of Cyrillic letters

Question

This is a follow-up to the question "Serbian Cyrillic using LuaTeX and XeTeX".

I actually often need character substitution the other way around: I write Cyrillic but want transliterated output, e.g. I type добрый but get dobryj in the resulting document. This is very handy, and I use the following mappings with pdflatex to achieve it (I'm including them so people can reuse them):

\DeclareUnicodeCharacter{1040}{A}
\DeclareUnicodeCharacter{1041}{B}
\DeclareUnicodeCharacter{1042}{V}
\DeclareUnicodeCharacter{1043}{G}
\DeclareUnicodeCharacter{1044}{D}
\DeclareUnicodeCharacter{1045}{E}
\DeclareUnicodeCharacter{1046}{Ž}
\DeclareUnicodeCharacter{1047}{Z}
\DeclareUnicodeCharacter{1049}{J}
\DeclareUnicodeCharacter{1050}{K}
\DeclareUnicodeCharacter{1051}{L}
\DeclareUnicodeCharacter{1052}{M}
\DeclareUnicodeCharacter{1053}{N}
\DeclareUnicodeCharacter{1054}{O}
\DeclareUnicodeCharacter{1055}{P}
\DeclareUnicodeCharacter{1056}{R}
\DeclareUnicodeCharacter{1057}{S}
\DeclareUnicodeCharacter{1058}{T}
\DeclareUnicodeCharacter{1059}{U}
\DeclareUnicodeCharacter{1060}{F}
\DeclareUnicodeCharacter{1062}{C}
\DeclareUnicodeCharacter{1063}{Č}
\DeclareUnicodeCharacter{1064}{Š}
\DeclareUnicodeCharacter{1069}{Ė}
\DeclareUnicodeCharacter{1070}{Ju}
\DeclareUnicodeCharacter{1071}{Ja}
\DeclareUnicodeCharacter{1025}{Ë}
\DeclareUnicodeCharacter{1072}{a}
\DeclareUnicodeCharacter{1073}{b}
\DeclareUnicodeCharacter{1074}{v}
\DeclareUnicodeCharacter{1075}{g}
\DeclareUnicodeCharacter{1076}{d}
\DeclareUnicodeCharacter{1077}{e}
\DeclareUnicodeCharacter{1078}{ž}
\DeclareUnicodeCharacter{1079}{z}
\DeclareUnicodeCharacter{1080}{i}
\DeclareUnicodeCharacter{1081}{j}
\DeclareUnicodeCharacter{1082}{k}
\DeclareUnicodeCharacter{1083}{l}
\DeclareUnicodeCharacter{1084}{m}
\DeclareUnicodeCharacter{1085}{n}
\DeclareUnicodeCharacter{1086}{o}
\DeclareUnicodeCharacter{1087}{p}
\DeclareUnicodeCharacter{1088}{r}
\DeclareUnicodeCharacter{1089}{s}
\DeclareUnicodeCharacter{1090}{t}
\DeclareUnicodeCharacter{1091}{u}
\DeclareUnicodeCharacter{1092}{f}
\DeclareUnicodeCharacter{1094}{c}
\DeclareUnicodeCharacter{1095}{č}
\DeclareUnicodeCharacter{1096}{š}
\DeclareUnicodeCharacter{1101}{ė}
\DeclareUnicodeCharacter{1102}{ju}
\DeclareUnicodeCharacter{1103}{ja}
\DeclareUnicodeCharacter{1105}{ë}
\DeclareUnicodeCharacter{1110}{i}
\DeclareUnicodeCharacter{1030}{I}
\DeclareUnicodeCharacter{1108}{je}
\DeclareUnicodeCharacter{1028}{Je}
\DeclareUnicodeCharacter{1061}{X}
\DeclareUnicodeCharacter{1093}{x}
\DeclareUnicodeCharacter{1048}{I}
\DeclareUnicodeCharacter{1065}{ŠČ}
\DeclareUnicodeCharacter{1066}{'}
\DeclareUnicodeCharacter{1067}{Y}
\DeclareUnicodeCharacter{1068}{'}
\DeclareUnicodeCharacter{1097}{šč}
\DeclareUnicodeCharacter{1098}{'}
\DeclareUnicodeCharacter{1099}{y}
\DeclareUnicodeCharacter{1100}{'}

My question is: is there a straightforward way of reusing this very mapping in XeTeX? I assume not, and that I need to input all the UTF-8 codes, right? But maybe somebody else has already done that. Is there any repository of mapping files?
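
A note for anyone reusing the list above: the numbers look like decimal code points (1076 is U+0434, д), which is likely the convention of the ucs package (the utf8x option of inputenc). The standard utf8 option of inputenc expects hexadecimal code points instead. The following is a minimal sketch under that assumption, covering only the six letters of добрый; the hexadecimal values are conversions of the decimal ones in the list, not part of the original post.

```latex
% Minimal pdflatex sketch, assuming the standard utf8 input encoding,
% where \DeclareUnicodeCharacter takes a HEXADECIMAL code point.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\DeclareUnicodeCharacter{0434}{d} % д (decimal 1076)
\DeclareUnicodeCharacter{043E}{o} % о (decimal 1086)
\DeclareUnicodeCharacter{0431}{b} % б (decimal 1073)
\DeclareUnicodeCharacter{0440}{r} % р (decimal 1088)
\DeclareUnicodeCharacter{044B}{y} % ы (decimal 1099)
\DeclareUnicodeCharacter{0439}{j} % й (decimal 1081)

\begin{document}
добрый % each Cyrillic letter is replaced by its Latin counterpart
\end{document}
```

The same decimal-to-hexadecimal conversion gives the left-hand sides needed for a XeTeX mapping file, which is why the table itself can be reused even though the declarations cannot.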

score 12 · Accepted Answer · answered Feb 23 '12 at 22:04

The method is similar to the one used for Serbian. Prepare the following cyrillic-to-latin.map file:

; TECkit mapping for TeX input conventions <-> Unicode characters

LHSName "Cyrillic-to-Latin"
RHSName "UNICODE"

pass(Unicode)

; ligatures from Knuth's original CMR fonts
U+002D U+002D           <>  U+2013  ; -- -> en dash
U+002D U+002D U+002D    <>  U+2014  ; --- -> em dash

U+0027          <>  U+2019  ; ' -> right single quote
U+0027 U+0027   <>  U+201D  ; '' -> right double quote
U+0022           >  U+201D  ; " -> right double quote

U+0060          <>  U+2018  ; ` -> left single quote
U+0060 U+0060   <>  U+201C  ; `` -> left double quote

U+0021 U+0060   <>  U+00A1  ; !` -> inverted exclam
U+003F U+0060   <>  U+00BF  ; ?` -> inverted question

; additions supported in T1 encoding
U+002C U+002C   <>  U+201E  ; ,, -> DOUBLE LOW-9 QUOTATION MARK
U+003C U+003C   <>  U+00AB  ; << -> LEFT POINTING GUILLEMET
U+003E U+003E   <>  U+00BB  ; >> -> RIGHT POINTING GUILLEMET


U+0410 <> U+0041  ; A
U+0411 <> U+0042  ; B
U+0412 <> U+0056  ; V
U+0413 <> U+0047  ; G
U+0414 <> U+0044  ; D
U+0415 <> U+0045  ; E
U+0416 <> U+017D  ; Ž
U+0417 <> U+005A  ; Z
U+0419 <> U+004A  ; J
U+041A <> U+004B  ; K
U+041B <> U+004C  ; L
U+041C <> U+004D  ; M
U+041D <> U+004E  ; N
U+041E <> U+004F  ; O
U+041F <> U+0050  ; P
U+0420 <> U+0052  ; R
U+0421 <> U+0053  ; S
U+0422 <> U+0054  ; T
U+0423 <> U+0055  ; U
U+0424 <> U+0046  ; F
U+0426 <> U+0043  ; C
U+0427 <> U+010C  ; Č
U+0428 <> U+0160  ; Š
U+042D <> U+0116  ; Ė
U+042E <> U+004A U+0075  ; Ju
U+042F <> U+004A U+0061  ; Ja
U+0401 <> U+00CB  ; Ë
U+0430 <> U+0061  ; a
U+0431 <> U+0062  ; b
U+0432 <> U+0076  ; v
U+0433 <> U+0067  ; g
U+0434 <> U+0064  ; d
U+0435 <> U+0065  ; e
U+0436 <> U+017E  ; ž
U+0437 <> U+007A  ; z
U+0438 <> U+0069  ; i
U+0439 <> U+006A  ; j
U+043A <> U+006B  ; k
U+043B <> U+006C  ; l
U+043C <> U+006D  ; m
U+043D <> U+006E  ; n
U+043E <> U+006F  ; o
U+043F <> U+0070  ; p
U+0440 <> U+0072  ; r
U+0441 <> U+0073  ; s
U+0442 <> U+0074  ; t
U+0443 <> U+0075  ; u
U+0444 <> U+0066  ; f
U+0446 <> U+0063  ; c
U+0447 <> U+010D  ; č
U+0448 <> U+0161  ; š
U+044D <> U+0117  ; ė
U+044E <> U+006A U+0075  ; ju
U+044F <> U+006A U+0061  ; ja
U+0451 <> U+00EB  ; ë
U+0456 <> U+0069  ; i
U+0406 <> U+0049  ; I
U+0454 <> U+006A U+0065  ; je
U+0404 <> U+004A U+0065  ; Je
U+0425 <> U+0058  ; X
U+0445 <> U+0078  ; x
U+0418 <> U+0049  ; I
U+0429 <> U+0160  U+010C ; ŠČ
U+042A <> U+2019  ; '
U+042B <> U+0059  ; Y
U+042C <> U+2019  ; '
U+0449 <> U+0161  U+010D ; šč
U+044A <> U+2019  ; '
U+044B <> U+0079  ; y
U+044C <> U+2019  ; '

and run it through teckit_compile to produce the file cyrillic-to-latin.tec, which should be put in a place where XeTeX can find it. Then a document such as the following

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Linux Libertine O}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{russian}
\newfontfamily{\transrussian}[Mapping=cyrillic-to-latin]{Linux Libertine O}

\newenvironment{translitterated}
  {\transrussian\hyphenrules{nohyphenation}\ignorespaces}
  {\ignorespacesafterend}

\begin{document}

\begin{russian}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{russian}

\begin{translitterated}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит. Крупнейший по
численности населения город России и Европы (население на 1 января
2012 года — 11 629 116 человек), по этому показателю входит в
десятку крупнейших городов мира. Центр Московской городской
агломерации.
\end{translitterated}

\end{document}

will give a result similar to the following

[image: typeset output of the document above]

The nohyphenation setting in the definition of the translitterated environment is necessary because XeTeX doesn't know how to hyphenate transliterated Russian.

What about making all Cyrillic chars active expanding to their transliteration? This way hyphenation would work, wouldn't it? — Bruno Le Floch, Feb 24 '12 at 01:25
@Bruno: hyphenation patterns are certainly applied after commands have been expanded/executed. So with active chars you would need hyphenation patterns for the transliterated Russian. I'm not so sure about mappings. In this message http://tug.org/mailman/htdig/xetex/2005-November/002842.html from Jonathan it sounds as if font mappings are applied after hyphenation points have been found, so the transliterated Russian would be hyphenated like the Cyrillic, but one should do some testing to confirm this. — Ulrike Fischer, Feb 24 '12 at 08:38
@Ulrike: Of course! :-/ I'm being silly. --- It seems odd that the font mappings would be done strictly after hyphenation: the width of glyphs would then change after line breaking, and lines would not preserve their lengths. — Bruno Le Floch, Feb 24 '12 at 08:41
@Bruno: Finding hyphenation points and applying them are two different things. In the thread on the mailing list Jonathan also wrote that "the font mapping will be applied when the width of '--' is measured". I have now read the complete thread and it confirmed "... that font mappings (unlike traditional TFM-based ligatures) are completely invisible to TeX's line-breaking process." But the thread is from 2005, so things may have changed. — Ulrike Fischer, Feb 24 '12 at 08:51
@UlrikeFischer With \hyphenrules{russian} and \hyphenpenalty=-10000 I get no hyphen. — egreg, Feb 24 '12 at 10:27
@egreg: I have now tried it too and also get no hyphenation. I have sent a message to the XeTeX mailing list. Perhaps Jonathan will explain exactly how it works. — Ulrike Fischer, Feb 24 '12 at 12:26
@egreg: That’s to be expected, as Russian hyphenation patterns (obviously) have no provision for transliterated text in Latin. Since the original poster uses ISO 9 which is very close to Czech and Slovak spelling, I would use hyphenation patterns for either of these languages. The best solution would of course be to devise hyphenation patterns for this transliteration scheme, as for the moment there are no patterns I know of for transliterations of any language (except for Chinese pinyin – and even then, it doesn’t take tone marks into account). — Arthur Reutenauer, Feb 24 '12 at 22:39
@ArthurReutenauer The trick of using Czech hyphenation is interesting, but language differences could be important. — egreg, Feb 24 '12 at 22:49
@egreg: Yes, we started that discussion on the XeTeX mailing-list, but I’m not convinced the difference in the position of legitimate breakpoints is that important that it would give significantly bad results (the – admittedly many – small differences in spelling don’t matter that much since we’re only concerned with finding reasonable breakpoints). Anyway, approximate hyphenation can’t possibly be worse than the second line of transliterated text above! — Arthur Reutenauer, Feb 24 '12 at 23:11
For the record, I just had a go at it and the only two bad breaks in the paragraph seem to be “cen-tr” and “ja-nvarja”. — Arthur Reutenauer, Feb 24 '12 at 23:22
This is not so bad, because typically I need the transliteration table for different languages written in Cyrillic. Generally, couldn't one just transliterate the Russian hyphenation patterns and apply them to transliterated Russian text? — Ruprecht von Waldenfels, Mar 02 '12 at 11:12
I'd like to thank @egreg for the answer, because it helped me so much. I linked to this helpful question in my own closely related question:
http://tex.stackexchange.com/questions/285610/create-a-mapping-for-transliteration-from-cyrillic-to-latin-in-lualatex — Paolo Polesana, Jan 03 '16 at 08:46
@PaoloPolesana It seems much easier for XeTeX than for LuaTeX — egreg, Jan 03 '16 at 10:38
Yes, @egreg, it actually is; but with LuaLaTeX you can take advantage of all the features of the microtype package... — Paolo Polesana, Jan 03 '16 at 15:59
@egreg -- can you tell me what is the encoding scheme shown in the question which you have answered? (The OP hasn't been here for more than 5 years.) It's not Unicode, and it's not the T2 encoding (which is the Math. Reviews transliteration; since I did the TeX implementation for the AMS Russian-English Dictionary, I'm familiar with that). I don't believe that 1066, 1068, 1098 and 1100 should be the same; there's a real difference between hard and soft signs, which is what I suspect those are supposed to represent. — barbara beeton, May 09 '23 at 23:28
@barbarabeeton I just followed the OP’s transliteration scheme. I agree that this would require further research. — egreg, May 10 '23 at 08:03
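
The suggestion in the comments to borrow Czech hyphenation patterns for the transliterated text could be tried along the following lines. This is only a sketch: it assumes the compiled cyrillic-to-latin.tec from the answer is installed, that the Czech patterns are available to polyglossia, and that the font mapping is applied before hyphenation points are sought, which the comments suggest but do not settle. The environment name translitczech is made up for this illustration.

```latex
% Sketch only: a variant of the answer's document that uses Czech
% hyphenation rules instead of nohyphenation for transliterated text.
\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Linux Libertine O}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{russian}
\setotherlanguage{czech} % assumption: makes the Czech patterns available
\newfontfamily{\transrussian}[Mapping=cyrillic-to-latin]{Linux Libertine O}

% Hypothetical environment name; same structure as translitterated
\newenvironment{translitczech}
  {\transrussian\hyphenrules{czech}\ignorespaces}
  {\ignorespacesafterend}

\begin{document}

\begin{translitczech}
Москва — столица Российской Федерации, город федерального значения,
административный центр Центрального федерального округа и центр
Московской области, в состав которой не входит.
\end{translitczech}

\end{document}
```

Breaks found this way are approximate, since Czech and transliterated Russian differ in spelling; the comments report "cen-tr" and "ja-nvarja" as bad breaks in the sample paragraph, so the output is worth proofreading.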
