19

I typeset my work with pdflatex and UTF-8 input files with lots of Unicode characters. Most of them work with a simple \usepackage[utf8]{inputenc} in my preamble, and for the others I simply maintain a long list of \DeclareUnicodeCharacter.

However, I often paste into my LaTeX files text that contains combining accents (a result of Mac OS X’s copy/paste mechanism). I have to normalize that text, because combining accents don’t work out of the box. How can I make them work once and for all, e.g. by adding suitable definitions near my big list of Unicode characters?


Minimal self-contained example:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
Élève
Élève
\end{document}

where the text with accents is:

U+00C9  É  LATIN CAPITAL LETTER E WITH ACUTE
U+006C  l  LATIN SMALL LETTER L
U+00E8  è  LATIN SMALL LETTER E WITH GRAVE
U+0076  v  LATIN SMALL LETTER V
U+0065  e  LATIN SMALL LETTER E
U+000A     NEWLINE
U+0045  E  LATIN CAPITAL LETTER E
U+0301  ́  COMBINING ACUTE ACCENT
U+006C  l  LATIN SMALL LETTER L
U+0065  e  LATIN SMALL LETTER E
U+0300  ̀  COMBINING GRAVE ACCENT
U+0076  v  LATIN SMALL LETTER V
U+0065  e  LATIN SMALL LETTER E
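Incidentally, a codepoint listing like the one above can be produced with a few lines of Python using the standard unicodedata module (a quick sketch of my own; the display column is approximate for unprintable characters):

```python
import unicodedata

def dump(text):
    # Print each character's codepoint, the character itself, and its
    # Unicode name, in the style of the listing above.
    for ch in text:
        name = unicodedata.name(ch, "NEWLINE")  # "\n" has no Unicode name
        disp = ch if ch.isprintable() else " "
        print(f"U+{ord(ch):04X}  {disp}  {name}")

dump("\u00c9l\u00e8ve\nE\u0301le\u0300ve")
```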
F'x
  • See package newunicode – Marco Daniel Oct 27 '12 at 19:58
  • @MarcoDaniel (it's newunicodechar) I think it's only a convenience macro over \DeclareUnicodeCharacter, which can take one codepoint, but not a series of two codepoints. I tried, and it complains: Package newunicodechar Error: Invalid argument. – F'x Oct 27 '12 at 20:01
  • Indeed. I finished my comment too early ;-) – Marco Daniel Oct 27 '12 at 20:02
  • I think the answer is "No" (But the system wouldn't let me give an answer that short:-) – David Carlisle Oct 27 '12 at 20:11
  • @DavidCarlisle I actually had one idea: is it possible to define a macro which would peek at the previous character in the input stream? Unicode combining accents follow the letters, so I have E + combining grave accent. If the order were reversed, it would be trivial (just make the combining grave be \'). – F'x Oct 27 '12 at 20:13
  • yes but you can't go back, you can in simple cases write a macro that parses the entire text stream re-ordering tokens when it sees a combining character, but it would be very fragile and likely break most other package commands. If your accented letters are single characters in Unicode form NFC then normalising the input before passing to TeX will be a lot more robust. – David Carlisle Oct 27 '12 at 20:17
  • Why don't you use XeLaTeX or LuaLaTeX? – Martin Schröder Oct 27 '12 at 20:51
  • @MartinSchröder same reasons I don't use MS Word: 1. my existing procedures are in place and mostly satisfactory, 2. my two attempts at migration (a year or two ago) resulted in a lot of frustration. Plus an added third: my colleagues all use latex, having them move to something newer has a cost. – F'x Oct 27 '12 at 20:58

3 Answers

9

If you are prepared to use an external tool, then a Perl script will standardise this type of encoding for you.

You can find recipes at the Perl Unicode Cookbook by Tom Christiansen.

Combine the standard preamble (recipe R0) with recipe R1, "Generic Unicode-savvy filter" (you can remove the ... } continue { part of the code). Put this in a file normalise.pl, give it execute permissions via chmod +x normalise.pl, and use it as normalise.pl file.tex >out.tex.

I would post such a script normalise.pl here, but my understanding of the license on the cookbook is that this is not allowed.

Andrew Swann
5

In short, the answer to the question is "No".

TeX does not let you go back in the token stream: if you have cafe in the horizontal list and then detect a Unicode combining acute, it is too late; you cannot remove the e and replace it with an accented character. You can in simple cases write a macro that parses the entire text stream, re-ordering tokens when it sees a combining character, but it would be very fragile and would likely break most other package commands.

If your accented letters are single characters in Unicode form NFC then normalising the input before passing to TeX will be a lot more robust.
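As a quick check of whether a given file is already in NFC before passing it to TeX (just an illustration in Python; any language with a Unicode library will do):

```python
import unicodedata

def needs_normalising(text):
    # True if NFC normalisation would change the text, i.e. it still
    # contains decomposed base + combining sequences.
    return unicodedata.normalize("NFC", text) != text

print(needs_normalising("\u00c9l\u00e8ve"))    # False: already precomposed
print(needs_normalising("E\u0301le\u0300ve"))  # True: combining accents present
```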

If you use a Unicode-aware TeX engine such as LuaTeX or XeTeX, then the accent combination is handled below TeX's token handling, in the font system, in a similar way to how ligatures such as ff are handled without the TeX macro layer having any control over them.

David Carlisle
1

LuaTeX and XeTeX can both handle this, if you are able to use them. In a few cases (involving hyphenation of ancient Greek), I have needed to use \usepackage{inputnormalization} to automatically convert the TeX source into a canonical form that my entire toolchain understands. (This isn’t supposed to be necessary, according to the Unicode standard, but not all fonts or hyphenation patterns conform to the standard.) Unfortunately, the package only works in LuaTeX or XeTeX.

What you usually want to do is convert your source file to precomposed (NFC) form. I wrote a little program to do this a while ago.

If you must use a legacy 8-bit engine and characters that have no precomposed form in Unicode, you will need to replace the combining characters with TeX commands, for example, \d{C}.
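That replacement step can be sketched like this (the mapping table and helper name are mine, and it assumes each combining mark immediately follows a single base letter):

```python
import unicodedata

# Map combining marks to the corresponding TeX accent commands;
# extend this table as your documents require.
TEX_ACCENTS = {
    "\u0300": "\\`",   # combining grave
    "\u0301": "\\'",   # combining acute
    "\u0323": "\\d",   # combining dot below
}

def combining_to_tex(text):
    # First fold everything that *can* be precomposed into NFC, then
    # rewrite any leftover base + combining pairs as TeX accent macros.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in TEX_ACCENTS:
            base = out.pop()  # the mark follows its base letter
            out.append(TEX_ACCENTS[ch] + "{" + base + "}")
        else:
            out.append(ch)
    return "".join(out)

print(combining_to_tex("C\u0323"))            # no precomposed form -> \d{C}
print(combining_to_tex("E\u0301le\u0300ve"))  # fully precomposed -> Élève
```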

Davislor