233

I normally use \usepackage[utf8]{inputenc} in my LaTeX documents, but on this site I have seen a lot of code with \usepackage[utf8x]{inputenc}.

What are the differences between the two options?

Is one of them obsolete, and which one should I use?

doncherry
  • 54,637
PHL
  • 7,555

7 Answers

145

The simple answer is that utf8x is to be avoided if possible. It loads the ucs package, which for a long time was unmaintained (although there is now a new maintainer) and breaks various other things.

See egreg's answer to this question as well, which outlines how to get extra characters using the [utf8] option of inputenc.

Generally, however, the best way to deal with Unicode source (especially with non-latin scripts) is really XeLaTeX or LuaLaTeX.

There's an extended discussion of this here: Encoding remarks. See especially the comments by Philipp Lehman and Philipp Stephani.
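For the common case of just a handful of extra characters, the [utf8] route mentioned above boils down to a few \DeclareUnicodeCharacter lines in the preamble. A minimal sketch (U+0174 is LATIN CAPITAL LETTER W WITH CIRCUMFLEX; substitute whichever characters your document actually needs):

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% teach [utf8] inputenc one extra character instead of switching to utf8x
\DeclareUnicodeCharacter{0174}{\^W}
\begin{document}
Ŵ
\end{document}
```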

user202729
  • 7,143
Alan Munn
  • 218,180
  • 9
    In some cases utf8 breaks things too when utf8x is fine. I work only with utf8 and utf8x; I don't remember exactly why, but after running into problems with utf8, I now work with utf8x. I agree with you about XeLaTeX or LuaLaTeX, but some users still need to work with pdflatex. – Alain Matthes Mar 09 '11 at 18:39
  • usually I use LuaLaTeX, but for some stuff I have no choice but to use pdfLaTeX – PHL Mar 09 '11 at 19:58
  • The link to "encoding remarks" above seems to be broken, and I can't seem to find the right object. Does anyone know how to relocate this (apparently important) discussion? – acr Jan 18 '13 at 13:33
  • 1
    @acr I've fixed the broken link. Thanks for pointing it out. – Alan Munn Jan 18 '13 at 17:18
  • Don't forget to remove \usepackage[utf8]{inputenc} before compiling with XeLaTex (and BTW replace the fontenc package by fontspec). – Skippy le Grand Gourou Jul 04 '17 at 10:30
51

In fact, utf8 may not be as restrictive as it seems: it only loads characters that can be displayed by the font encoding.

When typing

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

the font encoding is still OT1 when inputenc is loaded, and OT1 has very few characters. By using

\usepackage[T1]{fontenc} 
\usepackage[utf8]{inputenc}

you will allow all displayable utf8 characters to be available as input.

einpoklum
  • 12,311
BSK
  • 583
  • 2
    I think I am blind but I see no difference between those two lines... Except for what seems to be a typo in "userpackage" – Andriy Drozdyuk Mar 04 '12 at 22:01
  • 25
    @drozzy, the difference is the order of the commands. – maxschlepzig Jun 24 '12 at 16:08
  • 5
    Could you provide an example of a character unavailable in first case but available in the second case? – Denis Bitouzé Sep 24 '14 at 13:01
  • 21
    This is just wrong. First, it is just not true that the order affects the available input characters. This is easy to establish. Try the string æ ç ð â. Second, characters available in the input encoding are certainly not restricted to those in the output encoding. If they were, utf8x would not support typesetting ŵ ŷ with the T1 font encoding. (T1 does not include these characters. Yet they are acceptable input with utf8x and they can be typeset to display correctly by combining the accent with the vowel.) – cfr Sep 30 '14 at 01:51
  • @cfr How can a wrong answer can get 42 upvotes? :) – Dr. Manuel Kuehner Jun 13 '18 at 20:41
  • 3
    @Dr.ManuelKuehner This must be your first day on the internet: welcome ;). – cfr Jun 13 '18 at 21:59
  • 1
    @Dr.ManuelKuehner Actually, it has 46 upvotes. It also has 4 downvotes. – cfr Jun 13 '18 at 22:00
  • @Dr.ManuelKuehner The answer is no longer +42, so I believe it is wrong once more. – Mateen Ulhaq Aug 28 '21 at 05:55
  • Remark: it's "only" half wrong. utf8 indeed loads only characters that can be represented in the font encoding, but if the font encoding is loaded later, utf8 can still handle it. See https://tex.stackexchange.com/questions/82411/problem-with-%E1%BA%BD-tilde-and-utf8#comment1567595_82412 . // Besides, "characters that can be displayed by the font encoding" includes those that can be displayed with an accent, so the second issue in cfr's comment above is not a problem. – user202729 Jun 02 '22 at 15:22
48

Don't use utf8x; with an up-to-date TeX distribution it should prove necessary only for its most obscure features (faking characters with images from the Web, for instance).

The problem with Greek, which was probably the main reason for adopting utf8x instead of utf8, has since been solved, and

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[polutonikogreek,english]{babel}

\begin{document}

This is english \textgreek{Τηις ις γρεεκ} This is english again.

\end{document}

will happily print

(image: typeset output)

The occasional missing definitions can be coped with in a simple way. If you're able to input a Unicode character, such as the Welsh letters

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

or the Latin vowels with prosodic marks

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ Āā Ēē Īī Ōō Ūū Ȳȳ

(y with breve is missing from Unicode, while a with breve is already defined by utf8 because it's a letter in Romanian), you can simply add the unknown ones to the list of known characters:

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}

% missing Welsh coverage
\newunicodechar{Ŵ}{\^W}
\newunicodechar{ŵ}{\^w}
\newunicodechar{Ŷ}{\^Y}
\newunicodechar{ŷ}{\^y}

% Latin vowels with prosodic marks
\newunicodechar{Ĕ}{\u{E}}
\newunicodechar{ĕ}{\u{e}}
\newunicodechar{Ĭ}{\u{I}}
\newunicodechar{ĭ}{\u{\i}}
\newunicodechar{Ŏ}{\u{O}}
\newunicodechar{ŏ}{\u{o}}
\newunicodechar{Ŭ}{\u{U}}
\newunicodechar{ŭ}{\u{u}}
\newunicodechar{Ā}{\=A}
\newunicodechar{ā}{\=a}
\newunicodechar{Ē}{\=E}
\newunicodechar{ē}{\=e}
\newunicodechar{Ī}{\=I}
\newunicodechar{ī}{\={\i}}
\newunicodechar{Ō}{\=O}
\newunicodechar{ō}{\=o}
\newunicodechar{Ū}{\=U}
\newunicodechar{ū}{\=u}
\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}

\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ

Āā Ēē Īī Ōō Ūū Ȳȳ

\end{document}

(image: typeset output)

Note that, for instance, the line

\newunicodechar{Ŵ}{\^W}

can be also input as

\DeclareUnicodeCharacter{0174}{\^W}

without the need for the newunicodechar package, because U+0174 is the code point of LATIN CAPITAL LETTER W WITH CIRCUMFLEX; but \newunicodechar frees you from looking up code points in the Unicode tables.


Update, April 2016

With a recent LaTeX kernel almost none of the definitions above is necessary, because T1enc.dfu has been updated and enriched. Of the accented letters in the last example, only Ȳ and ȳ still need to be defined (and they may well be included in future releases).

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}

\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}

\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ

Āā Ēē Īī Ōō Ūū Ȳȳ

\end{document}

(image: typeset output)

Update 2021

Now all those accented letters are defined in the kernel, so the following works out of the box. Note that \usepackage[T1]{fontenc} is not strictly required; loading it is better, however, because T1 contains many precomposed accented letters.

\documentclass{article}

\usepackage[T1]{fontenc}

\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ

Āā Ēē Īī Ōō Ūū Ȳȳ

\end{document}

egreg
  • 1,121,712
  • Just a note: it seems ﬁ (LATIN SMALL LIGATURE FI) is also not in [utf8]{inputenc}, only in [utf8x]{inputenc} (which in my case is good: it alerted me to ﬁ's existence - an artifact of copy-paste - in a bibtex file, which the previous utf8x document didn't notice). – sdaau Oct 16 '14 at 16:59
  • 1
    and do we have to do that by hand for the thousands of characters out there ???? – nicolas Apr 24 '16 at 16:16
  • 1
    @nicolas Happily, newer LaTeX kernels contain many more combinations than it used to do two years ago. – egreg Apr 24 '16 at 16:20
  • @egreg See my updated answer below. The code you've given is no longer required for the to bach on Welsh vowels. – cfr Apr 24 '16 at 21:00
  • @cfr Yes, the code can be reduced to a bare minimum. :-) – egreg Apr 24 '16 at 21:08
  • I stumbled by chance upon this answer: I can confirm that the remaining letters have been added. The MWE works now without any \newunicodechar. – campa Jul 10 '19 at 09:29
  • Can you please help: if I want to use $\alpha$ instead of \alpha, what do I change in the \usepackage? It is giving me an error on Overleaf when I use $\alpha$ – Fareed Abi Farraj Oct 08 '19 at 16:00
  • 1
    @FareedAF That's not really clear. Please, ask a new question on the main page. – egreg Oct 08 '19 at 16:07
  • Update now: you no longer have to define Ŷŷ. (You may want to move the newer part to the top, or just delete the old part completely, or note "for version ≥ X" -- there's the revision history.) – user202729 Sep 13 '21 at 15:32
  • @user202729 Thanks for reminding. – egreg Sep 13 '21 at 16:00
  • What is the benefit of"precomposed accented letters"? I assume better for copy and paste? – Dr. Manuel Kuehner Sep 13 '21 at 19:20
  • 1
    @Dr.ManuelKuehner Not only that, but also hyphenation, because composed accents block it. – egreg Sep 13 '21 at 19:45
22

"The simple answer is that utf8x is to be avoided if possible." Yes and No. No it's not so simple utf8x is sometimes necessary when you need to write greek or some special symbols. Yes utf8x is for a long time was unmaintained but we can use it. Try to compile the next code with utf8

\documentclass{article}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[polutonikogreek,english]{babel}

\renewcommand*{\textgreek}[1]{%
  \foreignlanguage{greek}{#1}%
}

\begin{document}

This is english
\textgreek{Τηις ις γρεεκ}
This is english again.

\end{document}

(image: typeset output)

Alain Matthes
  • 95,075
  • 23
    utf8x is not necessary for Greek. utf8 is extensible: you need an lgrenc.dfu to get the definitions for Greek, and then utf8 will work fine. lgrenc.dfu can be found on the net (e.g. in the LaTeX bug database, if I remember correctly). – Ulrike Fischer Mar 09 '11 at 18:50
  • 2
    @Ulrike: Thanks Ulrike, I didn't know 'lgrenc.dfu'! Is it recent? Josselin Noirel wrote something about "mem.sty", but I think this package has disappeared. \usepackage[charset=utf8,greek,english]{mem} – Alain Matthes Mar 09 '11 at 19:06
  • 2
    @Ulrike: Why is lgrenc.dfu not in TeX Live? And I read: Based on a babel patch by Werner Lemberg, with input from the ucs package (ucsencs.def) by Dominique Unruh and CB.enc by Apostolos Syropoulos. So there is a part from ucs! – Alain Matthes Mar 09 '11 at 19:56
  • I have no idea why it never found its way to CTAN. But it is not part of ucs; it only reused some (or a lot of) definitions from it (writing such a long list mapping input characters to commands is a lot of work). lgrenc.dfu is quite similar to standard lists like t1enc.dfu and utf8enc.def, which are part of basic latex/inputenc+utf8. – Ulrike Fischer Mar 09 '11 at 20:31
  • 1
    Useful: http://milde.users.sourceforge.net/LGR/ – Nikos Alexandris Feb 10 '12 at 10:38
  • 10
    In TeX Live 2012 one can say \usepackage[LGRx,T1]{fontenc} and then \usepackage[utf8]{inputenc} will enable direct usage of Greek characters. – egreg Jan 18 '13 at 17:30
  • 3
    And since TeX Live 2013, LGRx is not necessary any more (it should actually be removed; see lgrxenc.def not found) – egreg Sep 29 '14 at 23:14
  • With current TeX Live the result with utf8 is the same as the shown, isn't it? – Schweinebacke Oct 10 '17 at 07:08
13

I've had the experience of not being able to compile Hebrew with utf8, only with utf8x, using pdflatex in MiKTeX (e.g. 2.9). Many guides on writing Hebrew LaTeX suggest using utf8x.

This is not to contradict what the learned sages say above; it's just an example of a case in which utf8x seems to be impossible to avoid (unless someone suggests a way, like Ulrike's suggestion regarding Greek).
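By way of illustration, the kind of preamble those guides suggest looks roughly like this. This is only a sketch: whether it compiles depends on the distribution and on having pdfTeX Hebrew font support (e.g. culmus-latex or ivritex) installed, and the exact babel options vary between guides.

```latex
% pdflatex; a sketch of the utf8x-based Hebrew setup suggested by the guides
\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[hebrew,english]{babel}
\begin{document}
English text, then:
\selectlanguage{hebrew}
שלום
\selectlanguage{english}
English again.
\end{document}
```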

Note: This answer is only relevant to pdfTeX+Babel, not XeTeX+Polyglossia.

einpoklum
  • 12,311
  • 3
    Frankly, if you are working with Hebrew, you should really be using XeLaTeX and the bidi package. – Alan Munn Sep 23 '11 at 13:25
  • 1
    Well, I was talking in the past tense... although, frankly, I haven't made the switch yet. Still find it important to mention it here, as people looking up the difference between the two options will probably hit this question. – einpoklum Sep 29 '11 at 15:03
  • XeLaTeX must have been a great improvement for people writing in Hebrew. I don't think the incompatibility between babel with the hebrew option and amsthm was ever fixed (I don't write Hebrew myself, so I never bothered to check) – kahen Nov 08 '11 at 01:29
  • @kahen: It's true fundamentally, but the package support is still lacking - even more with XeLaTeX+polyglossia than with pdfLaTeX+babel. – einpoklum Dec 13 '13 at 11:03
  • As of 2021, Hebrew now works with either Babel or Polyglossia in either LuaTeX or XeTeX (I normally use Babel in LuaTeX myself), but the legacy ivritex or culmus-latex packages for pdfTeX still require utf8x. – Davislor Sep 13 '21 at 22:43
11

Update The utf8x option is no longer required.

The following now works unproblematically (except that certain characters are composed on-the-fly, but that's another matter).

\documentclass[welsh]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\begin{document}

 Ââ Êê Îî Ôô Ûû Ŵŵ Ŷŷ Ïï

\end{document}


Original answer

This answer is now deprecated and applies only for older installations. If your installation of TeX is current, you should not need utf8x any longer. utf8 should now be sufficient.

Welsh requires utf8x as the characters ŵ and ŷ are not recognised otherwise:

\documentclass[welsh]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8x]{inputenc}
\usepackage{babel}
\begin{document}

 Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

\end{document}

Obviously, XeLaTeX is an option although babel is still required as polyglossia doesn't support Welsh.
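For completeness, the XeLaTeX route just mentioned would look something like the sketch below: fontspec replaces inputenc and fontenc (as noted in the comments on the accepted answer), while babel is still loaded for Welsh. The default Latin Modern font is assumed to cover the needed glyphs.

```latex
% compile with xelatex or lualatex
\documentclass[welsh]{article}
\usepackage{fontspec}   % replaces inputenc/fontenc
\usepackage{babel}      % polyglossia has no Welsh support
\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

\end{document}
```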

cfr
  • 198,882
  • \usepackage{newunicodechar}\newunicodechar{ŵ}{\^w} (works with the utf8 option). Of course this breaks hyphenation, but it's the same with utf8x – egreg Jun 13 '14 at 07:29
  • @egreg Yes, but then you have to add it for every additional accented character you need whereas utf8x is just typing an additional x. It doesn't break hyphenation generally, though - only for cases where words containing those characters would be hyphenated? Is that right? (Those words will not be very common. ^ generally occurs on short words or words which are hyphenated anyway. gwdihŵ would be an example but the most common cases are things like .) – cfr Jun 13 '14 at 22:14
  • 2
    I wouldn't use a package that's known to be fragile and that breaks easily just in order to avoid a couple of definitions. You should make a feature request to the LaTeX team that they add the Welsh characters to utf8. – egreg Jun 13 '14 at 22:22
  • @egreg I finally figured out how to do that from David Carlisle. Hence, http://www.latex-project.org/cgi-bin/ltxbugs2html?pr=latex/4400. – cfr Sep 29 '14 at 23:06
  • While it may be annoying having to do that for all documents in Welsh, you can see from my answer that all it takes is five lines of code. – egreg Sep 29 '14 at 23:45
  • @egreg I don't claim to be entirely rational about this. However, I'm currently only typesetting small amounts of Welsh and it doesn't include any of those characters so right now I'm neither using utf8x nor declaring the additions. Though clearly this situation is fragile in itself. – cfr Sep 30 '14 at 00:19
  • I can't believe anyone who wants to use obscure characters like Σ ⇛ ⇒ would have to do this by hand. that's so wrong.. – nicolas Apr 24 '16 at 16:18
  • @nicolas I don't think I have to any more. Despite David's initial scepticism, I think ŵ, ŷ etc. are now included. For the mathematical symbols, though, I guess it is a bit different. Do you really type those directly rather than using macros? That wouldn't even occur to me, to be honest. – cfr Apr 24 '16 at 18:12
  • @cfr yes I input them directly. haskell, agda, among a few languages, perfectly support that. It is so amazing that a system whose purpose is dedicated to such specialized typing would not support it natively.... I am fuming.. – nicolas Apr 24 '16 at 19:46
  • @nicolas Why don't you use XeTeX or LuaTeX? Really, it is not that surprising that traditional TeX engines don't support this given the time at which they were written and the kind of computer resources available. – cfr Apr 24 '16 at 20:40
  • @cfr I tried and it did not work, but your suggestion made me loot at it again. apparently there are some form of initialization steps I did not perform correctly. thanks ! – nicolas Apr 25 '16 at 07:08
0

Just to point out another issue with utf8x, which I came across while digging through the documentation of the "historical" package inpmath: the kerning is "more likely" to be wrong.

%! TEX program = pdflatex
\documentclass{article}

% either...
\usepackage{ucs}
\usepackage[utf8x]{inputenc}

% or...
%\usepackage[utf8]{inputenc}

\usepackage[T1]{fontenc} % unfortunately, if the encoding is OT1 the kerning will break for both packages

\begin{document}

VÁ

V\'A

\end{document}

For the direct Unicode input (the first row), only inputenc gives the correct kerning; ucs does not. The command form (the second row) gives the correct kerning either way.

Output with ucs:

output 1

Output with inputenc:

output 2

The first row uses the Unicode character in the input directly; the second row uses the TeX command form \'A (which always works).

Normally fonts set up kerning so that the gap between V and A is smaller, but as shown above this has no effect in the ucs case.

The underlying reason is that utf8x executes some unexpandable command to process each Unicode character.
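A workaround mentioned in the comments below is to force the relevant uni-n.def file to load before the text is processed, using ucs's \PrerenderUnicode macro in the preamble. A sketch, assuming the ucs/utf8x setup from the example above:

```latex
\documentclass{article}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
% pre-scan the character so uni-0.def is loaded before the document text,
% letting the kern between V and Á take effect
\PrerenderUnicode{Á}
\begin{document}
VÁ
\end{document}
```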

user202729
  • 7,143
  • It’s not clear what you mean by “only inputenc gives the correct kerning”. – egreg Oct 01 '22 at 07:58
  • @egreg Just compile and see like that? – user202729 Oct 01 '22 at 09:43
  • I wasn't doubting; the problem is that the phrase I mentioned is very unclear. – egreg Oct 01 '22 at 09:49
  • 1
    The real problem is that uni-0.def is loaded when Á is scanned, not before processing text. What's missing is a link between output encodings and uni-n.def files. The ucs package should load at begin document the def files corresponding to the wanted output encodings (like it's done with the dfu files when using the utf8 option). One can “solve” the issue by adding something like \PrerenderUnicode{Á} in the preamble or at begin document. – egreg Nov 17 '22 at 10:12