Why are UTF-8 encoded characters ISO-8859-1 encoded when written to an external file?

Question

For a UTF-8 encoded file as follows:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
%
\newwrite\outtmp
\immediate\openout\outtmp=out.tmp
%
\begin{document}
\newcommand{\foo}{Résumé}
\foo
\immediate\write\outtmp{\foo}%
\end{document}

the out.tmp output file is ISO-8859-1 encoded. Why:

not UTF-8?
ISO-8859-1 and not something else?

Since you've told LaTeX that the input is utf8, it's able to map it to its internal representation, then output it in whatever encoding is applicable for output. (Why that is Latin-1 and how to change it, I couldn't say). — alexis, Dec 08 '13 at 21:43
It doesn't answer your question, but when you exchange inputenc with luainputenc and translate it with lualatex, then the out.tmp is UTF8-encoded. — knut, Dec 08 '13 at 21:57
@alexis Sorry I don't understand "whatever encoding is applicable for output". — Denis Bitouzé, Dec 08 '13 at 22:06
You are using pdfTeX for this? If so, remember that it's an 8-bit program. — Joseph Wright, Dec 08 '13 at 22:07
@knut Indeed, and it works also with XeLaTeX when replacing inputenc and fontenc by fontspec. — Denis Bitouzé, Dec 08 '13 at 22:08
@JosephWright Yes: I'm running pdflatex. I guess you meant it is not an 8-bit program, don't you? — Denis Bitouzé, Dec 08 '13 at 22:10
I meant that perhaps there are ways to change the encoding of the output. But default TeX is a 7- or 8-bit program, so it makes sense that whatever it reads, it stores and outputs in an 8-bit format (which ISO-8859-N is, but utf-8 is not). I'm speaking without knowing the real technical facts, but your odds of getting what you want would be better if you did not tell TeX it's reading utf-8-- since then it would not know how to convert it, and just might keep it intact. (But the generated document would almost certainly be incorrect). — alexis, Dec 08 '13 at 22:12
@DenisBitouzé I meant exactly what I said: pdfTeX is 8-bit, and as alexis says an internal representation ('LICR') is used to handle UTF-8 characters which are not single byte ones. Thus reading a UTF-8 file involves turning the two or more bytes that pdfTeX 'sees' into something that can be handled as a single char. — Joseph Wright, Dec 08 '13 at 22:17
@alexis I learned something: I was sure utf-8 stores and outputs in an 8-bit format. And you're right, not telling TeX it's reading utf-8 does the trick. — Denis Bitouzé, Dec 08 '13 at 22:30
Well, Joseph presumably meant that pdflatex is restricted to 8-bit characters. You could say utf-8 is an "8-bit format" in the sense that it uses all 8 bits of a byte (as opposed to ASCII, which is a 7-bit encoding because it never uses the high bit). But utf-8 needs multiple bytes per character (there's no way the thousands of characters of unicode could be mapped to the 256 8-bit values). Unicode is nominally 16-bit (though the "extended plane" goes higher), and utf-8 uses anywhere from one to four bytes (8 to 32 bits) to code one unicode character. — alexis, Dec 08 '13 at 23:20
LaTeX doesn't know anything about encodings on output to a file. It just writes whatever bytes you give it, subject to expansion of control sequences and active characters. That is the problem with inputenc for most encodings: some characters are made active to translate characters into LaTeX notation. If that were not the case, it would happily write multibyte characters byte-by-byte. — Dan, Dec 09 '13 at 02:32

egreg · Answer 1 · 2013-12-08T22:24:06.943

How does LaTeX implement UTF-8?

The Unicode character é is encoded as two bytes in UTF-8, namely <C3><A9> (I'll use this notation throughout to denote bytes, also when they are character tokens for TeX). When \usepackage[utf8]{inputenc} is loaded, the byte <C3> is made active and defined to look for the following byte, because <C3> in UTF-8 marks a two-byte character.
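As a worked check of that byte pair (a sketch of the standard UTF-8 two-byte layout, not anything specific to inputenc): é is U+00E9, and code points from U+0080 to U+07FF are stored as 110xxxxx 10xxxxxx, with the eleven x bits taken from the code point. The block assumes amsmath for align* and \text.

```latex
\begin{align*}
\text{U+00E9} &= 000\,1110\,1001_2 && \text{(written with 11 bits)}\\
              &= \underbrace{00011}_{\text{high 5 bits}}\;
                 \underbrace{101001}_{\text{low 6 bits}}\\
\text{first byte}  &= 110\,00011_2 = \mathtt{C3}_{16}\\
\text{second byte} &= 10\,101001_2 = \mathtt{A9}_{16}
\end{align*}
```

The leading bits 110 of <C3> are what announce that exactly one continuation byte follows.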

So LaTeX gathers <A9> and forms the control sequence

\csname u8:\string<C3>\string<A9>\endcsname

which is defined to expand to

\IeC {\@tabacckludge 'e}

One can see it from

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
\expandafter\show\csname u8:\string^^c3\string^^a9\endcsname
\end{document}

where ^^c3 is TeX's way to express what I denote by <C3>. On the terminal we get

> \u8:é=macro:
->\IeC {\@tabacckludge 'e}.
<recently read> \u8:é 

l.4 ...r\show\csname u8:\string?\string?\endcsname

(the é in the first line is because my terminal is set up for UTF-8).

What does `\write` do?

The operation \write takes a first argument denoting the output stream and a braced second argument, which is fully expanded when the write operation is actually performed. So we need to know what \IeC and \@tabacckludge do.

Adding \show\IeC and \makeatletter\show\@tabacckludge to the above example shows, on the terminal, first

> \IeC=macro:
->\ifx \protect \@typeset@protect \expandafter \@firstofone \else \noexpand \IeC \fi .

and then

> \@tabacckludge=macro:
#1->\expandafter \@changed@cmd \csname \string #1\endcsname \relax .

OK, we'd need also \@changed@cmd, but in essence it simply does the equivalent of \'e, since we're not in a tabbing environment.

In your case, \protect is \@typeset@protect, as it is normally; so when we do

\write\outtmp{é}

we first get

\IeC{\@tabacckludge 'e}

and, since the conditional is true, this becomes

\@firstofone{\@tabacckludge 'e}

which in turn becomes

\@tabacckludge 'e

and then

\'e

This triggers a complex chain of expansions, which eventually ends in

\char233

because of the declaration

\DeclareTextComposite{\'}{T1}{e}{233}

in t1enc.def, which has been loaded by \usepackage[T1]{fontenc}. Only now does TeX actually write something, namely byte number 233 (in decimal), that is, byte <E9>.

It's not really a coincidence that <E9> in Latin-1 is exactly é: the T1 encoding has many slots in common with Latin-1, though not all.
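One slot where the two encodings disagree is ÿ: T1 keeps it at 184 (<B8>), whereas Latin-1 has ÿ at <FF> and a cedilla at <B8>. The following is only a sketch: the declaration in the comment is quoted from memory of t1enc.def, and the expected byte assumes the expansion chain described above (newer kernels may behave differently).

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\newwrite\outtmp
\immediate\openout\outtmp=\jobname.tmp
\begin{document}
% t1enc.def declares (quoted from memory):
%   \DeclareTextComposite{\"}{T1}{y}{184}
% so, by the chain described above, this is expected to write byte <B8>,
% which a Latin-1 viewer shows as a cedilla, not as ÿ (Latin-1 <FF>).
\immediate\write\outtmp{ÿ}
\immediate\closeout\outtmp
\end{document}
```

If the written file is then read as Latin-1, the mismatch is visible at once, which is a quick way to see that the bytes are T1 slots rather than a real file encoding.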

How do we write UTF-8 with LaTeX (as opposed to (Xe|Lua)LaTeX)?

You don't want the expansion to take place:

\write\outtmp{\unexpanded{Résumé}}

or, without using \unexpanded,

\toks0={Résumé}
\write\outtmp{\the\toks0}

Example

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\newwrite\outtmp
\immediate\openout\outtmp=\jobname.tmp

\immediate\write\outtmp{Résumé}
\immediate\write\outtmp{\unexpanded{Résumé}}
\toks0={Résumé}
\immediate\write\outtmp{\the\toks0 }

\stop

Viewing the written file with less gives

R<E9>sum<E9>
Résumé
Résumé

(again because the terminal is UTF-8). Without interpretation of the bytes I get

R<E9>sum<E9>
R<C3><A9>sum<C3><A9>
R<C3><A9>sum<C3><A9>

So the first line is the wrong one, while the other two are as expected.

Hiding Résumé in a macro makes things slightly more difficult, because the macro itself must be expanded once while its contents must not be expanded any further. So

\write\outtmp{\unexpanded\expandafter{\foo}}

will do.
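Put together as a complete file, the macro case might look like the following sketch. It reuses the \foo and \outtmp names from the question; the \toks0 variant and the explicit \closeout are additions for illustration only.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\newwrite\outtmp
\immediate\openout\outtmp=\jobname.tmp
\newcommand{\foo}{Résumé}
\begin{document}
\foo % typeset as usual

% \expandafter expands \foo exactly once (to the raw bytes of "Résumé");
% \unexpanded then keeps \write from expanding the active bytes.
\immediate\write\outtmp{\unexpanded\expandafter{\foo}}

% Same idea with a token register instead of \unexpanded.
\toks0=\expandafter{\foo}
\immediate\write\outtmp{\the\toks0 }

\immediate\closeout\outtmp
\end{document}
```

Both \unexpanded and \the\toks0 deliver their tokens to \write without further expansion, so the active byte <C3> never gets the chance to start the conversion to a T1 slot.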

What else?

If you use \protected@write, then things are different: with

\protected@write\outtmp{}{Résumé}

you get written

R\IeC {\'e}sum\IeC {\'e}

because in this case \protect is not \@typeset@protect, so the false branch is followed. The complex transformation of \@tabacckludge 'e ends up as \'e for the same reason regarding \protect. This may or may not be what you want, but that token list certainly prints as “Résumé”.
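A minimal sketch of a complete file for trying this out follows; the surrounding text line is an illustrative addition. Two details matter: the @ in the command name requires \makeatletter, and \protected@write is a deferred write, so it only takes effect when the page carrying it is shipped out.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\newwrite\outtmp
\immediate\openout\outtmp=\jobname.tmp
\begin{document}
Some text, so that a page is shipped out and the deferred write is executed.
\makeatletter
% Second argument: extra setup code run before the write (none needed here).
\protected@write\outtmp{}{Résumé}
\makeatother
\end{document}
```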

This answer saved my day! \write\outtmp{\unexpanded\expandafter{\foo}} is the solution to a recent problem I had. Thank you. — Thomas F. Sturm, Apr 13 '16 at 17:07
In this case: no. But I want to migrate my documents to UTF-8 and I wonder what more pitfalls are coming (?). Now, this one is closed thanks to your answer :-) — Thomas F. Sturm, Apr 14 '16 at 09:50

score 6 · Accepted Answer · answered Dec 10 '13 at 23:19

The main point here is that the input file is not LaTeX but a mixture of LaTeX and plain TeX, or rather TeX primitives that are not supported in LaTeX files. (OK, they appear inside packages and inside the kernel, but using them means one has to understand the sometimes not properly documented limitations and the conventions to get around them.)

So TeX is writing as if it were typesetting, given a certain font encoding (not a file encoding); @egreg described that perfectly in his answer.

LaTeX goes a long way to define an internal encoding that can be translated to various font encodings or other encodings. If you wanted UTF-8 output, you would first need to switch to a "font" encoding that outputs UTF-8, i.e. one that translates things like \'e (which is LICR, the LaTeX Internal Character Representation) to the two UTF-8 bytes. That doesn't exist, but technically it would be possible to set up properly.

Using \protected@write is the LaTeX way to output text transparently in 7-bit form, so that it can be read back in regardless of the input encoding in force at that time.
