Why does inputenc abandon so quickly under "utf8 based engines"?

Question

Why do I need the extra work starting with \ifdefined to get my French guillemets right in the PDF output when using xelatex with a source that specifies T1-encoded fonts?

\documentclass[french]{article}

    \usepackage[T1]{fontenc}
    \usepackage[utf8]{inputenc}

\ifdefined\XeTeXinterchartoks
     \catcode`« \active
     \catcode`» \active
     \def«{\char19 }
     \def»{\char20 }% ça marche, même avec Babel+frenchb
\fi

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

\showboxbreadth\maxdimen
\showboxdepth\maxdimen
\showoutput

«coucou»
\end{document}

The log contains:

Package: inputenc 2015/03/17 v1.2c Input encoding file
\inpenc@prehook=\toks14
\inpenc@posthook=\toks15


Package inputenc Warning: inputenc package ignored with utf8 based engines.

But it is loaded after fontenc, and it is not forbidden to use fontenc with xelatex. Since inputenc is loaded afterwards, it should know that T1-encoded font slots are to be used. Why, then, doesn't it make these characters active and map them to the suitable \char xx slots?

There is something escaping me here...

Notice that the code sample also uses babel+frenchb, which adds automatic spacing. It does not seem to have been disturbed by my making the characters active.

To explain the issue further, consider the following input:

\documentclass{article}

    \usepackage[T1]{fontenc}
    \usepackage[utf8]{inputenc}

\begin{document}

\showboxbreadth\maxdimen
\showboxdepth\maxdimen
\showoutput

«coucou»

\end{document}

It produces, if compiled with xelatex:

...\hbox(6.63332+0.0)x345.0, glue set 290.00977fil
....\hbox(0.0+0.0)x15.0
....\T1/cmr/m/n/10 «
....\T1/cmr/m/n/10 c
....\T1/cmr/m/n/10 o
....\T1/cmr/m/n/10 u
....\T1/cmr/m/n/10 c
....\T1/cmr/m/n/10 o
....\T1/cmr/m/n/10 u
....\T1/cmr/m/n/10 »

The explanation is simple: the characters « and » have code points 171 and 187 respectively, so the glyphs sitting in those slots of the T1-encoded font are used, giving the result above. inputenc does nothing, but it could have done something akin to my code above.
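The slot mismatch can be checked directly. The following is a minimal sketch, assuming compilation with xelatex and the default T1-encoded 8-bit font: slots 19 and 20 are the ones used in the workaround above, while 171 and 187 are the code points XeTeX receives for « and ».

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% T1 slots 19 and 20 hold the guillemets (as in the \char19/\char20 workaround)
\char19 coucou\char20

% Slots 171 and 187: what a Unicode engine effectively asks the 8-bit font
% for when it meets « and » in the source; these are not guillemets in T1
\char171 coucou\char187
\end{document}
```

Comparing the two output lines shows why the unpatched document prints the wrong glyphs around "coucou".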

Note that if you use default texlive settings, hyphenation patterns for T1 are not loaded into xetex or xelatex, so hyphenation will be incorrect with T1 fonts. Why use newtxtext with xetex rather than Times? — David Carlisle, Dec 11 '15 at 18:30
@DavidCarlisle No specific reason except from where the situation came from. Good to know about the missing hyphenation patterns, didn't know about that. Does this mean then that xelatex should only be used with Unicode fonts ? — , Dec 11 '15 at 18:33
@jfbu simple answer yes, longer answer it would be possible to load T1 hyphenation patterns but it complicates things (and there is no "out of the box" setup for that) and doing it is very low priority as almost all relevant fonts are available as opentype fonts by now. — David Carlisle, Dec 11 '15 at 18:34
Thanks David for all the explanations. I was not aware that xelatex had such grave lacunae in support of classic 256 slots fonts. So far, I had not seen any authoritative advice : "use xelatex only with opentype fonts". (and what about lualatex then ?) — , Dec 11 '15 at 18:54
I think you are looking for the xetex equivalent of luainputenc. The xetex-inputenc package referenced there unfortunately only does anything useful with 8-bit input. — Robert, Dec 11 '15 at 20:17
@jfbu same applies to luatex, although luatex can load patterns into the document so (in principle) it doesn't require a special format in that case. — David Carlisle, Dec 11 '15 at 20:38
@Robert yes indeed, this appears to be exactly that. I think luainputenc addresses (with quite some years of anticipation...) precisely the issue I was raising. — , Dec 11 '15 at 20:52
PS. 7 years ago, there was a lengthy discussion about inputenc's future with UTF-aware engines here. — Robert, Dec 11 '15 at 20:52
I have yet to see any use case for using newtxtext (which is a Times clone) in xetex. The only reason given so far is font compression, but that's weird: font compression is done by the dvi driver, not by TeX. Do you have an example where xetex+xdvipdfmx makes pdf files significantly smaller than tex+dvipdfmx? — David Carlisle, Dec 12 '15 at 12:44
@DavidCarlisle no I don't, but the reason for xelatex was for easier png graphics inclusion. The latex source is not created directly by me, (I explain in a comment to Ulrike's answer) and I didn't want to engage into figuring out the bounding boxes and how to let the Sphinx ReST to latex converter incorporate it. Thus I switched to xelatex (I needed some work to obtain from Sphinx a xelatex compatible preamble). Circa 484Ko vs 620Ko. With Libertine the ratio was more like 1 to 2. The comparison is with pdflatex not with latex+dvipdfmx. — , Dec 12 '15 at 13:08
@DavidCarlisle the advantage of 8bit legacy fonts for me is that I want to compile on various platforms to identical result. If I started using for example Menlo font on my Mac, I would not have it on my linux box at the office. — , Dec 12 '15 at 13:11
You cannot ensure, that the result on various platforms is the same. If you want that, you need to ensure the exact same binaries and package versions. Not to mention the exact same font. — Johannes_B, Dec 12 '15 at 13:17
@Johannes_B I hope the differences will be minute if I am using TL2015 on both; but more important for me than to the Å identity is the possibility to compile; again, I could use very pleasing opentype fonts on my Mac system, but then I would not have them at other locations (nor do I have the possibility to install them). — , Dec 12 '15 at 13:21
@DavidCarlisle as per the graphics inclusion, thanks to your graphicx I know I don't have to change the source, it is enough to have image.bb files available. As I am doing this on a conclusion of a project, I could indeed create and add these files to the suitable location, and then the latex + dvipdfmx road could be followed. But sphinx.sty has a bug and always passes pdftex as driver to graphicx. I would have to patch that. See https://github.com/sphinx-doc/sphinx/issues/2164 — , Dec 12 '15 at 13:25
@jfbu If it is just about compiling getting a similar output ... I usually have a simple conditional in my larger documents, running pdflatex for a quick look, and lualatex for the high-end version. This (and maybe checking for the existence of a file) seems to be much more simple than to fiddle with encodings. — Johannes_B, Dec 12 '15 at 13:47
@Johannes_B Dear Johannes, my problem was already entirely fixed prior to me asking here with my char 19 /20, as I only had a specific issue with « and ». I came here out of curiosity to learn why inputenc did nothing. Strong arguments to justify it, I am learning thanks to the comments. But they don't totally explain so far why inputenc decided not to do what luainputenc (seems to, I have been so busy harassing folks here and also I am doing other things that I haven't yet examined closely) does. — , Dec 12 '15 at 13:52
inputenc maps non-ascii to ascii commands. XeTeX can handle unicode, so there really is no need to map to ascii. I am sorry, i have read every comment to all answers, but i still don't know what you really mean. — Johannes_B, Dec 12 '15 at 13:55
@DavidCarlisle peripheral to this, compiling this (which sort of arose on the site) \documentclass{article}\usepackage{amsmath}\begin{document} \[\max_{(s'\to s)\Rightarrow u_{i}=1}\bigl(\tilde{\delta}_{i}(s',s)\bigr)\] \end{document} gives: 64011 bytes with pdflatex, 63551 bytes with lualatex and 8390 bytes with xelatex. (figures vary) — , Dec 15 '15 at 18:42
@jfbu and 8384 bytes with latex + dvipdfmx the font compression is a feature of the dvi driver not of xetex, so you can get the same with latex. — David Carlisle, Dec 15 '15 at 20:05
@DavidCarlisle yes, I had forgotten to check latex+dvipdfmx. Only some months ago did I become aware that xelatex was as good at font compression as latex+dvipdfmx, which explains why I am in a phase currently of experimentation with xelatex. (which simplifies a bit the graphics aspects and naturally has the great features of handling system fonts). — , Dec 15 '15 at 20:36
@jfbu one of your comments was about a Menlo font on a mac. Can't you simply copy it to your Linux Box into some place xetex (or fontspec) will find it ? — , Nov 12 '16 at 22:48
@jfbu well thanks for the suggestion. In fact I did that indeed. In the meantime I had switched to using lualatex for that document. I put the font in a texmflocal repertory (on my private install of tl2016). Did not think about trying with xetex (I had forgotten I was using xetex then). Anyway I guess accessing via filename Menlo.ttc would work fine. Thanks again for the nice suggestion. — , Nov 12 '16 at 22:51

score 26 · Answer 1 · edited Apr 13 '17 at 12:35

inputenc is abandoned because it does absolutely nothing useful with XeLaTeX or LuaLaTeX. More precisely, if it did act, it would do harm.

See fontenc vs inputenc

Essentially, the task performed by inputenc is to translate input characters into their LICR (LaTeX internal character representation) form. With an 8-bit engine, « is two bytes long in UTF-8, and inputenc is able to translate that pair of bytes into \guillemotleft, and likewise » into \guillemotright. To do so it must make some characters active, which is exactly what you do yourself later on. inputenc is not set up to do this for a Unicode engine, because it was designed for 8-bit engines.
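The two layers involved can be sketched as follows for an 8-bit engine (pdflatex). The declarations below merely restate, as far as I understand them, what the standard utf8 and T1 support files already provide, so they are redundant in practice and appear only to make the division of labour visible.

```latex
% Compile with pdflatex (8-bit engine)
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

% Input layer: Unicode code point -> encoding-independent command (LICR)
\DeclareUnicodeCharacter{00AB}{\guillemotleft}
\DeclareUnicodeCharacter{00BB}{\guillemotright}

% Font layer: LICR command -> slot in one particular font encoding
\DeclareTextSymbol{\guillemotleft}{T1}{19}
\DeclareTextSymbol{\guillemotright}{T1}{20}

\begin{document}
«coucou»
\end{document}
```

The input layer never mentions a slot number, and the font layer never mentions a byte sequence; only the LICR command connects them.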

I added a friendlier interface with newunicodechar.

\documentclass[french]{article}

\usepackage[T1]{fontenc}
\usepackage{newunicodechar}

\newunicodechar{«}{\guillemotleft}
\newunicodechar{»}{\guillemotright}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}
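If the same source should also compile with pdflatex, the two cases can be separated with an engine test. This is only a sketch, assuming the iftex package is available; the babel guillemet setup from the example above is left out to keep it short.

```latex
\documentclass[french]{article}
\usepackage{iftex}
\usepackage[T1]{fontenc}

\ifPDFTeX
  % 8-bit engine: inputenc decodes the UTF-8 byte sequences
  \usepackage[utf8]{inputenc}
\else
  % Unicode engine (xelatex, lualatex): map the two characters by hand
  \usepackage{newunicodechar}
  \newunicodechar{«}{\guillemotleft}
  \newunicodechar{»}{\guillemotright}
\fi

\usepackage{babel}

\begin{document}
«coucou»
\end{document}
```

Only the two guillemets are handled here; any other non-ASCII character used with an 8-bit font on a Unicode engine would need its own \newunicodechar line.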

If your aim is to provide translations for the characters in t1enc.dfu, then you can use it in a different way.

\documentclass[french]{article}

\usepackage[T1]{fontenc}
\usepackage{newunicodechar}

\newcommand\DeclareUnicodeCharacter[2]{%
  \expandafter\newunicodechar\Uchar"#1{#2}%
}
\input{t1enc.dfu}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}

A proof of concept for a package `xeinputenc`

\ProvidesPackage{xeinputenc}[2015/12/12]
\RequirePackage{newunicodechar}

\AtBeginDocument{\xeinputenc@process}

\newcommand{\xeinputenc@process}{%
  \begingroup
  \gdef\xeinputenc@list{}%
  \def\cdp@elt##1##2##3##4{%
    \g@addto@macro\xeinputenc@list{\lowercase{\xeinputenc@input{##1}}}%
  }%
  \cdp@list
  \aftergroup\xeinputenc@list
  \endgroup
}

\newcommand{\DeclareUnicodeCharacter}[2]{%
  \expandafter\newunicodechar\Uchar"#1{#2}%
}

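\newcommand{\xeinputenc@input}[1]{%
  \InputIfFileExists
    {#1enc.dfu}
    {\wlog{... processing UTF-8 mapping file for font encoding #1}\catcode`\ 9\relax}%
    {\wlog{... no UTF-8 mapping file for font encoding #1}}%
  % restore the space catcode, which would otherwise stay "ignored"
  % for the rest of the document (this runs outside any group)
  \catcode`\ 10\relax
}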


\@onlypreamble\DeclareUnicodeCharacter
\@onlypreamble\xeinputenc@list
\@onlypreamble\xeinputenc@process
\@onlypreamble\xeinputenc@input
\endinput

Now your test document can be

\documentclass[french]{article}

\usepackage{xeinputenc}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}

No explicit loading of fontenc is needed in this case, because this is already taken care of by newtxtext, but calls to it will be honored.

answered Dec 11 '15 at 18:21

egreg

I am asking why inputenc does not do anything useful when detecting the user choice of T1 encoded fonts (a detection that it does). This is my question "why does it abandon so quickly". Someone has to do the job, and inputenc rather than doing nothing at all, could do something. I am not asking inputenc to insist doing the things it does under pdflatex: I am asking why does it not do something rather than nothing. There is an opportunity for inputenc to do something useful, if one follows the recommendation of loading it after fontenc. – Dec 11 '15 at 18:39
@jfbu: I now see what you mean. Well, it wouldn't be a good idea if inputenc made a lot of chars active only because it sees T1-encoding. Your wishes are rather special (and not really recommendable) and not something for a standard package. – Ulrike Fischer Dec 11 '15 at 18:45
@UlrikeFischer yes, usually about 128 or 128+32 characters could be made active to use the appropriate slots in the 256 slots font, alongside the core 7bit ascii chars. This is what I am asking. – Dec 11 '15 at 18:49
@jfbu inputenc is not designed to do that! – egreg Dec 11 '15 at 18:51
yes, but this is exactly what i asked ? "why does inputenc so stubbornly stick to its original design, which in the case at hand had the only logical conclusion to do nothing at all!! " but the situation being so much different, the raison d'être of the initial design from other contexts having vanished, isn't it time to re-invent something... that was the whole point of my question, and after much resistance I can now applaud that you do provide a method, less naïve than the one in my OP. Thanks ! – Dec 11 '15 at 19:03
This looks promising, how would that compare to what luainputenc does ? – Dec 12 '15 at 11:27
@jfbu The documentation says that the package can be useful if “Your source is not encoded in UTF-8 and you don’t want to reencode it for some reason.” However, it can also be used with a UTF-8 encoded font, possibly with the features you're asking for, but I don't think it can be ported to XeTeX. – egreg Dec 12 '15 at 11:31
I meant, for the case of "Your document is using legacy 8-bit fonts (with fontenc),", does your approach accomplish similar things ? (well, I will look closer). My interest here is with using xelatex+8bit legacy fonts+utf8 encoded source. – Dec 12 '15 at 11:34
@jfbu I supplied a new proof-of-concept package – egreg Dec 12 '15 at 12:07

score 10 · Answer 2 · answered Dec 11 '15 at 18:24

inputenc's utf8 option is designed to take sequences of characters representing the bytes in utf8 representation as individual characters and collect them together and use the utf8 encoding to expand each such sequence into a suitable tex command for that character.

When a utf8 file is read by xetex, each character is reported as a single character token and the bytes in the utf8 encoding are not reported at all to the macro layer so the inputenc code can do nothing useful.
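The difference in what the macro layer sees can be made visible by counting tokens. The sketch below is an illustration only: \counttokens, \tokcount and \currenttoken are hypothetical names introduced here, and \@tfor is the LaTeX kernel's token loop.

```latex
\documentclass{article}
\newcount\tokcount
\makeatletter
% Count the tokens of the argument without expanding them
\newcommand\counttokens[1]{%
  \tokcount=0
  \@tfor\currenttoken:=#1\do{\advance\tokcount 1 }%
  \the\tokcount
}
\makeatother
\begin{document}
\counttokens{«»}
\end{document}
```

With xelatex this should print 2, one token per character. With pdflatex the same source should print 4, because each character arrives as two separate bytes, and those bytes are what inputenc's utf8 code works on.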

David Carlisle

inputenc normally takes into account the loaded font encodings. My question is: why does its utf8 option only target Unicode encoding ? that it does so is a choice. – Dec 11 '15 at 18:29
@jfbu no, inputenc is independent of font encodings. – David Carlisle Dec 11 '15 at 18:31
@jfbu utf8 option decodes utf8 encoded input. that is a sort of a choice but what else do you expect it to do? – David Carlisle Dec 11 '15 at 18:32
As I said, inputenc does look at the loaded font encodings. As there is nothing to do under xelatex, the Unicode being treated natively, I would expect inputenc to do the appropriate work compatible with this user choice of using T1 encoded fonts. The reason being that nobody else is doing that for the poor user ! – Dec 11 '15 at 18:35
@jfbu you know I wrote parts of this code:-) inputenc.sty and all "classic" input encodings such as latin1, do not look at the font encoding, the utf8 one is the only one that does, just to make an initial guess of how much of unicode to cover, – David Carlisle Dec 11 '15 at 18:38
@jfbu inputenc works by making all characters bigger than 127 active and defining them to expand to an appropriate (font encoding independent) command. For pdftex there are only 128 of those so it's feasible, for xetex there are thousands of such characters so it is not feasible. (You can turn off xetex's unicode support and make it read a file byte by byte, but why use xetex then?) – David Carlisle Dec 11 '15 at 18:41
@jfbu: You should believe David and egreg. inputenc does not look at the font encoding. It maps e.g. ä to "a and absolutely doesn't care what fontenc does later with the "a. And it does not work with xelatex, as xelatex and pdftex look at the input differently. ä is one "token" with xelatex and two with pdftex. – Ulrike Fischer Dec 11 '15 at 18:41
No, there are not thousands of characters if you target the glyphs which happen to be present in the T1 encoding ... assuming most of them have Unicode code points: for each such codepoint I would make the character active and map it to the \char dd with dd the slot in the T1 encoding. – Dec 11 '15 at 18:43
@UlrikeFischer as David said, in the case of utf8 and the pdftex engine, inputenc *does* look at the font encodings. But with the xelatex engine it does not do anything! Isn't this a wasted opportunity? – Dec 11 '15 at 18:44
@jfbu inputenc never makes font encoding specific commands, you could make some characters active and map to LICR representation such as \'{a} but really why? – David Carlisle Dec 11 '15 at 18:45
@UlrikeFischer again, as with egreg I must repeat I am not looking at what inputenc does (which is "nothing at all") but what it could do. – Dec 11 '15 at 18:45
David, inputenc does nothing under xelatex. I am not asking it to do what it does in other contexts, which would be ill-advised. I am asking it to do something useful. As it looks at the user specified font encoding it could do something useful. But "it" does not want. !!! not my fault !!! – Dec 11 '15 at 18:47
@jfbu for what it could do look at https://github.com/davidcarlisle/dpc-inputenc-tests/ but one reason for not doing that is it is fragile, would need major changes in the hyphenation setup and really has almost no use cases as it is always simpler to use a suitable font (eg any opentype times clone rather than newtxtext) – David Carlisle Dec 11 '15 at 18:47
Thanks for the link! support for math fonts is limited, sometimes one has a good combination text+math, and xelatex has the potential to produce much smaller files than pdflatex. This is the core reason which has initiated my revolutionary ;-) question. – Dec 11 '15 at 18:51
I understand that inputenc job is to map characters to font independent commands. It does that with pdflatex+utf8 source. It then looks at the fontenc specified encodings. But with xelatex a user willing to use 8bit fonts is abandoned by inputenc. Surely, making characters active is not good, but becoming useless is not good either. LaTeX is always very keen on maintaining compatibility and works hard to be cross-engines. Legacy documents using 8bit fonts could be compiled via xelatex (which has superior font compression), but inputenc drops the user. – Dec 12 '15 at 11:25
@jfbu initially inputenc did not test for xetex so used the same code as pdftex; this scrambled the document, making it unusable. The test to allow utf8 through natively (added in 2014) allows basic use cases; that is not abandoning, it is making things work. It is simply not possible to make everything that works with pdftex work transparently with xetex and luatex. xetex users expect \input{filenames with accents} to work; how do you propose to make that work with accented characters expanding to \'{e}? Before using emotive words like "abandon" and "drop", make a full working code proposal. – David Carlisle Dec 12 '15 at 11:58
I am not a native English speaker, hence cannot guess if the words I use are too emotionally charged; "abandon" in my question title was too strong, I admit. The active characters expand depending on context already, thus \input could set the context. I appreciate all your comments. When I posted the question I was expecting answers like in your last comment, about points one can argue. The overall tone of the three answers I got was that the things are the way they are because that's what inputenc is. Little by little I (and others) benefit from more detailed argumentation by experts. – Dec 12 '15 at 12:12

Ulrike Fischer · Answer 3 · 2015-12-12T12:07:07.610

6

You are saying "it is not forbidden to use fontenc with xelatex." This is true. In fact, fontenc is normally used with xelatex, because fontspec loads it, though with the EU1 option rather than T1.

fontenc is a rather special package which can be loaded more than once. In your question you are implicitly assuming that if T1 is loaded, it is also the only, or the main, font encoding of the document. But the following is quite valid too:

\documentclass{article}
\usepackage[T1,LGR,LSF]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{fontspec} % calls \usepackage[EU1]{fontenc}
\begin{document}
abc
\end{document}

What should inputenc do here?

To expand the answer a bit: documents can load various encodings through fontenc, sometimes without the user being aware of it or even wanting it, e.g. a local class or a (math) package could do it. It is even possible that inputenc is loaded behind the user's back. It would make quite a mess if inputenc implemented some complicated heuristic to activate a number of chars -- something that xelatex users normally neither need nor want.

edited Dec 12 '15 at 12:07

answered Dec 11 '15 at 19:10


I don't know about LSF. But inputenc could see T1, make the corresponding unicode characters active, and it would see LGR and make the Greek letters Unicode codepoints active and mapped to the corresponding LGR slot. The utf8 letters ασδφγηξ... would use the suitable LGR slot. Is there some major obstacle to that ? – Dec 11 '15 at 19:15
Well the main obstacle is that the active encoding at the begin of the document is EU1 so why should the greek be mapped to LGR? Nothing says that LGR will ever be used. – Ulrike Fischer Dec 11 '15 at 19:19
I see. But it could be a convention to obey the order of the encodings. For example, the latest or earliest encoding wins, if the same utf8 letter can be mapped to various slots in distinct 8bit encoding. EU1 is special here. It seems fontenc has its arms a bit twisted here by fontspec. – Dec 11 '15 at 19:26
inputenc doesn't (and shouldn't) map to "slots". It maps to commands. It is the font encoding files which then map the commands to slots. Besides this: users are quite free to change the encoding in the document. The loading order is quite a weak indication. If you make chars active you can no longer use them in command names, and it is more difficult to write them to files, so one shouldn't do it without good reasons. – Ulrike Fischer Dec 11 '15 at 19:45
I agree that avoiding active characters is good. And that using a Unicode engine with OpenType fonts cures all related problems. But the good reasons obviously existed for [latin1]{inputenc} as \meaning é now produces macro:->\@tabacckludge ’e. My context is one of a user using xelatex with 256 slots fonts. Why would inputenc then find it insurmountable to make characters active in such contexts? The current attitude is to do nothing: but this fails to serve the user in any way. – Dec 11 '15 at 20:44
Sorry, but I'm handling a lot of documents and never had the need to use a T1 font as main font and to activate the whole T1 range in a xelatex document. Also, as David mentioned, hyphenation will probably not be correct. So why should a base package bother with a rather special case? Why don't you use pdflatex? – Ulrike Fischer Dec 11 '15 at 21:08
To obtain a smaller pdf. xelatex is more efficient in compacting the included fonts. – Dec 11 '15 at 21:11
And, I never said I had to activate the whole T1 range. I think my use case does not currently go beyond the French guillemets. – Dec 11 '15 at 21:13
Well this inputenc certainly can't know. If it should do something in the presence of T1, it would have to activate everything. Every subset must be managed by the user individually anyway. – Ulrike Fischer Dec 11 '15 at 21:22
Would you consider using 8bit dvilatex and dvipdfmx instead? – jarnosc Dec 12 '15 at 05:38
@erreka indeed, I almost always try latex+dvipdfmx, and compile with it if possible. In the case at hand, the document used png images. I did not recall immediately which steps I should have followed hence I switched to xelatex for a try; and furthermore I control only indirectly the latex source as I produce it via Sphinx from ReST files, in the context of a project whose main target had been HTML+MathJax. And I also wanted to learn how to set up my conf.py in my Sphinx project so that make latex would produce xelatex usable source. – Dec 12 '15 at 11:10
@UlrikeFischer about your earlier comment about inputenc mapping to commands. Absolutely agreed, naturally, inputenc must assign characters to commands whose behavior will then depend on the locally specified font encoding. For pdflatex + utf8, inputenc looks at the font-encodings to decide the unicode range it must treat. But for xelatex, even in case the user has indicated its will to use 8bit fonts via fontenc, inputenc does nothing. This leaves the user in the dust. – Dec 12 '15 at 11:19
@jfbu You don't seem to get it: The fact that the document loads T1 doesn't say anything. Some (math) package or other package or some local class could do it without notice of the user. Or the user could do it to use some obscure single symbol. The probability that a xelatex user who loads T1 or LGR or T2A wants all related chars active is imho 1%, and so your suggestion would leave the other 99% in the dust. Write a special package for your special need. – Ulrike Fischer Dec 12 '15 at 11:46
