
This is a problem I've had consistently with XeLaTeX. Here's a small example:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX,Numbers={OldStyle,Proportional}]{Minion Pro}

\begin{document}
The \textsc{nato} office 1234
\end{document}

The file compiles and displays just fine. But if I try to copy its text from Acrobat into another application, I get garbage characters in place of all the small caps, old-style numbers, and the Th ligature:

e  office 

This also means that if I search the PDF, there are no hits for "The", "NATO", or "1234". And God help me if the file gets indexed by Google.

Now if I switch to LuaLaTeX, I don't have this problem:

The \textsc{nato} office 1234

The nato office 1234

Then if I use uppercase small-caps, I can get exactly what I want:

The \mbox{\addfontfeatures{Letters=UppercaseSmallCaps}NATO} office 1234

The NATO office 1234

But try as I might with XeLaTeX, it still gives the same garbled copy text. The problem seems to be that in the resulting PDF file, LuaTeX generates "correct" CMap entries while XeTeX generates buggy ones that reference the Private Use Area:

LuaTeX                                        XeTeX
--------------------------------------------- ---------------------------------------------

/CIDInit /ProcSet findresource begin          /CIDInit /ProcSet findresource begin
12 dict begin                                 12 dict begin
begincmap                                     begincmap
...                                           ...
1 begincodespacerange                         1 begincodespacerange
<0000> <FFFF>                                 <0000> <FFFF>
endcodespacerange                             endcodespacerange
0 beginbfrange                                5 beginbfchar
endbfrange                                    <0044> <0063>
13 beginbfchar                                <0046> <0065>
<0044> <0063>                                 <0050> <006F>
<0046> <0065>                                 <0107> <E062>
<0050> <006F>                                 <010F> <FB03>
<0107> <00540068>                             endbfchar
<010F> <FB03>                                 1 beginbfrange
<015D> <0031>                                 <015D> <0160> <F731>
<015E> <0032>                                 endbfrange
<015F> <0033>                                 4 beginbfchar
<0160> <0034>                                 <05C9> <E000>
<05C9> <0041>                                 <05D6> <E044>
<05D6> <004E>                                 <05D7> <E049>
<05D7> <004F>                                 <05DC> <E061>
<05DC> <0054>                                 endbfchar
endbfchar                                     endcmap
endcmap                                       CMapName currentdict /CMap defineresource pop
CMapName currentdict /CMap defineresource pop end
end                                           end
end
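The buggy entries can be spotted mechanically. Here is a minimal stdlib-only Python sketch (not part of the original workflow) that scans the text of an uncompressed PDF's ToUnicode CMaps, as produced by the pdftk uncompress step below, and reports every bfchar/bfrange entry whose target lies in the BMP Private Use Area:

```python
import re

# Basic Multilingual Plane Private Use Area
PUA_START, PUA_END = 0xE000, 0xF8FF

def find_pua_targets(cmap_text):
    """Return (source, target) pairs from bfchar/bfrange entries whose
    target code point falls in the Private Use Area.  Multi-code-point
    targets (e.g. <00540068> for "Th") are never PUA and are skipped."""
    hits = []
    # bfchar entries look like: <0107> <E062>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>", block):
            if PUA_START <= int(dst, 16) <= PUA_END:
                hits.append((src, dst))
    # bfrange entries look like: <015D> <0160> <F731>
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(
                r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>", block):
            if PUA_START <= int(dst, 16) <= PUA_END:
                hits.append((lo + ".." + hi, dst))
    return hits
```

Run against the XeTeX column above, this flags <0107> (the Th ligature mapped to E062), the four small-caps entries, and the oldstyle-figure bfrange, while leaving the correctly mapped lowercase letters alone.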

Indeed, if I just "transplant" all these CMap entries from LuaTeX's PDF into XeTeX's PDF, the problem goes away! So:

  1. Is there a way to make XeTeX emulate the way LuaTeX makes CMap entries?
  2. If not, is there a way to edit the font to remove any references to the Private Use Area which may "tempt" XeTeX to make these incorrect CMap entries?
  3. If none of that is possible, is there a way to edit the CMap entries in PDFs made by XeTeX, automatically and without needing a LuaTeX run?

EDIT: Since Jörg asked, here's how I did the "CMap transplant":

  1. Make a file test.tex with the following contents:

    \documentclass{article}
    \usepackage{fontspec}
    \setmainfont[Ligatures=TeX,Numbers={OldStyle,Proportional}]{Minion Pro}
    \begin{document}
    The \mbox{\addfontfeatures{Letters=UppercaseSmallCaps}NATO} office 1234
    \end{document}
    
  2. Compile it with xelatex and lualatex and uncompress each PDF with pdftk:

    xelatex test
    mv test.pdf test_x.pdf
    lualatex test
    mv test.pdf test_l.pdf
    pdftk test_x.pdf output test_xu.pdf uncompress
    pdftk test_l.pdf output test_lu.pdf uncompress
    
  3. Open each uncompressed PDF file in a text editor and look for the sections above. Delete the section from XeTeX's PDF and replace it with the equivalent section from LuaTeX's PDF (from begincodespacerange to endcmap).

  4. The PDF is corrupted now, but pdftk will fix it:

    pdftk test_xu.pdf output test_xf.pdf
    

Now test_xf.pdf will have the correct copy text. This is a neat proof of concept, but it's useless for several reasons:

  • You have to compile in both XeTeX and LuaTeX. If my files all compiled in LuaTeX, then I'd just use LuaTeX and be done with it. That's the ideal solution here anyway.

  • You can't make a "key file" that compiles in LuaTeX and then just put its CMap into every PDF you make using XeTeX, since XeTeX appears to assign random input codes (e.g., <0044>, <0046>, <05C9>) every time the file is changed. Transplanting a CMap which uses a different set of input codes results in no characters being selectable.

  • If you use more than one font, you need to fix the CMap for each one. This applies even to the same typeface in two different optical sizes.

So the only way I can see this working is if someone wrote a program with a built-in list of fonts and their problematic PUA references, which would go through every CMap section in a PDF line by line, identify the PUA references by their targets, and change the targets according to the list. But that seems like far too much work, and at a certain point you just have to migrate to LuaTeX.
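For what it's worth, the core of such a program is short. Below is a stdlib-only Python sketch; the replacement table is hypothetical and merely mirrors the handful of mappings visible in the CMap comparison above, so a real tool would need a complete table per font (and per font revision):

```python
import re

# Hypothetical per-font table: PUA target code -> correct Unicode
# target, inferred from the LuaTeX/XeTeX CMap comparison above.
MINION_PRO_PUA = {
    "E062": "00540068",  # Th ligature -> "Th"
    "F731": "0031",      # oldstyle 1  -> "1" (bfrange start, so it covers 1-4)
    "E000": "0041",      # small-cap A -> "A"
    "E044": "004E",      # small-cap N -> "N"
    "E049": "004F",      # small-cap O -> "O"
    "E061": "0054",      # small-cap T -> "T"
}

# A bfchar line "<src> <dst>" or a bfrange line "<lo> <hi> <dst>"
ENTRY = re.compile(r"^(<[0-9A-Fa-f]+>\s+(?:<[0-9A-Fa-f]+>\s+)?)<([0-9A-Fa-f]{4})>\s*$")

def fix_cmap(cmap_text, table):
    """Rewrite the *target* of each bfchar/bfrange line whose target
    appears in the table; source codes are left untouched."""
    out = []
    for line in cmap_text.splitlines():
        m = ENTRY.match(line)
        if m and m.group(2).upper() in table:
            line = m.group(1) + "<" + table[m.group(2).upper()] + ">"
        out.append(line)
    return "\n".join(out)
```

Note that this only rewrites the CMap text itself; the edited stream would still have to be spliced back into the PDF and the file repaired (e.g. with the pdftk pass from step 4 above), and it breaks as soon as a font revision changes its PUA assignments.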

Chel
  • Interesting. According to the fontspec manual (p.20) "the use of such features does not affect the underlying text", so it should give you at least lower case letters using XeLaTeX. Have you had a look at this question? http://tex.stackexchange.com/questions/18483/is-it-possible-to-provide-alternative-text-to-use-when-copying-text-from-the-pdf – Jörg Mar 26 '12 at 13:39
  • @Jörg Yes, I've thought about using accsupp but with XeTeX it would require wrapping all numbers and ligatures in a macro. The manual also suggests that it could transpose characters, which would be very bad with numbers. I'd prefer to do everything with cmap magic if possible. – Chel Mar 26 '12 at 13:54
  • I agree completely when it comes to numbers, but I think it is beyond fontspec's capability for small caps "nato" to become "NATO", and you would have to use accsupp for that. Having said that, the missing characters when using fontspec and xelatex appears to be a separate issue (one I am also interested in). – Jörg Mar 26 '12 at 13:59
  • This question has popped up now and then on the xetex mailing list, and as far as I remember no good solution (apart from using accsupp) has ever been offered. cmap magic seems to be too font dependent to work. – Ulrike Fischer Mar 26 '12 at 15:35
  • @UlrikeFischer: But why does fontspec specifically state that copying should not be an issue. I continue the quote from the manual: "the lowercase letters are still stored in the document; only the appearance has been changed by the OpenType feature. This makes it possible to search and copy text without difficulty". Do I misunderstand that paragraph completely? – Jörg Mar 26 '12 at 18:14
  • After reading through this thread on the mailing list (http://www.mail-archive.com/xetex@tug.org/msg00520.html) it appears that I might have to switch to luatex... – Jörg Mar 26 '12 at 19:13
  • I can confirm this with plain-xetex (i.e., without fontspec or any other package or anything) – morbusg Apr 03 '12 at 10:39
  • I still have this question open, which is a potential solution, though it would jack up the file size: http://tex.stackexchange.com/questions/33490/placing-the-un-ligatured-text-in-the-ocr-layer – Canageek Apr 03 '12 at 14:44
  • just a comment here: I observe the same phenomenon with pdflatex and the kpfonts fonts. – pluton May 31 '12 at 13:45
  • Another good reason to use LuaTeX instead of XeTeX :-) – raphink May 31 '12 at 13:48
  • @pluton That's odd. Have you tried loading the cmap package? Post a question (MWE) if that doesn't help. – Chel May 31 '12 at 14:30
  • I am also using the cmap package with xelatex and it is not working. I have removed all ligatures and oldstyle numbering as a workaround... :( BTW, I am using an OpenType font (Arno Pro). – Dan Jun 01 '12 at 10:20
  • well, my answer was deleted... oh well, for anyone else attempting to solve this, this link provides some extra info. http://tug.org/pipermail/xetex/2009-August/013820.html – Dan Jun 02 '12 at 13:40
  • @Dan Interesting read. I tried compiling a document with XeLaTeX and Arno Pro, a very new Adobe font, but the problem is still there. rdhs: could you please explain how you "transplant" a cmap table from luatex to xetex? – Jörg Jun 06 '12 at 20:07
  • I just realised that I gave you the answer to 2): even if a font does not have PUA encodings (Arno Pro), the cmap is still incorrect and we get incorrect mappings. Hence, Jonathan Kew (see the link from @Dan) was incorrect in assuming that removing PUA encodings would solve the issue. – Jörg Jun 06 '12 at 20:31
  • @Jörg Well that's frustrating. Are the entries coming just from XeTeX on its own then? – Chel Jun 06 '12 at 21:26
  • I really don't know, but given that new Adobe fonts don't have PUA encodings for small caps and the issue is still there means that this is not the problem, although many people on the XeTeX mailing list claim that it is. – Jörg Jun 08 '12 at 09:51
  • I can confirm the problem with TL 2012, but if I compile the example with TL 2014 the problem seems to be gone (stumbled upon the question in the context of http://tex.stackexchange.com/q/238260). Is it possible that the problem has been resolved by some changes in XeLaTeX? – peregrinustyss Apr 14 '15 at 19:05
  • @peregrinustyss TL 2016 (at least on windows) seems to show the problem again though... – Joe May 03 '17 at 19:29
  • As of TL2018, I still have this problem with OldStyle numbers. So is there a solution in sight? – LaTechneuse Jun 19 '18 at 16:54

1 Answer


A partial answer to your second question regarding PUA encodings:

You can remove PUA encodings with ttx, a tool from the fontTools library that is also bundled with the Adobe Font Development Kit for OpenType (AFDKO).
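The round trip looks roughly like this (file names are hypothetical): `ttx -t cmap MinionPro-Regular.otf` dumps the cmap table to an editable XML file, which can be filtered and then merged back with `ttx -m MinionPro-Regular.otf MinionPro-Regular.ttx`. The filtering step could be a small stdlib-only Python script; the element and attribute names below follow the ttx cmap dump format, but treat the details as assumptions rather than a tested recipe:

```python
import xml.etree.ElementTree as ET

# Basic Multilingual Plane Private Use Area
PUA_START, PUA_END = 0xE000, 0xF8FF

def strip_pua_maps(ttx_xml):
    """Drop <map> entries whose 'code' attribute falls in the BMP
    Private Use Area from a ttx XML dump of the font's cmap table."""
    root = ET.fromstring(ttx_xml)
    for subtable in list(root.iter()):
        # Collect PUA entries first, then remove them from their parent.
        pua = [c for c in list(subtable)
               if c.tag == "map"
               and PUA_START <= int(c.get("code", "0x0"), 16) <= PUA_END]
        for m in pua:
            subtable.remove(m)
    return ET.tostring(root, encoding="unicode")
```
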

However, newer fonts such as Arno Pro do not have PUA encodings for small caps at all, yet compiling a document such as yours still results in small caps that are not searchable. Hence, the notion that Jonathan Kew states here is unfortunately wrong: the fault lies not with PUA encodings, but with the faulty cmaps that XeTeX generates.

Having said that, in your question you state that you can transplant a correct cmap generated by LuaTeX into a PDF with a faulty one generated by XeTeX. Maybe a workaround would be to create a correct cmap for every possible glyph and then use that for XeTeX. That would be completely font- (and font-revision-) dependent, of course, but it should work for your own purposes.

Unfortunately I cannot try that as I don't know how to "transplant" a cmap. Could you please elaborate how you did that?

Edit: And I just realised that everything works perfectly with Junicode and EB Garamond (when you specify the SC font separately), i.e. something like:

\documentclass{article}

\usepackage{fontspec}
\setmainfont{Junicode}

\begin{document}

{\addfontfeature{Letters=UppercaseSmallCaps}DIE STRAßE IST ZU SCHMAL FÜR AUTOS.}

{\addfontfeatures{Numbers=OldStyle}12345}

\end{document}

compiles perfectly with searchable PDF text. Maybe I should give up on trying to understand what is going on...

Jörg