I notice that in a PDF file compiled from LaTeX, "fi" such as in "field" cannot be separated into "f" and "i" when copying text out of the PDF file. I wonder why, and whether this can be changed? Thanks and regards!
4 Answers
The cmap package was written to solve this problem. From the documentation:
The cmap package is intended to make the PDF files generated by pdflatex "searchable and copyable" in acrobat reader and other compliant PDF viewers.
Just put the line \usepackage{cmap} at the beginning of your preamble, and it will pre-load the needed CMap translations for the fonts used in the document, provided that a .cmap file exists for the font encoding in use.
The current version of the cmap package includes CMap files for the following LaTeX font encodings: T1, T2A, T2B, T2C, T5, OT1, OT1tt, OT6, LGR, LAE, LFE.
This works for me:
\documentclass{article}
\usepackage{cmap}
\usepackage[T1]{fontenc}
\begin{document}
final
\end{document}
Update: Ulrike Fischer says in the comments that the \pdfglyphtounicode mechanism is the better approach; see Make ligatures in Linux Libertine copyable (and searchable).
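As a rough sketch of that alternative (assuming pdflatex; glyphtounicode.tex is the mapping file shipped with pdfTeX, and \pdfgentounicode=1 switches the mechanism on):
\documentclass{article}
\usepackage[T1]{fontenc}
% Load pdfTeX's glyph-name-to-Unicode mapping and have pdfTeX embed
% ToUnicode CMaps, so ligature glyphs copy out as plain characters.
\input{glyphtounicode}
\pdfgentounicode=1
\begin{document}
iffy fig flat office baffle
\end{document}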
I don't need cmap in this case (MiKTeX 2.9, Win7, Adobe 9); I think it depends on the version of the fonts. Also, cmap doesn't work with virtual fonts, so it helps more or less only if the cm/ec fonts are the problem. In other cases pdfglyphtounicode is imho better. See http://tex.stackexchange.com/questions/4397/make-ligatures-in-linux-libertine-copyable-and-searchable – Ulrike Fischer Nov 02 '11 at 17:25
It's 2011: XeTeX and LuaTeX exist. These engines can handle Unicode input and OpenType fonts and don't need the cmap package. – Martin Schröder Nov 02 '11 at 18:03
@martin-schroder: if the original poster used XeTeX with OTF fonts, he would not ask his question. Since he asked it, he needed help with cm/ec. I do not think the answer "switch your workflow to xe/lua with different fonts" is too helpful – Boris Nov 02 '11 at 18:27
Boris: it might not help the OP, but this site (and this page) is not only for the OP but for the whole community. So I think @MartinSchröder's point is helpful to others. Perhaps the OP is at the beginning of a project and still looking for the right (2011) workflow? Who knows? – topskip Nov 02 '11 at 19:41
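For readers who want to follow Martin Schröder's route, a minimal sketch (my own illustration, assuming xelatex or lualatex and an installed OpenType font such as Linux Libertine O):
\documentclass{article}
% fontspec loads an OpenType font directly; with such fonts the
% ligatures in the PDF normally copy out as their constituent letters.
\usepackage{fontspec}
\setmainfont{Linux Libertine O}
\begin{document}
iffy fig flat office baffle
\end{document}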
@Boris: I read somewhere that you should not only add \usepackage{cmap} to the preamble but also \input glyphtounicode and \pdfgentounicode=1. But I don't know what these commands do and whether they help or change anything at all. Do you? – ClintEastwood Mar 04 '12 at 17:29
The following is taken "verbatim" from The TeXbook (Chapter 9, TeX's Roman Fonts, p. 51):
Let's begin with the rules for the normal roman font (\rm or \tenrm); plain TeX will use this font for everything unless you specify otherwise. Most of the ordinary symbols that you need are readily available and you can type them in the ordinary way: There's nothing special about
- the letters A to Z and a to z
- the digits 0 to 9
- common punctuation marks : ; ! ? ( ) [ ] ` ' - * / . , @
except that TeX recognizes certain combinations as ligatures:
- ff yields ff
- fi yields fi
- fl yields fl
- ffi yields ffi
- ffl yields ffl
- -- yields – (an en-dash)
- --- yields — (an em-dash)
- `` yields “
- '' yields ”
- !` yields ¡
- ?` yields ¿
Of course, TeX writes ligatures for most of its accents as well, as in \^o. The best way to think about ligatures is that they represent a single character in a font. As such, MS Word's "Insert Symbol" dialog is probably a good representation of this:
Note how some of the symbols occur in a single box, implying that they are "joined at the hip," so to speak, representing a single character (or ligature) in the typeset output. Additionally, this is font specific, with different fonts having different (more or fewer) ligatures.
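To see this in practice, here is a small example of my own (not from the answer) that typesets the combinations listed above; compare the ligated glyphs in the PDF with what your viewer copies out:
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Each of these input combinations is typeset as a single ligature glyph:
ff fi fl ffi ffl -- --- ``quotes'' !` ?`
\end{document}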
So do modern versions of pdfLaTeX correctly separate the displayed and the copyable text? I know PDFs can have multiple layers of information for text; can you have it so that ff displays with a ligature, but copies as separate characters? (goes to check) – Canageek Nov 02 '11 at 18:04
Yes, yes it does, though you cannot select part of ffi; you have to copy the whole thing, though it pastes into a text editor as 3 characters. – Canageek Nov 02 '11 at 18:05
@Tim First see Ulrike Fischer's comment on the main post; I think that is your PDF reader, not gedit. In my test, notepad.exe, notepad++, emacs and Firefox all took it as 3 characters, so I think it is being fed to them that way. lmodern in T1, roman, was my font. – Canageek Nov 02 '11 at 19:54
@Tim, what are you using to display the PDF? The step where the text is copied into the buffer is probably the critical one here. – vonbrand Aug 07 '15 at 14:39
The image is misleading. While some of the IPA symbols displayed originated as ligatures, each of them is still a single symbol denoting a single sound (an affricate), and cannot be split into any constituent parts without altering the meaning. It would be akin to splitting a "w" into "uu". – Emil Jeřábek Aug 29 '18 at 13:51
@EmilJeřábek: That was not the intent. Regardless, I've updated the image to something less misleading. – Werner Aug 29 '18 at 16:00
You wrote:
I notice that in a PDF file compiled from LaTeX, "fi" such as in "field" cannot be separated into "f" and "i" when copying text out of the PDF file. I wonder why, and whether this can be changed?
If it's already in the compiled PDF, there's not much you can do. Consider the following MWE:
\documentclass{standalone}
\usepackage[OT1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}
If you compile this program with pdflatex, you should get this:
[image: typeset output showing the five ligated character combinations]
Note the appearance of the five ligated characters. However, if you copy-and-paste the output (I'm using TeXLive 2011 and TeXworks as the front end as I'm writing this), you'll get:
iy g
at oce bae
I'm afraid that the data entry system for this site doesn't seem to render the various weird symbols properly, so you'll have to trust me when I say I see a pair of musical notes, a masculine and a feminine gender symbol, and some unrecognizable shapes. (Interestingly, the "fl" glyph appears to be represented by an invisible newline character, hence the characters "at" show up on the second line.)
However, if I change "OT1" to "T1" (the more-modern font encoding scheme for English-language text -- also OK for many non-English languages) in the MWE, recompile it, and copy-and-paste the output from the resulting pdf file to an ascii editor, I get:
iffy fig flat office baffle
as one would hope to get, i.e., all five ligated character combinations are now recovered correctly.
I haven't repeated this experiment with other modern font encodings, but I suspect that the problems -- i.e., the ligated glyphs being rendered incorrectly when copied from the pdf file to a plain-text file -- are specific to the OT1 font encoding method. If you come across a pdf file that was created with latex and the ligature glyphs look all funny, and if the opportunity presents itself, you may wish to ask the paper's author if he/she might be willing to recompile it with the \usepackage[T1]{fontenc} instruction in the preamble...
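For concreteness, this is the corrected version of the MWE above with the encoding swapped (the only change from the earlier listing); copying from the resulting PDF should then yield the full text:
\documentclass{standalone}
% T1 encoding: the ligature glyphs carry proper mappings, so
% copy-and-paste recovers the underlying characters.
\usepackage[T1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}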
This does not seem to help when extracting text with ghostscript, e.g., gs -sDEVICE=txtwrite [...], so the difference may be related to your PDF viewer. – tglas Jun 12 '20 at 10:43
Just for the record: none of the above worked for me, but this thread did: http://www.latex-community.org/forum/viewtopic.php?f=5&t=953
Precisely these lines:
\usepackage{microtype}
\DisableLigatures[f]{encoding=T1}
before the \usepackage[T1]{fontenc} line.
If all else fails, this can be a quick-and-dirty solution.
Take care, B
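Put together as a minimal sketch (my assumption on placement, following the lines above; note that this disables the f-ligatures outright rather than making them copyable):
\documentclass{article}
% Quick-and-dirty: switch off the f-ligatures so that "f" and "i"
% are always typeset (and copied) as separate characters.
\usepackage{microtype}
\DisableLigatures[f]{encoding=T1}
\usepackage[T1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}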
[image] fi. – Werner Nov 02 '11 at 17:06