I notice that in a PDF file compiled from LaTeX, "fi" such as in "field" cannot be separated into "f" and "i" when copying text out of the PDF file. I wonder why, and whether this can be changed? Thanks and regards!
4 Answers
The cmap package was written to solve this problem. From the documentation:
The cmap package is intended to make the PDF files generated by pdflatex "searchable and copyable" in acrobat reader and other compliant PDF viewers.
Just put the line \usepackage{cmap} at the beginning of your preamble, and it will pre-load the needed CMap translations for the fonts used in the document, provided that a .cmap file exists for the font encoding in use.
The current version of the cmap package includes CMap files for the following LaTeX font encodings: T1, T2A, T2B, T2C, T5, OT1, OT1tt, OT6, LGR, LAE, LFE.
This works for me:
\documentclass{article}
\usepackage{cmap}
\usepackage[T1]{fontenc}
\begin{document}
final
\end{document}
Update: Ulrike Fischer says in the comments that the \pdfglyphtounicode mechanism is the better approach; see Make ligatures in Linux Libertine copyable (and searchable).
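As a rough sketch of that alternative (assuming pdflatex; glyphtounicode.tex is the mapping file shipped with pdfTeX, and \pdfgentounicode=1 switches the mechanism on):
\documentclass{article}
\usepackage[T1]{fontenc}
% Load pdfTeX's glyph-name-to-Unicode mapping and have pdfTeX embed
% ToUnicode CMaps, so ligature glyphs copy out as plain characters.
\input{glyphtounicode}
\pdfgentounicode=1
\begin{document}
iffy fig flat office baffle
\end{document}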
I don't need cmap in this case (MiKTeX 2.9, Win7, Adobe 9); I think it depends on the version of the fonts. Also, cmap doesn't work with virtual fonts, so it helps more or less only if the cm/ec fonts are the problem. In other cases pdfglyphtounicode is imho better. See http://tex.stackexchange.com/questions/4397/make-ligatures-in-linux-libertine-copyable-and-searchable – Ulrike Fischer Nov 02 '11 at 17:25
It's 2011: XeTeX and LuaTeX exist. These engines can handle Unicode input and OpenType fonts and don't need the cmap package. – Martin Schröder Nov 02 '11 at 18:03
@martin-schroder: if the original poster used XeTeX with OTF fonts, he would not ask his question. Since he asked it, he needed help with cm/ec. I do not think the answer "switch your workflow to xe/lua with different fonts" is too helpful – Boris Nov 02 '11 at 18:27
Boris: it might not help the OP, but this site (and this page) is not only for the OP but for the whole community. So I think @MartinSchröder's point is helpful to others. Perhaps the OP is at the beginning of a project and still looking for the right (2011) workflow? Who knows? – topskip Nov 02 '11 at 19:41
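For readers who want to follow Martin Schröder's route, a minimal sketch (my own illustration, assuming xelatex or lualatex and an installed OpenType font such as Linux Libertine O):
\documentclass{article}
% fontspec loads an OpenType font directly; with such fonts the
% ligatures in the PDF normally copy out as their constituent letters.
\usepackage{fontspec}
\setmainfont{Linux Libertine O}
\begin{document}
iffy fig flat office baffle
\end{document}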
@Boris: I read somewhere that you should not only add \usepackage{cmap} to the preamble but also \input glyphtounicode and \pdfgentounicode=1. But I don't know what these commands do and whether they help or change anything at all. Do you? – ClintEastwood Mar 04 '12 at 17:29
The following is taken "verbatim" from The TeXbook (Chapter 9, TeX's Roman Fonts, p. 51):
Let's begin with the rules for the normal roman font (\rm or \tenrm); plain TeX will use this font for everything unless you specify otherwise. Most of the ordinary symbols that you need are readily available and you can type them in the ordinary way: There's nothing special about
- the letters A to Z and a to z
- the digits 0 to 9
- common punctuation marks : ; ! ? ( ) [ ] ` ' - * / . , @
except that TeX recognizes certain combinations as ligatures:
- ff yields ff
- fi yields fi
- fl yields fl
- ffi yields ffi
- ffl yields ffl
- -- yields – (an en-dash)
- --- yields — (an em-dash)
- `` yields “
- '' yields ”
- !` yields ¡
- ?` yields ¿
Of course, TeX writes ligatures for most of its accents as well, as in \^o. The best way to think about ligatures is that they represent a single character in a font. As such, MS Word's "Insert Symbol" dialog is probably a good representation of this:
Note how some of the symbols occur in a single box, implying that they are "joined at the hip," so to speak, representing a single character (or ligature) in the typeset output. Additionally, this is font specific, with different fonts having different (more or fewer) ligatures.
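To see this in practice, here is a small example of my own (not from the answer) that typesets the combinations listed above; compare the ligated glyphs in the PDF with what your viewer copies out:
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Each of these input combinations is typeset as a single ligature glyph:
ff fi fl ffi ffl -- --- ``quotes'' !` ?`
\end{document}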
So do modern versions of pdfLaTeX correctly separate the displayed and the copyable text? I know PDFs can have multiple layers of information for text; can you have it so that ff displays with a ligature, but copies as separate characters? (goes to check) – Canageek Nov 02 '11 at 18:04
Yes, yes it does, though you cannot select part of ffi; you have to copy the whole thing, though it pastes into a text editor as 3 characters. – Canageek Nov 02 '11 at 18:05
@Tim First see Ulrike Fischer's comment on the main post; I think that is your PDF reader, not gedit. In my test, notepad.exe, notepad++, emacs and Firefox all took it as 3 characters, so I think it is being fed to them that way. lmodern in T1, roman, was my font. – Canageek Nov 02 '11 at 19:54
@Tim, what are you using to display the PDF? The step where the text is copied into the buffer is probably the critical one here. – vonbrand Aug 07 '15 at 14:39
The image is misleading. While some of the IPA symbols displayed originated as ligatures, each of them is still a single symbol denoting a single sound (an affricate), and cannot be split into any constituent parts without altering the meaning. It would be akin to splitting a "w" into "uu". – Emil Jeřábek Aug 29 '18 at 13:51
@EmilJeřábek: That was not the intent. Regardless, I've updated the image to something less misleading. – Werner Aug 29 '18 at 16:00
You wrote:
I notice that in a PDF file compiled from LaTeX, "fi" such as in "field" cannot be separated into "f" and "i" when copying text out of the PDF file. I wonder why, and whether this can be changed?
If it's already in the compiled PDF, there's not much you can do. Consider the following MWE:
\documentclass{standalone}
\usepackage[OT1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}
If you compile this program with pdflatex, you should get this:
[image: typeset output showing the five ligated character combinations]
Note the appearance of the five ligated characters. However, if you copy-and-paste the output (I'm using TeXLive 2011 and TeXworks as the front end as I'm writing this), you'll get:
iy g
at oce bae
I'm afraid that the data entry system for this site doesn't seem to render the various weird symbols properly, so you'll have to trust me when I say I see a pair of musical notes, a masculine and a feminine gender symbol, and some unrecognizable shapes. (Interestingly, the "fl" glyph appears to be represented by an invisible newline character, hence the characters "at" show up on the second line.)
However, if I change "OT1" to "T1" (the more-modern font encoding scheme for English-language text -- also OK for many non-English languages) in the MWE, recompile it, and copy-and-paste the output from the resulting pdf file to an ascii editor, I get:
iffy fig flat office baffle
as one would hope to get, i.e., all five ligated character combinations are now recovered correctly.
I haven't repeated this experiment with other modern font encodings, but I suspect that the problems -- i.e., the ligated glyphs being rendered incorrectly when copied from the pdf file to a plain-text file -- are specific to the OT1 font encoding method. If you come across a pdf file that was created with latex and the ligature glyphs look all funny, and if the opportunity presents itself, you may wish to ask the paper's author if he/she might be willing to recompile it with the \usepackage[T1]{fontenc} instruction in the preamble...
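For concreteness, this is the corrected version of the MWE above with the encoding swapped (the only change from the earlier listing); copying from the resulting PDF should then yield the full text:
\documentclass{standalone}
% T1 encoding: the ligature glyphs carry proper mappings, so
% copy-and-paste recovers the underlying characters.
\usepackage[T1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}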
This does not seem to help when extracting text with ghostscript, e.g., gs -sDEVICE=txtwrite [...], so the difference may be related to your PDF viewer. – tglas Jun 12 '20 at 10:43
Just for the record: none of the above worked for me, but this thread did: http://www.latex-community.org/forum/viewtopic.php?f=5&t=953
Precisely these lines:
\usepackage{microtype}
\DisableLigatures[f]{encoding=T1}
before the \usepackage[T1]{fontenc} line.
If all else fails, this can be a quick-and-dirty solution.
Take care, B
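Put together as a minimal sketch (my assumption on placement, following the lines above; note that this disables the f-ligatures outright rather than making them copyable):
\documentclass{article}
% Quick-and-dirty: switch off the f-ligatures so that "f" and "i"
% are always typeset (and copied) as separate characters.
\usepackage{microtype}
\DisableLigatures[f]{encoding=T1}
\usepackage[T1]{fontenc}
\begin{document}
iffy fig flat office baffle
\end{document}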
[image] fi. – Werner Nov 02 '11 at 17:06