What can cause a generated PDF document whose text is not correctly copyable?

Question

I was wondering why some PDF documents compiled from LaTeX code do not make their text correctly copyable, i.e. copying from them yields nonsense text. For example, this pdf document (I don't have the LaTeX code for it).

Looks like http://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text to me — Joseph Wright, Jan 20 '12 at 08:20
@Joseph -- your link covers the same topic, but from a different angle, namely how can one make this happen. the present question is "why does it happen?" — barbara beeton, Jan 20 '12 at 14:06
This sort of nonsense text being copied from PDFs was very common with some old versions of Ghostscript when bitmap fonts were involved. Your example PDF is far from being the worst I've seen. — Philippe Goutet, Aug 15 '12 at 16:29

score 23 · Accepted Answer · edited Apr 13 '17 at 12:36

The PDF specification allows for this - displaying something different from what is retrievable. The randtext package is perhaps a prime example of this. It provides \randomize{<text>} which (from the package documentation)

...typesets a box that looks, on paper, like <text>, but whose letters have in fact been placed in random order so that they are not copiable from the file directly.
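As a minimal sketch of the usage described above, assuming the package is installed and that the argument is a short string without spaces (the address below is a made-up placeholder):

```latex
\documentclass{article}
\usepackage{randtext}
\begin{document}
% The address should look normal on the page; the package is meant
% to place the letters in the file in a shuffled order.
Contact: \randomize{user@example.com}
\end{document}
```

Whether the copied text actually comes out scrambled may depend on the PDF viewer or extraction tool, so it is worth checking the result with the tools your readers are likely to use.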

Another example is provided by the accsupp package, which lets one specify text to be extracted that differs from the text displayed, using a grouping:

\BeginAccSupp{ActualText={<copy>}}<typeset>\EndAccSupp{}

For an example of this, see How to make text copy in PDF previewers ignore lineno line numbers?
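A small complete document using that grouping might look as follows. This is only a sketch: the displayed address and the replacement string are invented placeholders, and how the ActualText entry is honoured depends on the PDF viewer.

```latex
\documentclass{article}
\usepackage{accsupp}
\begin{document}
% Displayed on the page: user@example.com
% Intended result of copy and paste: address withheld
Contact:
\BeginAccSupp{ActualText={address withheld}}%
user@example.com%
\EndAccSupp{}
\end{document}
```

The same mechanism can also work in the helpful direction, for instance by supplying readable ActualText for a symbol that would otherwise copy as garbage.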

The motivation for this is most certainly security, as the randtext documentation explains:

The function of this odd macro is to obfuscate e-mail addresses, say on a PDF document put online, so that the human reader sees the address as expected, but e-mail address harvesters and spambots cannot determine the address.

answered Jan 20 '12 at 05:00

Werner

603,163

My test document created with \randomize from the randtext package does not seem to randomize anything for me. I can happily copy and paste with evince and also pdftotext returns all the typeset text. – Frederick Nord Feb 14 '13 at 21:05
@FrederickNord: PDF security, as it seems from numerous other comments I've read, is heavily reliant on the PDF viewer. – Werner Feb 14 '13 at 21:13
Hm @Werner. I don't know much about PDF, but rumour has it, that you can remove or replace the (easily) machine readable text but still have something that renders text nicely. So a human (or OCR) could indeed read the text while pdftotext could not. Do you happen to know anything about that? – Frederick Nord Feb 15 '13 at 01:11
@FrederickNord: For an entire text, or merely snippets? See, for example, How to redefine @ and . to obfuscate email addresses? and/or Obfuscation of @ and . in e-mail addresses for email-specific obfuscation. – Werner Feb 15 '13 at 06:35
I mean for an entire text. Pretty much like here but in a more pdfLaTeX way. – Frederick Nord Feb 15 '13 at 12:26
@FrederickNord: Haven't seen anything like that before... and what I've seen, is awesome. It seems like a one-time background fiddling with the font encoding, so it's not required with every run. – Werner Feb 15 '13 at 17:20

score 5 · Answer 2 · answered Jan 20 '12 at 07:36

Perhaps a more useful question is how to avoid such trouble. If so, the answer is to load the cmap package. It is an absolute must if the text contains Cyrillic letters. However, to the best of my knowledge, cmap is not required if the source text is in UTF-8 encoding.
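A minimal sketch of such a preamble for pdfLaTeX, assuming the T2A font encoding and babel's Russian support are installed (the sample text is only an illustration):

```latex
\documentclass{article}
\usepackage{cmap}            % load early, before fontenc
\usepackage[T2A]{fontenc}    % font encoding with Cyrillic glyphs
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}
\begin{document}
Привет, мир!
\end{document}
```

The cmap documentation asks for the package to be loaded before fontenc, which is why it comes first here. After compiling, copying the sentence from the PDF and pasting it into a text editor is a quick way to check the result.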
