14

I was wondering why the text in some PDF documents compiled from LaTeX code cannot be copied correctly, i.e. what you get when you copy it is nonsense text. For example, this PDF document (I don't have the LaTeX code for it).

diabonas
  • 25,784
Tim
  • 5,763
  • 2
    Looks like http://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text to me – Joseph Wright Jan 20 '12 at 08:20
  • 2
    @Joseph -- your link covers the same topic, but from a different angle, namely how can one make this happen. the present question is "why does it happen?" – barbara beeton Jan 20 '12 at 14:06
  • 1
    This sort of nonsense text being copied from PDFs was very common with some old versions of Ghostscript when bitmap fonts were involved. Your example PDF is far from being the worst I've seen. – Philippe Goutet Aug 15 '12 at 16:29

2 Answers

23

The PDF specification allows for this - displaying something different from what is retrievable. The randtext package is perhaps a prime example of this. It provides \randomize{<text>} which (from the package documentation)

...typesets a box that looks, on paper, like <text>, but whose letters have in fact been placed in random order so that they are not copiable from the file directly.
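As a minimal sketch of how this might be used (the e-mail address is a made-up placeholder, and the exact copy-and-paste result will depend on the PDF viewer):

```latex
\documentclass{article}
\usepackage{randtext}
\begin{document}
% Printed output reads normally; the letters are stored in
% scrambled order in the PDF, so copying gives jumbled text.
Contact: \randomize{user@example.org}
\end{document}
```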

Another example of this is provided by the accsupp package, which lets you specify text to be extracted that differs from the text displayed, by wrapping the typeset material in a pair of macros:

\BeginAccSupp{ActualText={<copy>}}<typeset>\EndAccSupp{}

For an example of this, see How to make text copy in PDF previewers ignore lineno line numbers?
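A minimal complete document along these lines might look as follows. The two strings are arbitrary placeholders, and whether the replacement text is actually what gets copied depends on the viewer honouring ActualText, so treat this as a sketch rather than a guaranteed result:

```latex
\documentclass{article}
\usepackage{accsupp}
\begin{document}
% The page shows "visible text"; a viewer that supports
% ActualText should copy "copied text" instead.
\BeginAccSupp{ActualText={copied text}}visible text\EndAccSupp{}
\end{document}
```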

The motivation for this is most certainly security, as explained in the randtext documentation:

The function of this odd macro is to obfuscate e-mail addresses, say on a PDF document put online, so that the human reader sees the address as expected, but e-mail address harvesters and spambots cannot determine the address.

Werner
  • 603,163
5

Perhaps a more useful question is what to do to avoid such trouble. If so, the answer is to load the cmap package. It is an absolute must if the text contains Cyrillic letters. However, to the best of my knowledge, cmap is not required if the source text is in UTF-8 encoding.
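A minimal sketch of a pdfLaTeX preamble that loads cmap for Cyrillic text follows. The T2A font encoding, the babel option and the sample sentence are assumptions chosen for illustration, and suitable T2A fonts need to be installed. As far as I know, cmap should be loaded before fontenc:

```latex
\documentclass{article}
\usepackage{cmap}            % load first, before fontenc
\usepackage[T2A]{fontenc}    % assumed Cyrillic font encoding
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel}
\begin{document}
% With cmap loaded, this text should stay searchable and copyable.
Привет, мир!
\end{document}
```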