The command \includegraphics[page=n,viewport=x y X Y,clip]{filename} extracts page n from filename.pdf and clips it to the rectangular viewport with lower-left corner (x,y) and upper-right corner (X,Y). Coordinates are relative to the origin of the page's bounding box.
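For instance, assuming a multi-page file figure.pdf (the file name and coordinates are only illustrative):

\includegraphics[page=2,viewport=190 500 400 620,clip]{figure}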
Since the clipping rectangle is arbitrary, you cannot reproduce it by merely deleting objects from the input document. A combination of the two techniques would work, though: delete everything that does not intersect the viewport and visually clip the rest.
However, the practical and aesthetic benefits of such a complicated procedure are nil, except in your peculiar use case. And, unfortunately, implementing it looks non-trivial. For me, at least.
Let's imagine instead that you could wipe out all text from your input file: this clearly solves your problem. I do realize that the text of complicated figures (e.g. graph labels) would disappear too, but you wrote about a logo, so I'm hoping this will be enough. In any case, adapting the following trick to spare some bits of text is not difficult at all.
The plan is dead simple:
- uncompress the PDF
- delete all text
- compress the PDF (optional, just a matter of good manners)
(Un)compressing the PDF
There are many tools for the job. I'm using pdftk, but you can pick anything you like: we just need to undo the compression that obfuscates the raw PDF code. The syntax of the necessary console commands is crystal clear:
pdftk input.pdf output output.pdf uncompress
pdftk input.pdf output output.pdf compress
Mangling the PDF
This is the tricky part. If you read §5.3 Text objects of the PDF reference you will see that text objects are delimited by a pair of unique operators, BT and ET (as in begin/end text), and cannot be nested. More or less, we just want to kill the three central lines in every occurrence of something shaped like
... mysterious pdf code ...
BT
... text operators code ...
ET
... mysterious pdf code ...
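For illustration (this snippet is made up, not taken from any particular file), an actual text object in an uncompressed content stream typically looks like

BT
/F1 12 Tf
72 712 Td
(Some caption text) Tj
ET

where Tf selects a font and size, Td sets the text position and Tj paints the string.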
How are we going to do it? Regular expressions, of course.
There are some catches about the role of whitespace (§3.1 Lexical conventions), but I will simply sidestep them. Here is the black magic:
s/^BT.*?^ET//smg
This substitutes every match of ^BT.*?^ET with nothing at all; the flags mean:
s: let . also match \n
m: let ^ and $ match at the start and end of every line, not just of the whole string
g: perform a global search (i.e. replace all matches, not just the first)
The string ^BT.*?^ET can be broken down as
^BT: matches the operator opening a text object
^: match the start of a line
BT: match the string BT
.*?: matches the shortest possible string of characters including newlines
.: match anything (including \n thanks to the s flag)
*: repeat last match zero or more times
?: be lazy with the last match instead of greedy (i.e. choose the shortest instead of the longest)
^ET: matches the operator closing a text object
^: match the start of a line
ET: match the string ET
The crux here is the laziness of the central pattern: it guarantees we match the ET closing the same text object, not the last ET in the file.
I will use Perl to apply the regular expression, but again: pick your favourite tools for the job.
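To see the substitution in isolation, here is a tiny self-contained Perl sketch (the content stream is a made-up toy, the substitution is the real one):

#!/usr/bin/perl
use strict;
use warnings;

# a toy uncompressed content stream: graphics operators around one text object
my $stream = "0.5 w\nq\nBT\n/F1 12 Tf\n72 712 Td\n(a logo caption) Tj\nET\nQ\n";

# the same substitution used below: wipe out every BT ... ET block
$stream =~ s/^BT.*?^ET//smg;

print $stream;    # only the graphics operators survive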
Wrapping it up in LaTeX
Now we put everything into a nice macro.
As you may have guessed by now, you will have to compile your files with the --shell-escape option, at least for the first run, to allow the execution of shell commands via \write18.
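For example (the file name is a placeholder):

pdflatex --shell-escape document.tex

Without this option, most TeX distributions only allow a restricted set of commands through \write18, and the conversion step will not run.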
Here we go:
\documentclass{article}
\usepackage{graphicx}
\newcommand\includesquelchedpdf[2][]{%
  \IfFileExists{./#2_squelched.pdf}
    {\relax}
    {\IfFileExists{./#2.pdf}
       {\immediate\write18{pdftk #2.pdf output - uncompress
          | perl -0777 -pe 's/^BT.*?^ET//smg'
          | pdftk - output #2_squelched.pdf compress}}
       {\errmessage{Error: you tried to squelch a nonexistent PDF file}}}%
  \includegraphics[#1]{#2_squelched}}
\begin{document}
\fbox{\includesquelchedpdf[page=1,viewport=190 500 400 620,clip]{lipsum_image}}
some text
\end{document}
I used some pipes to make the shell command leaner.
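Stand-alone, outside the macro, the same pipeline reads as follows (file names are placeholders; the -0777 switch makes Perl slurp the whole input as one string, so the multi-line match can do its job):

pdftk input.pdf output - uncompress \
  | perl -0777 -pe 's/^BT.*?^ET//smg' \
  | pdftk - output input_squelched.pdf compress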
I also added some control structure to avoid choking on nonexistent files or reprocessing already squelched ones (so if you ever change the original PDF, delete the _squelched copy to force regeneration).
I think it's reasonable to affirm that the output will behave as expected in every document viewer, as there are simply no more text objects to select.
This works on the MWE. If your real life use case includes some small bits of text, we can probably work it out using some more black magic.
That was fun!
Update
@cfr pointed out to me in the comments that pdftk is no longer supported, so I propose an alternative tool that may be preferable. First, the code:
\documentclass{article}
\usepackage{graphicx}
\newcommand\includesquelchedpdf[2][]{%
  \IfFileExists{./#2_squelched.pdf}
    {\relax}
    {\IfFileExists{./#2.pdf}
       {\immediate\write18{qpdf -qdf #2.pdf -
          | perl -0777 -pe 's/^BT.*?^ET//smg'
          | fix-qdf > #2_squelched.pdf}}
       {\errmessage{Error: you tried to squelch a nonexistent PDF file}}}%
  \includegraphics[#1]{#2_squelched}}
\begin{document}
\setlength\fboxsep{0pt}
\fbox{\includesquelchedpdf[page=1,viewport=190 500 400 620,clip]{lipsum_image}}
some text
\end{document}
I am using qpdf (fix-qdf is part of the bundle). This offers some advantages:
- qpdf is very much alive (the latest version is 5.1.3, released May 24, 2015), open source, and has no paid pro version;
- it features the QDF mode, designed exactly to unbundle and manipulate PDF files as text (which is what we are doing);
- it ships with the fix-qdf tool to repair possible damage caused by the editing;
- it guarantees that our regular expressions will always work (see §4 of its manual).
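For reference, the stand-alone version of the squelching pipeline with qpdf looks like this (file names are placeholders):

qpdf --qdf input.pdf - \
  | perl -0777 -pe 's/^BT.*?^ET//smg' \
  | fix-qdf > input_squelched.pdf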
All in all, qpdf is probably preferable. I'll leave both versions here, as I know the two tools too superficially to make a really fair comparison of their features.