The command \includegraphics[page=n,viewport=x y X Y,clip]{filename} extracts page n from filename.pdf and clips it to the rectangular viewport with lower-left corner (x,y) and upper-right corner (X,Y). Coordinates are relative to the origin of the page's bounding box.
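For instance, assuming a multi-page file figure.pdf (the file name and coordinates are only illustrative):

\includegraphics[page=2,viewport=190 500 400 620,clip]{figure}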
Since the clipping rectangle is arbitrary, you cannot reproduce it by merely deleting objects from the input document. A combination of the two techniques would work, though: delete everything that does not intersect the viewport and visually clip the rest.
However, the practical and aesthetic benefits of such a complicated procedure are nil, except in your peculiar use case. And, unfortunately, implementing it looks non-trivial. For me, at least.
Let's imagine instead that you could wipe out all text from your input file: this clearly solves your problem. I do realize that the text of complicated figures (e.g. graph labels) would disappear too, but you wrote about a logo, so I'm hoping this will be enough. In any case, adapting the following trick to spare some bits of text is not difficult at all.
The plan is dead simple:
- uncompress the PDF
- delete all text
- compress the PDF (optional, just a matter of good manners)
(Un)compressing the PDF
There are many tools for the job. I'm using pdftk, but you can pick anything you like: we just need to undo the compression that obfuscates the raw PDF code. The syntax of the necessary console commands is crystal clear:
pdftk input.pdf output output.pdf uncompress
pdftk input.pdf output output.pdf compress
Mangling the PDF
This is the tricky part. If you read §5.3 Text objects of the PDF reference you will see that text objects are delimited by a pair of unique operators, BT and ET (as in begin/end text), and cannot be nested. More or less, we just want to kill the three central lines in every occurrence of something shaped like
... mysterious pdf code ...
BT
... text operators code ...
ET
... mysterious pdf code ...
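For illustration (this snippet is made up, not taken from any particular file), an actual text object in an uncompressed content stream typically looks like

BT
/F1 12 Tf
72 712 Td
(Some caption text) Tj
ET

where Tf selects a font and size, Td sets the text position and Tj paints the string.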
How are we going to do it? Regular expressions, of course.
There are some catches about the role of whitespace (§3.1 Lexical conventions), but I will simply sidestep them. Here is the black magic:
s/^BT.*?^ET//smg
This substitutes every match of ^BT.*?^ET with nothing at all; the flags mean:
s: let . also match \n
m: let ^ and $ match at the start and end of every line, not just of the whole string
g: perform a global search (i.e. replace all matches, not just the first)
The string ^BT.*?^ET can be broken down as
^BT: matches the operator opening a text object
^: match the start of a line
BT: match the string BT
.*?: matches the shortest possible string of characters including newlines
.: match anything (including \n thanks to the s flag)
*: repeat last match zero or more times
?: be lazy with the last match instead of greedy (i.e. choose the shortest instead of the longest)
^ET: matches the operator closing a text object
^: match the start of a line
ET: match the string ET
The crux here is the laziness of the central pattern: it guarantees we match the ET closing the same text object, not the last ET in the file.
I will use Perl to apply the regular expression, but again: pick your favourite tools for the job.
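To see the substitution in isolation, here is a tiny self-contained Perl sketch (the content stream is a made-up toy, the substitution is the real one):

#!/usr/bin/perl
use strict;
use warnings;

# a toy uncompressed content stream: graphics operators around one text object
my $stream = "0.5 w\nq\nBT\n/F1 12 Tf\n72 712 Td\n(a logo caption) Tj\nET\nQ\n";

# the same substitution used below: wipe out every BT ... ET block
$stream =~ s/^BT.*?^ET//smg;

print $stream;    # only the graphics operators survive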
Wrapping it up in LaTeX
Now we put everything into a nice macro.
As you may have guessed by now, you will have to compile your files with the --shell-escape option, at least for the first run, to allow the execution of shell commands via \write18.
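For example (the file name is a placeholder):

pdflatex --shell-escape document.tex

Without this option, most TeX distributions only allow a restricted set of commands through \write18, and the conversion step will not run.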
Here we go:
\documentclass{article}
\usepackage{graphicx}
\newcommand\includesquelchedpdf[2][]{%
  \IfFileExists{./#2_squelched.pdf}
    {\relax}
    {\IfFileExists{./#2.pdf}
       {\immediate\write18{pdftk #2.pdf output - uncompress
          | perl -0777 -pe 's/^BT.*?^ET//smg'
          | pdftk - output #2_squelched.pdf compress}}
       {\errmessage{Error: you tried to squelch a nonexistent PDF file}}}%
  \includegraphics[#1]{#2_squelched}}
\begin{document}
\fbox{\includesquelchedpdf[page=1,viewport=190 500 400 620,clip]{lipsum_image}}
some text
\end{document}
I used some pipes to make the shell command leaner.
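Stand-alone, outside the macro, the same pipeline reads as follows (file names are placeholders; the -0777 switch makes Perl slurp the whole input as one string, so the multi-line match can do its job):

pdftk input.pdf output - uncompress \
  | perl -0777 -pe 's/^BT.*?^ET//smg' \
  | pdftk - output input_squelched.pdf compress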
I also added some control structure to avoid choking on nonexistent files or reprocessing already squelched ones (so if you ever change the original PDF, delete the _squelched copy to force regeneration).
I think it's reasonable to affirm that the output will behave as expected in every document viewer, as there are simply no more text objects to select.
This works on the MWE. If your real life use case includes some small bits of text, we can probably work it out using some more black magic.
That was fun!
Update
@cfr pointed out to me in the comments that pdftk is no longer supported, so I propose an alternative tool that may be preferable. First, the code:
\documentclass{article}
\usepackage{graphicx}
\newcommand\includesquelchedpdf[2][]{%
  \IfFileExists{./#2_squelched.pdf}
    {\relax}
    {\IfFileExists{./#2.pdf}
       {\immediate\write18{qpdf -qdf #2.pdf -
          | perl -0777 -pe 's/^BT.*?^ET//smg'
          | fix-qdf > #2_squelched.pdf}}
       {\errmessage{Error: you tried to squelch a nonexistent PDF file}}}%
  \includegraphics[#1]{#2_squelched}}
\begin{document}
\setlength\fboxsep{0pt}
\fbox{\includesquelchedpdf[page=1,viewport=190 500 400 620,clip]{lipsum_image}}
some text
\end{document}
I am using qpdf (fix-qdf is part of the bundle). This offers some advantages:
- qpdf is very much alive (the latest version is 5.1.3, released May 24, 2015), open source, and has no paid pro version;
- it features the QDF mode, designed exactly to unbundle and manipulate PDF files as text (which is what we are doing);
- it ships with the fix-qdf tool to repair possible damage caused by the editing;
- it guarantees that our regular expressions will always work (see §4 of its manual).
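For reference, the stand-alone version of the squelching pipeline with qpdf looks like this (file names are placeholders):

qpdf --qdf input.pdf - \
  | perl -0777 -pe 's/^BT.*?^ET//smg' \
  | fix-qdf > input_squelched.pdf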
All in all, qpdf is probably preferable. I'll leave both versions here, as I know the two tools too superficially to make a really fair comparison of their features.