17

I am aware that a compiled PDF typically cannot be decompiled back into LaTeX without using workarounds. However, I am worried about portions of the information available in LaTeX still being retrievable from the PDF.

Specifically, a situation like the following:

\documentclass{article}

\def\foopseudonym{bar}

\begin{document}

The pseudonym for the first respondent is \foopseudonym

\end{document}

I want to use \foopseudonym to keep track of which pseudonyms refer to whom when writing my article, but I don't want anyone to be able to find the name of my defined command by decompiling the PDF. Do I need to worry?

  • 2
    As I understand it, yes. Unless you do something special, the markup doesn't make it into the output. If it did, the problem of accessible PDFs would not be a problem. TeX basically arranges boxes according to algorithms and outputs the boxes when it has complete pages. You have to make an effort if you want the result to include anything else. Other engines are a little different, but they're still using the basic TeX innards. Have you tried checking with e.g. pdfgrep? You can always inspect the PDF to see what's there. – cfr Sep 14 '23 at 11:36
  • 2
    @cfr thanks for the comment. I just realised the title of my question was the inverse of the body ("can names be retrieved" vs. "am I safe"), so I edited it to be consistent, if you want your comment to match that too. – Toivo Säwén Sep 14 '23 at 11:45
  • 4
    For the benefit of other visitors: The question was inverted (yes=no) but the comment may or may not be inverted. Read carefully – rallg Sep 14 '23 at 23:50
  • 1
    Related: "Is the filename of an image preserved in the final PDF?" A couple of answers, including mine, demonstrate some tests that might also be applicable here. – Chris H Sep 15 '23 at 13:15

3 Answers

15

Well, it depends. It is possible to grab source code and add it to the PDF. As a proof of concept (this requires a current LaTeX):

\DocumentMetadata{uncompress,testphase={phase-III,math}}
\documentclass{article}

\def\foopseudonym{bar}

\begin{document}

The pseudonym for the first respondent is

$ \foopseudonym = x $

\end{document}

If you open the PDF you will see in it:

stream
LaTeX formula starts \begin {math} \foopseudonym = x \end {math} LaTeX formula ends 
endstream
Ulrike Fischer
  • 327,261
  • 1
    Interesting. But I suppose this would have to be done actively by the user, or alternatively passively by the compiler of choice? – Toivo Säwén Sep 14 '23 at 11:46
  • 6
    @ToivoSäwén This is opt-in: it's going to be needed for making tagged PDFs – Joseph Wright Sep 14 '23 at 12:05
  • 7
    Certainly this must be done by some code, but a normal user can't easily know if this happens in some package. – Ulrike Fischer Sep 14 '23 at 12:09
  • 4
    So is there a way for a normal user to be certain this information isn't in the PDF? There are all kinds of reasons you might need to do this e.g. anonymous submissions, anonymising student names etc. – cfr Sep 14 '23 at 13:45
  • 3
    @cfr It is not trivial to be certain. There are tools that allow you to inspect the PDF, but you would need to know all the places that might be relevant. But do not get too worried: apart from the tagging code and the embedfile package, I don't know of any LaTeX package that actually embeds source code. – Ulrike Fischer Sep 14 '23 at 13:53
  • 1
    @UlrikeFischer Thanks. For submission purposes, I'm not too worried. In the medium term, I'll probably have to think of an alternative approach to student details. If a PDF is split into pages, is there the possibility of a page containing information from other pages? If so, that's going to be a real pain to work around. – cfr Sep 14 '23 at 14:02
  • 4
    @cfr A PDF has lots of objects that are not directly in the page stream. E.g. in the example above the information is in an embedded file. Whether that gets lost when you split a PDF depends on the PDF processor. – Ulrike Fischer Sep 14 '23 at 14:49
  • Thanks @UlrikeFischer. I'll see what I can find out about pdftk. The VLE does its best by making it virtually impossible for students to find even their own feedback, but I don't want them getting traces of everyone else's if they do find it ;). – cfr Sep 14 '23 at 15:16
3

Basically, the process of creating the PDF is not compilation, and there's no decompilation possible. PDF is a format to describe the output of your document: the actual letters and images you put on the paper (be it real paper or virtual). Although, for practical reasons, hardly any program does so, there isn't even a requirement for those letters to come in sequence: you could perfectly well create a PDF that prints all the letters 'a' on the page in whatever position they land on, then goes on to print all the b's, then all the c's, and so on.

As others have noted, there are ways to include some code by using special features, but that is probably something you have to work hard for. It doesn't happen in regular use. But if you're really paranoid :-), you have to use some kind of PDF explorer that gives you a programmer's view of what's inside. Even then, unless you do something really extra, there will be fonts, text and images (bitmap or vector), and nothing else. The exception is regular metadata such as document title, author, creation date and the software that created the PDF, but I don't think you mean these. In fact, these can usually be read directly in the PDF file with any simple text editor.
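For a rough version of that programmer's view, here is a minimal Python sketch (not a full PDF parser) that looks for a given byte string, such as a macro name, both in the raw file and inside Flate-compressed streams. The function name find_in_pdf_bytes and the regular expression are my own illustration, not part of any PDF tool. It assumes an unencrypted file and only tries zlib inflation; other filters are skipped. Visible page text is often stored in font-specific encodings and split up by kerning, so a miss here does not prove that a word is absent from the rendered pages. It is mainly useful for spotting embedded source or metadata.

```python
import re
import zlib

# Matches the body between "stream" and "endstream" keywords.
# This is a heuristic: it ignores the /Length entry of the stream dictionary.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def find_in_pdf_bytes(data: bytes, needle: bytes) -> list[str]:
    """Return short descriptions of where `needle` occurs in `data`."""
    hits = []
    # Uncompressed objects and plain metadata are visible in the raw bytes.
    if needle in data:
        hits.append("raw bytes")
    for index, match in enumerate(STREAM_RE.finditer(data)):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or uses another filter
        if needle in inflated:
            hits.append(f"stream {index} (inflated)")
    return hits


# Usage (hypothetical file name):
#   with open("article.pdf", "rb") as fh:
#       print(find_in_pdf_bytes(fh.read(), b"foopseudonym"))
```

An empty list means the string was found neither in the raw bytes nor in any stream this sketch could inflate. A dedicated PDF inspection tool is still the safer choice if the stakes are high.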

Gábor
  • 131
2

If it is really important to be sure this is not a problem, you may want to post-process the PDF.

My suggestion would be to simply reprocess it to create a new PDF:

  • Use your favorite operating system and print it to a PDF file. You may need to do some experimenting to find a combination that does exactly what you want. Modern versions of macOS and Windows have this built in.
  • Print to an HP PCL file (their proprietary language) and convert that back into PDF.
  • Use something like Adobe Distiller or the Ghostscript equivalent to create a version with the properties you want.