I have created two test documents. Here's foo.tex:

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
Gödel
\end{document}

And here's foo2.tex:

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
G\"odel
\end{document}

They were both compiled with pdflatex. The resulting PDFs look the same and "feel" the same: In both, SumatraPDF will find the search string "Gödel". In both, I can copy the word and paste it into a text editor. If I type "Gödel" into the search box of Windows, it finds both PDFs.

However, if I compare the PDFs with diff, it turns out they are not identical. What is the difference? Is there a reason to prefer one version over the other?

EDIT:

I was asked to clarify the question. My concern was not whether the files looked the same, so the old question about comparing the visual appearance of two PDFs didn't really help. My concern was whether the files behaved in the same way, e.g. if the ö would really be identified as umlaut-o by search engines and so on. I was wondering whether \"o would insert a "real character" or whether it would rather compose something out of different glyphs.

Frunobulax
  • 2
    Really don't use luainputenc (especially not ansinew); why not save the files in UTF-8? Nor use T1 encoding with luatex. (But I would expect ö to be encoded in the same way in both cases; what difference did you see?) – David Carlisle Mar 12 '21 at 14:16
  • 2
    To give an example of the issues created, your original file may have been in the so-called "ansinew" (a Microsoft encoding that was not specified by ANSI nor is new), but when posted to this site it is converted to UTF-8, so for anyone copying the documents back to test, it will not work at all. – David Carlisle Mar 12 '21 at 14:19
  • What's the purpose of using the luainputenc package if you're not going to use Lua(La)TeX to compile the documents? – Mico Mar 12 '21 at 14:25
  • The first document produces this output with pdflatex, as it is in UTF-8 but declared as "ansinew" – David Carlisle Mar 12 '21 at 14:27
  • The UTF-8 input encoding has been the default for LaTeX since 2018. In theory, if you just delete the luainputenc line, the body of the two documents should be the same. – Davislor Mar 12 '21 at 14:41
  • 3
    I don't think this is really an encoding issue at all. If you compile a "Hello world" program, make a copy of the file, compile again, and compare this result to the copy of the first run, you'll also get a difference from diff. I guess diff is also comparing metadata like creation time, which of course will differ. – campa Mar 12 '21 at 14:41
  • 3
    There's also https://tex.stackexchange.com/q/229605/107497 for how to make a reproducible build. – Teepeemm Mar 12 '21 at 14:49
  • @campa What you wrote about the "Hello World" program is not true. diff compares the file contents and if you compile the same, say, C program twice with the same settings you'll get exactly the same output. File metadata like creation time is not stored in the file but in the file system. – Frunobulax Mar 12 '21 at 15:10
  • 1
    Which part is not true? The bit on the metadata was a guess, as I've explicitly said. On the other hand, the accepted (by you!) answer and the answer linked by Teepeemm make the very same claim. – campa Mar 12 '21 at 15:17
  • @campa The sentence "If you compile..." is not true. I accepted the answer because it showed that pdflatex stores metadata like the creation date directly in the file, which is not what, say, a C compiler would do. In my comment I just wanted to clarify that what is usually called metadata, and what an operating system like Windows, Linux, or macOS will show you, is not a part of the file. If diff always compared metadata, then two files would almost never be reported as identical. See for example https://en.wikipedia.org/wiki/Inode – Frunobulax Mar 12 '21 at 15:34
  • 1
    I believe there is a misunderstanding here. By "compiling a hello world program" I meant of course a LaTeX hello world program, where the PDF file definitely has some metadata incorporated in it. I never wanted to claim that this would hold e.g. for a C program. If this is the issue then I'm sorry, I should've been clearer. – campa Mar 12 '21 at 15:42

1 Answer

If you prepare these two files with PDF compression turned off (call them one.tex and two.tex):

\documentclass{article}
\usepackage[T1]{fontenc}
\pdfcompresslevel 0\relax
\begin{document}
Gödel
\end{document}

and

\documentclass{article}
\usepackage[T1]{fontenc}
\pdfcompresslevel 0\relax
\begin{document}
G\"odel
\end{document}

and compile them with pdflatex, then look at the differences:

 [romano:~/tmp] 25s % diff -u -a one.pdf two.pdf
--- one.pdf 2021-03-12 15:48:03.703754062 +0100
+++ two.pdf 2021-03-12 15:46:57.370760208 +0100
@@ -79,8 +79,8 @@
 <<
 /Producer (pdfTeX-1.40.20)
 /Creator (TeX)
-/CreationDate (D:20210312154803+01'00')
-/ModDate (D:20210312154803+01'00')
+/CreationDate (D:20210312154657+01'00')
+/ModDate (D:20210312154657+01'00')
 /Trapped /False
 /PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019/Debian) kpathsea version 6.3.1)
 >>
@@ -162,7 +162,7 @@
 /W [1 2 1]
 /Root 11 0 R
 /Info 12 0 R
-/ID [<6BDFF40DFF895060C9767118A4C7BD04> <6BDFF40DFF895060C9767118A4C7BD04>]
+/ID [<0A94E0543DCD0E80E4351C51D3C26760> <0A94E0543DCD0E80E4351C51D3C26760>]
 /Length 56        
 >>
 stream

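you see that, as suggested by @campa in the comments, the differences are only in the metadata: the creation and modification dates and the /ID in the trailer. The page content, including the ö, is the same in both files.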
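
To take the metadata out of the comparison altogether, a minimal sketch (assuming pdfTeX 1.40.17 or later, where these primitives are available) is to switch off the run-dependent entries in both files. The three extra lines are not part of the test files above; they are added here only for illustration:

```latex
% two.tex, with the run-dependent PDF metadata switched off (sketch).
% one.tex would be identical except that the body reads "Gödel".
\documentclass{article}
\usepackage[T1]{fontenc}
\pdfcompresslevel 0\relax
\pdfinfoomitdate=1        % omit /CreationDate and /ModDate from the Info dictionary
\pdftrailerid{}           % empty argument: do not write an /ID entry in the trailer
\pdfsuppressptexinfo=-1   % drop the /PTEX.* entries such as /PTEX.Fullbanner
\begin{document}
G\"odel
\end{document}
```

With these settings, running `diff -a one.pdf two.pdf` again should report no differences if the two inputs really produce the same glyph. That is what one would expect here: under the T1 encoding LaTeX resolves `\"o` to the single font slot 246 (odieresis), and a UTF-8 `ö` in the input is mapped to `\"o` first, so neither version composes the letter from separate glyphs.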

Rmano
  • 4
    +1 I was stuck finding which options for diff to use :-) – campa Mar 12 '21 at 14:57
  • 1
    For these types of experiments you can use regression-test.tex which will remove the PDF metadata and compression automatically. Example: http://dpaste.com/4B3AFU4GH (expires in 10 days) – Henri Menke Mar 12 '21 at 15:58