Why can't a PDF file be reproduced?

Question

I've run into an interesting problem. I created a PDF file a long time ago. It can be downloaded here. Now, just to recall how it was generated, I decided to repeat the process and compare the PDF files. I'm getting close, but I can't make the old and new PDF files identical. I remember that I did everything in the usual way; I only changed the margins.

Why can't the PDF file be reproduced?

This is the process to generate the new PDF file:

1) Get the latest cweb sources from ftp://ftp.cs.stanford.edu/pub/cweb/cweb-3.64ah.tgz

2) In cwebmac.tex change {NOS} fith to {NOS} fitb, manually or with this command:

perl -i -pe 's/{NOS} fith/{NOS} fitb/' cwebmac.tex

3) Add the following to the end of cwebmac.tex

\let\Blue=\Black
\hoffset=1.52400970458984374999999999999cm
\pageshift=2in
\advance\pageshift by-\hoffset
\advance\hoffset by-1in
\advance\pageshift by-1in

4) Build cweave

touch *.c
make

5) Run cweave on cweave.w

./cweave cweave.w

6) Generate the pdf file:

SOURCE_DATE_EPOCH=1460880679 pdftex cweave.tex

7) Now compare the old PDF with the new one. For this we must uncompress the objects in both PDF files:

qpdf --qdf --object-streams=disable cweave.pdf cweave-long.pdf
qpdf --qdf --object-streams=disable cweave-old.pdf cweave-old-long.pdf
diff -u cweave-old-long.pdf cweave-long.pdf

The diff shows that in the new PDF a lot of values are smaller by exactly 0.001 than in the old PDF, and I can't make this 0.001 disappear. If I set \hoffset to 1.52400970458984375cm, the values in the new PDF are greater by 0.001 than in the old PDF. And if I set \hoffset to 1.52400970458984374999999999999cm, the values in the new PDF are smaller by 0.001 than in the old PDF. I'm completely puzzled by this. Also, I remember having set \hoffset to something simple, like 1.5cm, not to this value, which I constructed empirically by repeatedly comparing the diff.

Also, some hyphenation is changed. For example, the following is different in old and new pdf files:

-/F13 9.9626 Tf 125.8 495.045 Td [(i)]TJ/F3 7.9701 Tf 13.837 0 Td [(Used)-354(in)-354(secti)-1(o)1(n)]TJ
+/F13 9.9626 Tf 125.799 495.045 Td [(i)]TJ/F3 7.9701 Tf 13.837 0 Td [(Used)-354(in)-354(se)-1(ction)]TJ

i.e., how is

[(Used)-354(in)-354(secti)-1(o)1(n)]TJ

different from

[(Used)-354(in)-354(se)-1(ction)]TJ

? But, more importantly, why is it different? And what does this PDF code mean, anyway?

Why can't pdftex reproduce the PDF file?

This is the link to download the diff file: https://www.dropbox.com/s/lvbijcn2689cuye/cweave.diff?dl=1

What is the purpose of step 2? How does it affect the comparison if you leave it out? — Henri Menke, Nov 09 '16 at 10:15
@HenriMenke this change brings cwebmac.tex to the state in which it was when old pdf file was created, in order to remove irrelevant information from the diff of old and new pdf files — Igor Liferenko, Nov 10 '16 at 02:50
@jfbu No need to register - just click "No thanks, continue to view" at the bottom of registration form. BTW, which site do you use to store files for download? — Igor Liferenko, Nov 10 '16 at 07:42

score 7 · Accepted Answer · edited Jun 10 '20 at 12:32

final addition: when does `<foo> cm = <bar> in` hold for TeX ?

Skipping the mathematical details (and barring an oversight in my quick investigation), it is necessary and sufficient that rounding <foo> to an integral multiple f/65536 of 1/65536 gives an f which is a multiple of 127.

Then, and only then, is there some `<bar> in` which gives exactly the same TeX dimension. In that case the rounding of <bar> to an integral multiple g/65536 of 1/65536 has g a multiple of 50. And vice versa.

Example: we seek such a dimension near 0.6in. We need a multiple of 50 near 0.6*65536=39321.6, hence 39300 or 39350. The former is obtained via 0.59967in (and nearby decimals) and the latter via 0.60043in (and nearby decimals).

In fact, for TeX we have exactly

0.59967in = 1.52316cm = 2840211sp

0.60043in = 1.52510cm = 2843824sp

\number\dimexpr0.59967in\relax =\number\dimexpr1.52316cm \relax
\number\dimexpr0.60043in\relax =\number\dimexpr1.52510cm \relax
\bye

But, and this is the root of the OP's problem, 0.6in has no exact equivalent in the cm unit. In terms of sp units, the possible N sp which admit both a representation as <foo> in and as <bar> cm are those with N = int(3613.5*k) for some integer k. Above, k is respectively 786 and 787.
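The criterion can be checked numerically. The following Python sketch is only an illustration: it assumes the two-step rule derived further down in this answer (round the decimal to a multiple of 1/65536 with ties going up, multiply by 7227/100 or 7227/254, truncate), and the helper name to_sp is made up for this example.

```python
import math
from fractions import Fraction

# Conversion factors to pt as exact fractions (from tex.pdf, §458).
FACTORS = {"in": Fraction(7227, 100), "cm": Fraction(7227, 254)}

def to_sp(decimal, unit):
    """Return (x, sp): x is the decimal rounded to a multiple of 1/65536
    (ties upward), sp is x times the unit factor, truncated.
    The inputs here are short, so the 17-digit cut-off plays no role."""
    x = math.floor(Fraction(decimal) * 65536 + Fraction(1, 2))
    sp = math.floor(x * FACTORS[unit])
    return x, sp

for inches, cms, k in [("0.59967", "1.52316", 786), ("0.60043", "1.52510", 787)]:
    g, sp_in = to_sp(inches, "in")
    f, sp_cm = to_sp(cms, "cm")
    # Same dimension in both units, and equal to int(3613.5*k).
    assert sp_in == sp_cm == math.floor(Fraction(7227, 2) * k)
    # g is a multiple of 50, f is a multiple of 127.
    assert g % 50 == 0 and f % 127 == 0
    print(inches + "in", "=", cms + "cm", "=", sp_in, "sp")

# 0.6in rounds to g = 39322, which is not a multiple of 50.
print(to_sp("0.6", "in"))  # (39322, 2841800)
```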

update: explanation how TeX scans dimensions

It has been clarified below and in the comments that the difference arose because you originally used in as the dimension unit, whereas in your reconstruction attempt you used cm. Further, taking the difference of two dimensions originally expressed in in is not exactly the same as specifying the result directly in in, because 1in is not an integral number of sp, the smallest unit used by TeX.

In https://tex.stackexchange.com/a/231281/4686 I explained how TeX handles dimensions like abc.xyz... pt. This is from §452 of tex.pdf (texdoc tex.pdf). Turns out that part is reused also for abc.xyz... in, hence I recall the result:

the fraction abc.xyz... is rounded (the rounding being away from zero) to an integral multiple of 1/65536

in this process, the algorithm may discard (without affecting the theoretical result) all except the first 17 digits after the decimal mark.

Let's say that we have the "unpacked" result in the form of a pair (n, f) which represents at this stage the fraction (65536n+f)/65536.
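As an illustration of this rounding step, here is a Python transcription of the digit loop. The name round_decimals is borrowed from tex.web; the Python version is a sketch for experimenting, not TeX's code, and it takes the fractional digits as a string.

```python
def round_decimals(digits):
    """Turn the digits after the decimal mark into an integer f such that
    f/65536 is nearest to 0.digits, ties rounded up.
    Only the first 17 digits are used; later ones cannot change the result."""
    a = 0
    for d in reversed(digits[:17]):
        a = (a + int(d) * 131072) // 10   # 131072 = 2**17
    return (a + 1) // 2

print(round_decimals("6"))                              # 0.6 -> 39322
print(round_decimals("52400970458984374999999999999"))  # 34341
print(round_decimals("52400970458984375"))              # 34342
```

The last two calls show why the long string of 9s in the OP's value makes no difference: the input is cut after 17 digits, which leaves ...374, and that falls just below the tie at ...375.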

Now TeX takes into account the dimension unit. This is explained at §458 of tex.pdf. The easy case is pt (which was handled earlier at §453), TeX simply puts 65536*n+f in the register. For the case of in, we morally need to do (65536*n+f)*7227/100. Of course the issue here is to not create overflows with integer arithmetic. The routine which does that is at §107. Let us write x=65536*n+f. The operations (simplified here, no signs, no overflow intercept) are

t<- (x mod 32768)*7227

u<- (x div 32768)*7227 + (t div 32768)

v<- (u mod 100)*32768 + (t mod 32768)

RESULT<- 32768*(u div 100) + (v div 100)

Clarification: to be very precise, TeX applies this routine not to the integer x = 65536 n + f but to the (half-word) integer n, and the routine also returns the "remainder" v mod 100, which is then appropriately combined with f and the scale 7227/100. In this way, almost up to the end, the data is kept in the shape (N, F) representing N + F/65536, where the exact F has been truncated to an integer; at the very last step the attach_fraction sub-routine puts y = 65536*N+F into the register. But this final integer y is exactly what the procedure above would give if applied to x = 65536 n + f, which would not be possible directly due to arithmetic overflows.

Following the steps with x = 32768 q + r: t = 7227r and u = 7227q + int(7227r/32768) = 7227x/32768 - 7227r/32768 + int(7227r/32768), which means that u (which is an integer) equals 7227x/32768 - d with 0 <= d < 1. Hence u = int(7227x/32768). Writing u = 100*(u div 100) + (u mod 100) = 100*(RESULT/32768 - (v div 100)/32768) + (v/32768 - (t mod 32768)/32768) = 100*RESULT/32768 + (v mod 100)/32768 - (t mod 32768)/32768, we get RESULT = x*7227/100 - 32768*d/100 - (v mod 100)/100 + (t mod 32768)/100. But d was in fact t/32768 - int(t/32768) = (t mod 32768)/32768. Thus everything simplifies to RESULT = x*7227/100 - (v mod 100)/100. As RESULT is an integer, x*7227 is an integer, and 0 <= (v mod 100)/100 < 1, this means that RESULT is exactly the truncation of x*7227/100 to an integer.
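The claim can also be checked by brute force. Below is a small Python sketch of the four simplified steps listed above (no signs, no overflow checks), with the multiplier and divisor as parameters so that the cm factor 7227/254 can be tried too. The function name xn_over_d is taken from §107; the transcription itself is only an illustration.

```python
def xn_over_d(x, n, d):
    """Simplified steps of the overflow-free routine: returns x*n/d
    truncated to an integer, for non-negative x."""
    t = (x % 32768) * n
    u = (x // 32768) * n + t // 32768
    v = (u % d) * 32768 + t % 32768
    return 32768 * (u // d) + v // d

# Compare with plain truncated division on a range of inputs.
for x in range(200000):
    assert xn_over_d(x, 7227, 100) == x * 7227 // 100   # in
    assert xn_over_d(x, 7227, 254) == x * 7227 // 254   # cm

print(xn_over_d(39322, 7227, 100))  # 2841800
```

Python integers do not overflow, so the detour via 32768 is unnecessary here; it is kept only to mirror the steps as TeX performs them.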

We can sum up the whole thing into:

First, round the decimal abc.xyz... to an integral multiple x of 1/65536. In doing this we need to keep only 17 fractional digits of the input.

Then multiply by the conversion factor 72.27 and truncate to an integral number of sp units.

Ah, damn'd I wanted to illustrate the cm thing. Well then the conversion factor from §458 is 7227/254 and all of the above with 100 replaced by 254 goes through.

Let's look thus at 1.52400970458984374999999999999cm. First we need only 17 digits after decimal mark. Count digits and see we can simplify input to 1.52400970458984374. Then we need to multiply by 65536 and round. Take your preferred engine. Some like things like Maple, etc.. others enjoy xint.

$ rlwrap etex -jobname worksheet
This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=etex)
 restricted \write18 enabled.
**xintexpr.sty
entering extended mode
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xintexpr.sty
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xintfrac.sty
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xint.sty
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xintcore.sty
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xintkernel.sty))))
(/usr/local/texlive/2016/texmf-dist/tex/generic/xint/xinttools.sty))
*\def\x #1;{\message{\xinttheiexpr[40] #1\relax}}% (this rounds)
*\x 65536*1.52400970458984374;
99877.4999999999993446400000000000000000000000
*\x 65536*1.52400970458984375;
99877.5000000000000000000000000000000000000000

We are at a border case here: after rounding, the first case gives 99877. Then we do the second step, which is 99877*7227/254:

*\x 99877*7227/254;
2841775.9015748031496062992125984251968503937008

which we must truncate: the result is 2841775sp.

In the second case we get first via rounding 99878, and then we must evaluate 99878*7227/254:

*\x 99878*7227/254;
2841804.3543307086614173228346456692913385826772

hence the result is 2841804sp.

Let's now take the case of the 0.6 in specification. We repeat the above. First 0.6*65536=39321.6 is rounded to 39322. Then 39322*7227/100 = 2841800.94 is truncated to 2841800sp.

You will find all these values below.

For the record 1.4in: 1.4*65536=91750.4, rounded to 91750 then 91750*7227/100=6630772.5 is truncated to 6630772sp.
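These values explain the last discrepancy discussed in the comments: \pageshift=2in followed by \advance\pageshift by-\hoffset does not give the same register value as \pageshift=1.4in. The Python sketch below replays both variants in sp arithmetic. It assumes the old file used \hoffset=0.6in and the two-step conversion rule summarized above; the helper name inches_to_sp is made up for this illustration.

```python
import math
from fractions import Fraction

def inches_to_sp(decimal):
    """Round to a multiple of 1/65536 (ties upward), then multiply by
    7227/100 and truncate to whole sp."""
    x = math.floor(Fraction(decimal) * 65536 + Fraction(1, 2))
    return x * 7227 // 100

hoffset = inches_to_sp("0.6")
one_in = inches_to_sp("1")

# \pageshift=2in  \advance\pageshift by-\hoffset  \advance\pageshift by-1in
pageshift_a = inches_to_sp("2") - hoffset - one_in
# \pageshift=1.4in  \advance\pageshift by-1in
pageshift_b = inches_to_sp("1.4") - one_in

print(hoffset, inches_to_sp("2"), inches_to_sp("1.4"))  # 2841800 9472573 6630772
print(pageshift_a - pageshift_b)                          # 1
```

Each conversion truncates separately, so 2in - 0.6in ends up 1sp larger than 1.4in.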

this is not an answer but is too long for a comment

were you using cm as dimension unit back then ?

edit I think you were using 0.6in. See at bottom.

There is a sensitivity to using cm as the dimension unit (surprising to me at the moment, because I have completely forgotten the details of how TeX processes dimension input at this stage).

{
\dimen2=1.52400970458984374999999999999cm
\number\dimen2
\dimen2=1.52400970458984375cm
\number\dimen2
\dimen2=1.52400970458984374cm
\number\dimen2
}
\input xintexpr.sty
\xinttheiexpr [50] 1.52400970458984374999999999999*72.27/2.54*65536\relax
\xinttheiexpr [50] 1.52400970458984375*72.27/2.54*65536\relax
\xinttheiexpr [50] 1.52400970458984374*72.27/2.54*65536\relax
\bye

Update:

{
\dimen2=1.52400970458984374999999999999cm
\number\dimen2
\dimen2=1.52400970458984375cm
\number\dimen2
\dimen2=1.52400970458984374cm
\number\dimen2
}
\input xintexpr.sty
\xinttheiexpr [50] 1.52400970458984374999999999999*72.27/2.54*65536\relax
\xinttheiexpr [50] 1.52400970458984375*72.27/2.54*65536\relax
\xinttheiexpr [50] 1.52400970458984374*72.27/2.54*65536\relax
\number\dimexpr 2.54cm\relax
\number\dimexpr 1in\relax
\number\dimexpr 254cm\relax
\number\dimexpr 100in\relax
\xinttheiexpr [50] 1.52400970458984374/2.54\relax
{
\dimen2=0.6in
\number\dimen2
}
\bye

% TeX value
{
\dimen2=0.6in
\number\dimen2
}
% exact value:
\xinttheiexpr [10] 0.6*72.27*65536\relax
\bye

I'm not sure I understand how you concluded 0.6 in: does it give the same results as the OP's PDFs? (I haven't checked, and I don't see a mention in the answer…) Or are you saying that "0.6 in" is what is most likely the OP wrote instead of "1.52400970458984374999999999999cm"? (If the latter, nice find, and probably the right guess! Not sure if that will solve the OP's mystery though.) — ShreevatsaR, Nov 10 '16 at 18:01
@ShreevatsaR I agree with \hoffset=0.6in. Then the annotation rectangles are the same. — Heiko Oberdiek, Nov 10 '16 at 18:03
@ShreevatsaR the second: I just (belatedly ;-) ) converted Igor's amazing cm spec into in unit and it turned out very close to 0.6in. That's all ! — , Nov 10 '16 at 18:07
@HeikoOberdiek do you know the reason of the other differences? This is the new diff https://www.dropbox.com/s/0s5yqihl7cvoxwz/cweave2.diff?dl=1 — Igor Liferenko, Nov 11 '16 at 01:20
@IgorLiferenko The differences look like tiny rounding issues. If you want to debug it, then both pdfTeX versions (1.40.16 and 1.40.17) are needed as well as other files needed to compile the example (fonts, ...). If the old file can be reproduced with the older versions, then experiments can be done (e.g., running the newer pdfTeX on the old files, running the older pdfTeX on the new files, ...) to figure out, which component is responsible for the rounding changes. — Heiko Oberdiek, Nov 11 '16 at 02:44
@HeikoOberdiek Except the rounding there is strange hyphenation differences - see in the bottom of OP. Now with \hoffset=0.6in the difference of 0.001 disappeared from the diff of pdf files, but there is new complication: in ps file all values are exactly by 0.001 greater. How to remove this difference? (I convert cweave-old.pdf and cweave.pdf via pdf2ps) This is the link to the ps diff https://www.dropbox.com/s/a6jgcifl0u9f2ql/cweave-ps.diff?dl=1 — Igor Liferenko, Nov 11 '16 at 02:53
@HeikoOberdiek pdfTeX is not the issue - the problem is in cwebmac.tex. This is the link to another old pdf file which was created at the same time as the first old pdf file https://www.dropbox.com/s/l6a2jhted2f14kc/cweave-old-default.pdf?dl=1 Compare it with pdf file generated the same way as described in OP, but without changing cwebmac.tex. Now the pdf files and ps files do not differ. So, the reason is only somewhere in cwebmac.tex. How to figure it out based on the diff of the pdf files? — Igor Liferenko, Nov 11 '16 at 03:25
@IgorLiferenko Why don't you diff the cwebmac.tex files? Which versions of that file are you comparing? — cfr, Nov 11 '16 at 03:39
@cfr that's the point - I don't have the old cwebmac.tex file. I want to recreate it based on comparing the resulting pdf files. — Igor Liferenko, Nov 11 '16 at 03:46
@IgorLiferenko But why? Why not get a copy of the relevant file? Even if you don't know the version for sure, you must be able to estimate from the PDF compilation date. I doubt this file has undergone frequent radical changes. In the decade between 2006 and 2016, the file was revised only once with only 4 changes aside from the lines bumping the version. — cfr, Nov 11 '16 at 03:52
@HeikoOberdiek It is done! Strangely enough, when I replaced \pageshift=2in \advance\pageshift by-\hoffset with \pageshift=1.4in all the remaining differences disappeared altogether. But why? — Igor Liferenko, Nov 11 '16 at 04:58
@IgorLiferenko The smallest unit is 1 sp in TeX. Unhappily, 1 in is not a multiple of 1 sp, therefore rounding occurs. Compare 9472573 sp (2 in) - 2841800 sp (.6 in) = 6630773 sp (2 in - 0.6 in) with 6630772 sp (1.4 in). — Heiko Oberdiek, Nov 11 '16 at 05:52
