Speed of includegraphics seemingly dependent on how many packages are loaded!

Question

Here is test file for (Plain) pdftex. To test it, you need some graphics file foo.pdf in working repertory. Mine is a copy of the file found via kpsewhich example-image-a.pdf. Call the following testspeedgraphics.tex and execute pdftex testspeedgraphics on command line.

\input graphicx.tex 

\def\B{\noexpand\A\noexpand\A\noexpand\A\noexpand\A\noexpand\A%
       \noexpand\A\noexpand\A\noexpand\A\noexpand\A\noexpand\A}%
\edef\C{\B\B\B\B\B\B\B\B\B\B}% 100 \A

\def\A{\noexpand\C}%
\edef\D{\C}% 100 \C, each one expanding to 100 \A

\def\A{\setbox0\hbox{\foo}}

\def\foo{\includegraphics{foo.pdf}}

\pdfresettimer
\D % 10000 usage of \includegraphics
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input xintexpr.sty

\pdfresettimer
\D % 10000 usage of \includegraphics
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input tikz.tex

\pdfresettimer
\D % 10000 usage of \includegraphics
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input xlop.tex

\pdfresettimer
\D % 10000 usage of \includegraphics
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\bye

This file will do 4 times 10000 \includegraphics{foo.pdf} in \box0. We load more and more packages. On a 2.8GHz computer I get this typically in the console output:

**** 1.12306pt****

(loading of xintexpr)

**** 1.19366pt****

(loading of tikz)

**** 1.36714pt****

(loading of xlop)

**** 1.40392pt****

Hence there is increase of timings, and one feels the bigger the package the more the impact.

Now comment out all loading of extra packages in test file above and repeat experiment. I get typically

**** 1.13177pt****



**** 1.12141pt****



**** 1.14122pt****



**** 1.12416pt****

i.e. there no timings drift...

Now another test file, where we still include graphicx.tex but do no usage of \includegraphics, rather we expand a dummy macro \foo. This being much faster we do 1000000 repetitions.

\input graphicx.tex 

\def\B{\noexpand\A\noexpand\A\noexpand\A\noexpand\A\noexpand\A%
       \noexpand\A\noexpand\A\noexpand\A\noexpand\A\noexpand\A}%
\edef\C{\B\B\B\B\B\B\B\B\B\B}% 100 \A

\def\A{\noexpand\C}%
\edef\D{\C}% 100 \C, each one expanding to 100 \A

\def\A{\noexpand\D}%
\edef\E{\C}% 100 \D, each one expanding to 100 \C

\def\A{\setbox0\hbox{\foo}}

\def\foo{foo}

\pdfresettimer
\E % 1000000 "foo"
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input xintexpr.sty

\pdfresettimer
\E % 1000000 "foo"
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input tikz.tex

\pdfresettimer
\E % 1000000 "foo"
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\input xlop.tex

\pdfresettimer
\E % 1000000 "foo"
\edef\zzz{\the\dimexpr\pdfelapsedtime sp}
\message{^^J^^J**** \zzz ****^^J^^J}

\bye

This test file does all the \input's of extra packages. Typically I get from pdftex testspeedfoo.tex:

**** 0.48016pt****

(xintexpr)

**** 0.49115pt****

(tikz)

**** 0.48283pt****

(xlop)

**** 0.47798pt****

i.e. no impact at all.

Now, why is there impact with \includegraphics. Is it simply because its expansion is much more complex, and if yes what is explanation? or is it something specific related to \includegraphics dealings and then again what is explanation?

Perhaps something having to do with hash-table? the more macros are defined the less efficient TeX is in expanding macros? (then the thing would be not \includegraphics specific, I stopped my testing there, leaving the experts to express their intuitions first).

Originally, this question arose in the context of a Joseph answer using xfp and \includegraphics. I wanted to test if using draft option of \includegraphics had an impact and then realized that loading or not xfp changed the timings. Then I realized it had nothing xfp specific, but any big package would do.

I also tested with this variant of \foo:

\def\A{\setbox0\hbox{\foo}}

\def\foo{\fooa}
\def\fooa{\foob}
\def\foob{\fooc}
\def\fooc{\food}
\def\food{\fooe}
\def\fooe{\foof}
\def\foof{\foog}
\def\foog{\fooh}
\def\fooh{\fooi}
\def\fooi{\fooj}
\def\fooj{\fook}
\def\fook{\fool}
\def\fool{\foom}
\def\foom{\foon}
\def\foon{\fooo}
\def\fooo{\foop}
\def\foop{\fooq}
\def\fooq{\foor}
\def\foor{\foos}
\def\foos{\foot}
\def\foot{\foou}
\def\foou{\foov}
\def\foov{\foow}
\def\foow{\foox}
\def\foox{\fooy}
\def\fooy{\fooz}
\def\fooz{\fooA}
\def\fooA{\fooB}
\def\fooB{\fooC}
\def\fooC{\fooD}
\def\fooD{\fooE}
\def\fooE{\fooF}
\def\fooF{\fooG}
\def\fooG{\fooH}
\def\fooH{\fooI}
\def\fooI{\fooJ}
\def\fooJ{\fooK}
\def\fooK{\fooL}
\def\fooL{\fooM}
\def\fooM{\fooN}
\def\fooN{\fooO}
\def\fooO{\fooP}
\def\fooP{\fooQ}
\def\fooQ{\fooR}
\def\fooR{\fooS}
\def\fooS{\fooT}
\def\fooT{\fooU}
\def\fooU{\fooV}
\def\fooV{\fooW}
\def\fooW{\fooX}
\def\fooX{\fooY}
\def\fooY{\fooZ}
\def\fooZ{\fooaa}
\def\fooaa{\foobb}
\def\foobb{\foocc}
\def\foocc{\foodd}
\def\foodd{\fooee}
\def\fooee{\fooff}
\def\fooff{\foogg}
\def\foogg{\foohh}
\def\foohh{\fooii}
\def\fooii{\foojj}
\def\foojj{\fookk}
\def\fookk{\fooll}
\def\fooll{\foomm}
\def\foomm{\foonn}
\def\foonn{\foooo}
\def\foooo{\foopp}
\def\foopp{\fooqq}
\def\fooqq{\foorr}
\def\foorr{\fooss}
\def\fooss{\foott}
\def\foott{\foouu}
\def\foouu{\foovv}
\def\foovv{\fooww}
\def\fooww{\fooxx}
\def\fooxx{\fooyy}
\def\fooyy{\foozz}
\def\foozz{\fooAA}
\def\fooAA{\fooBB}
\def\fooBB{\fooCC}
\def\fooCC{\fooDD}
\def\fooDD{\fooEE}
\def\fooEE{\fooFF}
\def\fooFF{\fooGG}
\def\fooGG{\fooHH}
\def\fooHH{\fooII}
\def\fooII{\fooJJ}
\def\fooJJ{\fooKK}
\def\fooKK{\fooLL}
\def\fooLL{\fooMM}
\def\fooMM{\fooNN}
\def\fooNN{\fooOO}
\def\fooOO{\fooPP}
\def\fooPP{\fooQQ}
\def\fooQQ{\fooRR}
\def\fooRR{\fooSS}
\def\fooSS{\fooTT}
\def\fooTT{\fooUU}
\def\fooUU{\fooVV}
\def\fooVV{\fooWW}
\def\fooWW{\fooXX}
\def\fooXX{\fooYY}
\def\fooYY{\fooZZ}
\def\fooZZ{foo}

to try to emulate the case with expansions of many distinct macros. But this does not show any drifting when loading packages, i.e. I don't reproduce with that the \includegraphics situation: all four executions of 1000000's \A take each about 2.9s--3s on my 2.8GHz computer.

Haven't looked at this, so a few remarks: (1) It is plausible that the answer has something to do with hash tables: as more entries are inserted, there are more hash collisions, which means a longer list to search in, to find the right element. (But IIRC there are some tricks like bringing the most recently accessed element to the front of the list…) (2) Note that the test involves more than merely expanding macros… (3) The fact that the times vary so much (in your second example, i.e. “foo”) makes me think \pdfelapsedtime may not be the best to measure… (4) which distribution? TL on Linux? — ShreevatsaR, Aug 26 '18 at 20:40
@ShreevatsaR I am on mac os x (10.9.5), with TL2018 and self-compiled binaries. — , Aug 27 '18 at 16:40
@ShreevatsaR about the timings varying, yes, but I always observe about that amount of intrinsic variations with \pdfelapsedtime. I did not dis-conect internet access during testing, and this is main source of perturbance (mail and browser software, Apple downloading all my files without me knowing it, etc...) — , Aug 27 '18 at 16:47
I tried unpacking \includegraphics a bit, to see which part of its definition contributed to the difference. Though there may be multiple parts, try this: \def\foo{\filename@parse {foo.pdf} \Gin@getbase {\Gin@sepdefault \filename@ext }} -- this seems to exhibit a similar stark difference (especially between first two and last two). (Also, replacing \Gin@sepdefault by its expansion . seems to make the program quite a bit slower!) — ShreevatsaR, Aug 31 '18 at 18:13
Without any packages :-) \def\foo{\openin0{foo.pdf}\closein0{foo.pdf}} (Though I repeat that this may not be the only reason for \includegraphics as a whole...) — ShreevatsaR, Aug 31 '18 at 18:35
This sort of thing would probably be easier to debug in LuaTeX where the code that's compiled is not so unreadable, but unfortunately(?) the behaviour doesn't seem to carry over to LuaTeX. — ShreevatsaR, Sep 01 '18 at 01:16
Something interesting and possibly relevant: compare the two: (1) if we save the filename first into a macro, as \def\filename{foo.pdf} and then \def\foo{\openin0\filename\closein0\filename}, then it's faster but similar pattern. (2) But if we save the filename first as \toks0={foo.pdf} and then \def\foo{\openin0\the\toks0\closein0\the\toks0} then the observed times follow a different pattern. — ShreevatsaR, Sep 01 '18 at 01:41
I found the culprit (using some combination of gdb, scripts, and gdb scripts), it's the function zsearchstring in pdftex0.c, specifically its loop like while (s > 255) { if (strstart[s+1] - strstart[s] == len) { ... } decr(s); } which runs much more frequently (basically the function is called with a larger strnumber each time). I'll try to post an answer eventually (maybe after understanding it even better -- though I think have a fair idea now). — ShreevatsaR, Sep 07 '18 at 17:01
(Incidentally, this appears to not be code that DEK wrote... it's in "[54/web2c-string] The string recycling routines. [...] TeX uses 2 upto 4 new strings when scanning a filename in an \input, \openin, or \openout operation. These strings are normally lost because the reference to them are not saved after finishing the operation. search_string searches through the string pool for the given string and returns either 0 or the found string number." -- Why/how did someone think it a good idea to search through the entire string pool for a particular string?!) — ShreevatsaR, Sep 07 '18 at 17:03
One possible reason (speculating) might be awful memory constraints. Probably it would be much better now to just cut out this string-recycling, instead of going to absurd lengths to reuse the same string for the filename. Or just change the entire string pool mechanism altogether, let such strings be garbage-collected, etc. — ShreevatsaR, Sep 07 '18 at 17:07
Short version of answer: As TeX does things like define macros, it adds strings to a string pool. In the implementations of TeX based roughly on web2c (such as those distributed with TeX Live and MikTeX), all file operations (those that invoke something like \openin, like \includegraphics in this example) involve a search through the entire string pool for the filename. So they will get slower as the string pool gets larger (such as when we load new packages that define macros). — ShreevatsaR, Sep 07 '18 at 17:51
@ShreevatsaR this is for sure one of the best and thoroughest answers I ever read on this site! Thanks a lot for the detailed and well-written report of your time-costly investigation. Although having my ancestors from a French region known to be very economical of their money, I am ready to offer a stupendous bounty prize of at least 100 points! — , Sep 09 '18 at 09:13

score 29 · Accepted Answer · answered Sep 09 '18 at 03:45

This was quite a puzzle.

This answer will be long, to match the long time it took me to get to the bottom of this, but the sections are titled and numbered so you can skip over the ones you don't care about. You can even skip to the one-sentence summary at the end. :-)

1. The debugging process

This will describe how I arrived at the answer. If you don't care, and only want the answer, you can skip to the next section.

1.1. Reproducing the problem

I took the file at the top of the question, saved it into a file jfbu1.tex, also created foo.pdf by copying it as cp $(kpsewhich example-image-a.pdf) foo.pdf, then ran pdftex jfbu1.tex, and sure enough, saw output like:

This is pdfTeX, Version [etc, some lines skipped]

**** 0.87534pt****

 (/usr/share/texlive/texmf-dist/tex/generic/xint/xintexpr.sty (...))

**** 0.93637pt****

 (/usr/share/texlive/texmf-dist/tex/plain/pgf/frontendlayer/tikz.tex (...))

**** 1.03432pt****

 (/usr/share/texlive/texmf-dist/tex/generic/xlop/xlop.tex [...])

**** 1.06012pt****

where the numbers are increasing.

1.2. Simplifying the test case

As what was being measured involved \includegraphics{foo.pdf}, I tried to unpack the definition of \includegraphics to see if it could be further simplified. (I also tried other things, like replacing \D with 10000 occurrences of the expansion of \A directly, but that didn't make a difference.) Well, \show\includegraphics shows that its definition is \leavevmode \@ifstar {\Gin@cliptrue \Gin@i }{\Gin@clipfalse \Gin@i }, so let's replace \includegraphics in our file with that, see if the phenomenon is still observed, and then try removing parts of it. (E.g. the first case of the \@ifstar conditional is probably irrelevant, as we don't have a * in this example, so we won't take that path.)

I did this a little, mostly relying on \show, and cursing the complexity of LaTeX (just writing out the full definition of \includegraphics in terms of TeX primitives would be immense). (Small tip: next time use the LaTeX source code rather than \show; it's a bit easier to read.)

Anyway, after a few steps of this (a few steps deep into the definition of \includegraphics), I observed that we could replace the original file's

\def\foo{\includegraphics{foo.pdf}}

with (using just an early part of the definition of \includegraphics):

\def\foo{\filename@parse {foo.pdf} \Gin@getbase {\Gin@sepdefault \filename@ext }}

and the phenomenon of time increasing as we load more packages was still observed (though it ran faster). In fact, proceeding a bit, this was sufficient:

\def\foo{\IfFileExists{foo.pdf}{}{}}

— apparently, simply testing whether a file exists takes longer as we load more packages!

A few more steps, and we can get down to TeX primitives (no loading of graphicx necessary):

\def\foo{\openin0{foo.pdf}\closein0{foo.pdf}}

— which simply opens and closes the file (apparently this is part of how \IfFileExists works internally).

After making a few more changes — for example, we don't need the trick to convert \pdfelapsedtime into seconds(?) as we care more about the fact that it's increasing than about its actual value, and the phenomenon is consistently observed even with much fewer repetitions instead of 10000 (say 81, replacing 10 with 3) — we arrive at the following slightly simpler file jfbu2.tex, where the phenomenon is still observed:

\def\B{\noexpand\A\noexpand\A\noexpand\A}
\edef\C{\B\B\B} % Now the definition of \C is 9 times \A
\def\A{\noexpand\C}
\edef\D{\C} % Expands to 9 \C, each of which will expand to 9 \A

\def\filename{foo.pdf}
\def\A{\setbox0\hbox{\openin0\filename\closein0\filename}}


\pdfresettimer\the\pdfelapsedtime
\D
\message{^^J^^J****\the\pdfelapsedtime****^^J^^J}

\input xintexpr.sty

\pdfresettimer\the\pdfelapsedtime
\D
\message{^^J^^J****\the\pdfelapsedtime****^^J^^J}

\input tikz.tex

\pdfresettimer\the\pdfelapsedtime
\D
\message{^^J^^J****\the\pdfelapsedtime****^^J^^J}

\input xlop.tex

\pdfresettimer\the\pdfelapsedtime
\D
\message{^^J^^J****\the\pdfelapsedtime****^^J^^J}

\bye

and this still prints output like (not always the same numbers, but the pattern of increase is consistent):

This is pdfTeX, Version [etc, some lines skipped]

***74***

[loading of xintexpr.sty]

***96***

[loading of tikz.tex]

****142****

[loading of xlop.tex]

****149****

(About the useless \the\pdfelapsedtime before the \D, which simply inserts "0" into the typeset output, I'll say more below.)

And the mystery remains: the \A above seems like it should just do a constant number of operations, why does it take longer as more packages are loaded?

Now that we're down to TeX primitives, there's nothing further to unpack at the macro (the “TeX programming”) level, and we need to look into the TeX program itself.

1.3. Debugging with source code

We can step through the TeX program with a debugger like gdb.

To do this it would usually be easier to work with the LuaTeX source code, as that's (sort of) written directly in C, rather than the source code of TeX/pdfTeX/XeTeX, as they undergo several rounds of mangling from the .web source to the C code before being compiled. (So the C code that is finally compiled is more readable in the case of LuaTeX.) But it turns out (after appropriate definitions of \pdfelapsedtime) that this phenomenon we're interested in does not happen in LuaTeX, so we're stuck with pdfTeX. (It does happen in TeX / XeTeX, where without the convenience of \pdfelapsedtime we can just observe visually that it takes longer.)

1.3.1. Building for `gdb`

To use gdb, the program needs to be compiled with -g, and also binaries shouldn't be stripped. Fortunately I'd done this before: the trick is, when building from TeX Live sources (see here and here), instead of

./Build

to use

./Build --no-clean -g

and update $PATH to use the new binaries appropriately. (Or, if using make install strip, should replace with make install.)

1.3.2. First steps with gdb

One can start gdb with a command like gdb pdftex (make sure PATH is right, or else specify the full path to our specially compiled pdftex binary). Then, one can set breakpoints, before running the program (as if we'd run pdftex jfbu2.tex on the commandline) with run jfbu2.tex.

Which breakpoints to set? We'd like to stop when some particular function is called, which doesn't get called too often. My choice was the function called by \pdfelapsedtime (though in hindsight I guess using the one for pdfresettimer would have been better) which with some looking at the source code and/or gdb, happens to be (or call) getmicrointerval. (This is the reason for the “extra” \the\pdfelapsedtime in the file above, because I want to break there.)

So we can start gdb, set break getmicrointerval, and run the program, and it will stop after reaching the place where the function is called. Then we can type continue to continue until the next breakpoint (or end of program), or type next to invoke the next statement of the program (stepping over function calls, i.e. not descending into them) or step to do the same while stepping into function calls. As you keep hitting Enter, it will show you each function that's called, and each line of source that's executed.

After doing this a little, it's clear that it will take a long time to do this manually.

1.3.3. Scripting gdb

Long story short: put the following in ~/.gdbinit:

define mystep
    step
    refresh
end

define keepstepping
  while(1)
    step
  end
end

set pagination off
set logging on
file pdftex
break getmicrointerval
run jfbu2.tex
continue

keepstepping

This is like typing "step" and hitting Enter a few million times manually until the program finishes, and everything that gdb outputs will be written to file gdb.txt.

With this, the whole thing ran for a few hours, and produced a gdb.txt that was over 700 MB in size, from over 20 million lines.

The start of the file looks something like this:

Breakpoint 1 at 0x84bad: file pdftex0.c, line 3471.

Breakpoint 1, getmicrointerval () at pdftex0.c:3471
3471      secondsandmicros ( s , m ) ;

Breakpoint 1, getmicrointerval () at pdftex0.c:3471
3471      secondsandmicros ( s , m ) ;
get_seconds_and_micros (seconds=0x7fffffffd92c, micros=0x7fffffffd928) at ../../../texk/web2c/lib/texmfmp.c:2329
2329      gettimeofday(&tv, NULL);
2330      *seconds = tv.tv_sec;
2331      *micros  = tv.tv_usec;
2342    }
getmicrointerval () at pdftex0.c:3472
3472      if ( ( s - epochseconds ) > 32767 )
3474      else if ( ( microseconds > m ) )
3477      else Result = ( ( s - epochseconds ) * 65536L ) + ( ( ( m - microseconds )
3478      / ((double) 100 ) ) * 65536L ) / ((double) 10000 ) ;
3477      else Result = ( ( s - epochseconds ) * 65536L ) + ( ( ( m - microseconds )
3479      return Result ;
3480    }
zscansomethinginternal (level=5 '\005', negative=0) at pdftex0.c:11926
11926             break ;
11990           curvallevel = 0 ;
12059       break ;
12115     while ( curvallevel > level ) {
12123     if ( negative ) {

(The first Breakpoint 1 at 0x84bad: file pdftex0.c, line 3471 is printed when we set the breakpoint; we had continue after the first time gdb paused at the breakpoint so there's no output until the next.) The part shown in the output above is common to each time \pdfelapsedtime is called (we haven't even got to the \D part yet).

Of course we can't process this 20-million several-hundred-megabyte file by reading through it manually. In fact, grep --line-number Breakpoint gdb.txt can be used to see the number of program steps executed between successive occurrences of \pdfelapsedtime (calls to getmicrointerval):

4:Breakpoint 1 at 0x84bad: file pdftex0.c, line 3471.
9:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
12:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
100418:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
3631431:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
3766236:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
11906822:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
12159055:Breakpoint 1, getmicrointerval () at pdftex0.c:3471
20605166:Breakpoint 1, getmicrointerval () at pdftex0.c:3471

This shows that

loading xintexpr took about 100418-12=100406 steps,
the \D after that took about 3631431-100418=3531013 steps,
loading tikz took about 3766236-3631431=134805 steps,
the \D after that took about 11906822-3766236=8140586 steps,
loading xlop took about 12159055-11906822=252233 steps,
the \D after that took about 20605166-12159055=8446111 steps

where we can see the increase in the numbers in bold above. (Missed the first \D because of the “continue”.)

1.3.4. Processing gdb.txt

The main idea is that although the file is 20 million lines long, the set of different lines executed is much smaller, and what we want to compare is which lines were executed more frequently between each successive pairs of breakpoints.

We can keep counters of which lines were executed how many times between successive occurrences of “Breakpoint 1, getmicrointerval” in the file. Used the following Python script:

pattern = 'Breakpoint 1, getmicrointerval'
f = open('gdb.txt', 'r')
line = f.readline()
while pattern not in line:
  line = f.readline()
  print line
# Now, line has an occurrence of pattern

# Counter 0: From occurrence 0 to occurrence 1
# Counter 1: From occurrence 1 to occurrence 2
# etc.
from collections import Counter
c = {}

i = 0
while line:
  assert pattern in line
  line = f.readline()
  cur = Counter()
  while pattern not in line:
    cur[line] += 1
    line = f.readline()
    if not line: break

  if i % 2 == 0 and i > 0:
    for _ in range(10): print
    print i
    frequent = cur.most_common(61)
    out = [(-count, l) for (l, count) in frequent]
    for (count, l) in sorted(out):
      print '%d\t\t%s' % (-count, l),

  c[i] = cur
  i += 1

This is the first few lines of output for 2 (the second occurrence of \D). The first column is the number of times executed, then what's printed by gdb (usually the line number and source line).

2
975726          1067        while ( s > 255 ) {
975726          1069          if ( ( strstart [s + 1 ]- strstart [s ]) == len ) {
975564          1077          decr ( s ) ;
25596           1034      if ( ( strstart [s + 1 ]- strstart [s ]) != ( strstart [t + 1 ]-
13932           1039      while ( j < strstart [s + 1 ]) {
13770           1041        if ( strpool [j ]!= strpool [k ])
12798           1033      result = false ;
12798           1035      strstart [t ]) )
12798           1037      j = strstart [s ];
12798           1038      k = strstart [t ];
12798           1047      lab45: Result = result ;
12798           1048      return Result ;
12798           1049    }
12798           1071            if ( streqstr ( s , search ) )
12636           1042        goto lab45 ;
12636           zsearchstring (search=9213) at pdftex0.c:1077
11988           54        while (*p != 0  && !(brace_level == 0
6804            37        while (*key != 0)
6480            44          n = (n + n + TRANSFORM (*key++)) % table.size;
5994            55                             && (env_p ? IS_ENV_SEP (*p) : IS_DIR_SEP (*p)))) {
5832            56          if (*p == '{') ++brace_level;
5832            57          else if (*p == '}') --brace_level;
5832            62          p++;
3429            9427      lab20: curcs = 0 ;
3429            9428      if ( curinput .statefield != 0 )
3417            9909      else if ( curinput .locfield != -268435455L )

Compare with that for 4 (the third occurrence of \D, after tikz is loaded):

4
2502900         1067        while ( s > 255 ) {
2502900         1069          if ( ( strstart [s + 1 ]- strstart [s ]) == len ) {
2502738         1077          decr ( s ) ;
29322           1034      if ( ( strstart [s + 1 ]- strstart [s ]) != ( strstart [t + 1 ]-
16038           1039      while ( j < strstart [s + 1 ]) {
15876           1041        if ( strpool [j ]!= strpool [k ])
14661           1033      result = false ;
14661           1035      strstart [t ]) )
14661           1037      j = strstart [s ];
14661           1038      k = strstart [t ];
14661           1047      lab45: Result = result ;
14661           1048      return Result ;
14661           1049    }
14661           1071            if ( streqstr ( s , search ) )
14499           1042        goto lab45 ;
14499           zsearchstring (search=18640) at pdftex0.c:1077
11988           54        while (*p != 0  && !(brace_level == 0
6804            37        while (*key != 0)
6480            44          n = (n + n + TRANSFORM (*key++)) % table.size;
5994            55                             && (env_p ? IS_ENV_SEP (*p) : IS_DIR_SEP (*p)))) {
5832            56          if (*p == '{') ++brace_level;
5832            57          else if (*p == '}') --brace_level;
5832            62          p++;
3429            9427      lab20: curcs = 0 ;
3429            9428      if ( curinput .statefield != 0 )
3417            9909      else if ( curinput .locfield != -268435455L )

and for 6 (the last one, after xlop is loaded):

6
2603826         1067        while ( s > 255 ) {
2603826         1069          if ( ( strstart [s + 1 ]- strstart [s ]) == len ) {
2603664         1077          decr ( s ) ;
29646           1034      if ( ( strstart [s + 1 ]- strstart [s ]) != ( strstart [t + 1 ]-
16200           1039      while ( j < strstart [s + 1 ]) {
16038           1041        if ( strpool [j ]!= strpool [k ])
14823           1033      result = false ;
14823           1035      strstart [t ]) )
14823           1037      j = strstart [s ];
14823           1038      k = strstart [t ];
14823           1047      lab45: Result = result ;
14823           1048      return Result ;
14823           1049    }
14823           1071            if ( streqstr ( s , search ) )
14661           1042        goto lab45 ;
14661           zsearchstring (search=19263) at pdftex0.c:1077
11988           54        while (*p != 0  && !(brace_level == 0
6804            37        while (*key != 0)
6480            44          n = (n + n + TRANSFORM (*key++)) % table.size;
5994            55                             && (env_p ? IS_ENV_SEP (*p) : IS_DIR_SEP (*p)))) {
5832            56          if (*p == '{') ++brace_level;
5832            57          else if (*p == '}') --brace_level;
5832            62          p++;
3429            9427      lab20: curcs = 0 ;
3429            9428      if ( curinput .statefield != 0 )
3417            9909      else if ( curinput .locfield != -268435455L )

1.3.5. Comparing the output

We can just do this visually, by say opening each in a separate tab and switching between them. For example, the (most frequent) while ( s > 255 ) { loop or test is performed 2502900 times after tikz.tex is loaded, compared to 975726 times before. Everything after (less frequent than) zsearchstring is run the same number of times (among statements executed at least 500 times say), and everything above that is from inside the zsearchstring function, or from the zstreqstr function just above (called from zsearchstring). So the culprit is entirely this zsearchstring function in pdftex0.c.

If we understand what this zsearchstring is and why it's called, it concludes the debugging process.

2. Understanding what we found

If you skipped the previous section: so far we've found that all the additional work between different calls of \D happens in the function zsearchstring in pdftex0.c, which seems to be invoked more times (and executes a lot more operations) as more packages are loaded. Why?

2.1. What is `zsearchstring`?

2.1.1. Locating it in the source code

We can see the entire definition of zsearchstring in pdftex0.c or for that matter in tex0.c (which are both inside Build/source/Work/texk/web2c/ in the texlive directory):

strnumber
zsearchstring ( strnumber search )
{
  /* 40 */ register strnumber Result; searchstring_regmem
  strnumber result  ;
  strnumber s  ;
  integer len  ;
  result = 0 ;
  len = ( strstart [search + 1 ]- strstart [search ]) ;
  if ( len == 0 )
  {
    result = 345 ;
    goto lab40 ;
  }
  else {

    s = search - 1 ;
    while ( s > 255 ) {

      if ( ( strstart [s + 1 ]- strstart [s ]) == len ) {

        if ( streqstr ( s , search ) )
        {
          result = s ;
          goto lab40 ;
        }
      }
      decr ( s ) ;
    }
  }
  lab40: Result = result ;
  return Result ;
}

But at first glance it does not appear to be used anywhere else in the file. That's because in pdftexcoerce.h (or texcoerce.h) you'll find a declaration and a macro defined as it:

strnumber zsearchstring (strnumber search);
#define searchstring(search) zsearchstring((strnumber) (search))

and you can indeed find searchstring used a few times in pdftex0.c or tex0.c.

This C code is somewhat harder to read than necessary though. In fact, the list of files where searchstring is found includes tex.p, which is presumably the result of tangling tex.web. Yet if you look in the TeX source code (with texdoc tex say), you will not find this function, as it's not part of the code that Knuth wrote. It's instead part of the “system-dependent changes” — changes made in web2c to produce a working TeX program. Instead you need to look at the “complete” (pdf)TeX source, with the change files too. Something like the following (assuming that texlive is the texlive directory):

weave Build/source/texk/web2c/pdftexdir/pdftex.web Build/source/Work/texk/web2c/pdftex.ch

to produce a pdftex.tex file, followed by pdftex pdftex.tex (after optionally changing the \input webmac to \input pdfwebmac). (One could also look in the .ch file directly, but WEB code is ugly and is best not looked at directly.)

Now we can look in the resulting PDF for search_string.

2.1.2. The definition, documentation, and usage of `search_string`

Here's the definition of search_string; compare with our earlier zsearchstring C code above (generated from this):

This finally explains what search_string is and why it exists. (We'll say more below.) Looking at places where it's used makes things even more clear. It's used in three procedures: end_name, start_input, and slow_make_string. Let's look at the first two:

Compare with the corresponding sections in the TeX program: §517 and §537, which don't use search_string after calling make_name_string. It's worth looking at the definition (at least the documentation / context) of that too:

2.2. Relation to file operations

We saw above that these functions are called “when scanning a filename in an \input, \openin or \openout operation”. This of course includes the \includegraphics{foo.pdf} example in the question, and the \openin in the reduced example.

Note that sometimes scanning does not require creating a string: we can see this by changing the test case to:

\toks0={foo.pdf}
\def\A{\setbox0\hbox{\openin0\the\toks0\closein0\the\toks0}}

where the phenomenon is not observed. (Well I haven't run it through gdb, but the numbers do not increase.)

Also see why working with filenames required system-dependent changes in the first place — at the time TeX was developed, file names were very inconsistent across operating systems; in fact in the place where TeX was developed (SAIL), file names consisted of a “base”, “extension” and an “area” that included the user's initials and project (or something like that).

2.3. What is the string pool, etc?

Some background, for understanding the context of the code we saw above. At the time Knuth was originally (re)writing TeX (1980–1982), the programming language Pascal (at least the version available to him and at many places where TeX was going to be used) did not have good support for strings. So TeX basically takes care of allocating all strings manually: there's a giant array of characters called str_pool, initialized at the start of the program, and whenever TeX needs to store a new string, it stores the characters of the new string (as it's being built up, e.g. scanned from the input file) at successive indices into this array. For example, the kth string starts at str_pool[str_start[k]] and goes up to str_pool[str_start[k+1]-1]. Or you can read this in the program:

Note that the string pool is just an array, and is not optimized for finding strings in it: the TeX program as originally written saves references to whatever strings are needed (e.g. it will save “k”, and thereby know where to find the kth string). It never needs to look through all strings in the array for a particular string, any more than it's reasonable to search through all bytes of a computer's memory looking for a particular value.

But when the system-dependent changes for web2c were made (a long time ago), a function slow_make_string was introduced which before saving a string, searches the entire string pool(!) to see whether it's already present under some other name (number). If so, the same string (number) is reused. This explains the very frequent (executed millions of times) loop of

while ( s > 255 ) {
  if ( ( strstart [s + 1 ]- strstart [s ]) == len ) {
  ...
  }
  decr ( s ) ;
}

that we saw in gdb: it's searching through all string numbers s, starting at the largest (most recent) value.

It appears that this may have made sense when memory constraints were tighter than time constraints (you can always wait longer), especially as it would also mean the string pool was small so there was a smaller limit on how much time would be spent searching through the entirety of it. At current memory sizes (and memory access times, which have over the last many decades consistently become more expensive relative to (arithmetic) CPU instructions) it may be worth reconsidering...

(TeX as originally written does not do this. From some of the documentation it's worded as though TeX simply creates this new string, stores no reference to it, and moves ahead, which sounds like a typical memory leak bug — possibly worthy of one of those reward cheques from DEK? :P — but from looking at some of the code it seems rather that TeX unconditionally flushes the string, so it's rather the case that the changed (web2c) TeX wants to preserve a reference for some reason, so it needs this workaround... it's not clear to me which is the case.)

2.4. Other TeX distributions

Apart from TeX Live, I took a look at MiKTeX, and it has nearly identical code for these sections. (Just renamed from “54/web2c-string” to “54/MiKTeX-string”.) I have not been able to look at other less common (not based on web2c) TeX distributions, like KerTeX or TeX-gpc, nor of course of closed-source (commercial) distributions like BaKoMa TeX or Texpad.

2.5. Seeing string pool usage

At the end of a TeX run, if \tracingstats=1, the program prints statistics to the log file (“Here is how much of TeX's memory you used”). These are the results by moving the \bye to different places in the above file (after adding \tracingstats=1):

At the top of the file (just after \tracingstats=1):

5 strings out of 495042
126 string characters out of 6159513

After \A, \B, \C, \D have been first defined:
```
5 strings out of 495042
126 string characters out of 6159513
```
(Doesn't change because single-letter names are not stored separately.)
After those and also \filename has been defined:
```
6 strings out of 495042
134 string characters out of 6159513
```
(Makes sense: \filename is one string, and 8 characters long.)
Just before the first \D:
```
8 strings out of 495042
153 string characters out of 6159513
```
(The two strings of 19 bytes total are not the primitives \pdfresettimer and \pdfelapsedtime (those would be already stored), but rather something created by \the\pdfelapsedtime. Not sure of the details.)

Just after the first \D, or just after the first \message:

9 strings out of 495042
156 string characters out of 6159513

After \input xintexpr is loaded (and any place before \input tikz):

3815 strings out of 495042
61780 string characters out of 6159513

After \input tikz:

13243 strings out of 495042
266711 string characters out of 6159513

(Note the large increase compared to earlier.)

After \input xlop (or end of file):

13866 strings out of 495042
274144 string characters out of 6159513

These relative increases in the size of the string pool roughly match the relative increases in the time for executing \D.

3. Summary / answer

In the common implementations of TeX, commands in which TeX scans for a file name (as in the case of \includegraphics) involve searching through the entire string pool, and this gets slower as more packages are loaded because the packages define control sequences (macros) whose names are stored in the string pool.

This should I think be raised on the TL list or perhaps the pdfTeX dev one — Joseph Wright, Sep 10 '18 at 11:13
@JosephWright Thanks, you're probably right. However, I'm hesistant to do this (maybe someone else can), for two reasons: (1) As the above long story shows, most of this is stuff I only learned recently, and I am not confident I really understand things well enough to post there, (2) There may also not be any “real” problem: in this example with 10000 uses of \includegraphics and some reasonably nontrivial packages loaded the measureable delay was less than 10 seconds, which amounts to less than a millisecond per such a file-operation command. Computers are fast! So the impact is small... — ShreevatsaR, Sep 11 '18 at 01:54
The 2020 tune-up adds a note to §537 (the definition of start_input): “For historic reasons, this code foolishly conserves a tiny bit of string pool space”. — texdr.aft, Feb 10 '21 at 04:43