
I'm working on a document which incorporates a lot of figures, tables, etc. The PDF file size is creeping up and up, to the point that GitHub now complains about it too.

I've already got a list of the largest images etc. (from ls), which are the likely culprits contributing to the file size. There are other elements in the document too, however, such as To-Do notes and on-the-fly typeset images (à la TikZ and its ilk), whose contribution can't be calculated in the same way.

Is there an easy way to tell what is contributing to making a TeX PDF file large? Is there anything in the auxiliary files?

  • Don't you mean 'PDF file' in the last sentence? – albert May 15 '18 at 17:29
  • Not an answer: Why do you push the reproducible results of the typesetting to GitHub? If you include all the source separately there is no need to also push the PDF. Most likely your images are the biggest culprits. To-Do notes won't take up much as they should be only text. The results of TikZ shouldn't be as big as a pixel graphic either (this of course highly depends on the number of paths in your TikZ foo). – Skillmon May 15 '18 at 17:29
  • You can get a rough estimate of how much your pictures increase the size of your PDF by using the draft option for the graphicx package. – Mike May 15 '18 at 21:28
  • I’m using GitHub as a backup of the whole project, not just a source code repo. Having the PDF also backed up and available online is useful for sharing the document with others who are not TeX-savvy. That’s by-the-by anyway, as it doesn’t really affect the outcome of this question. – Joe Healey May 16 '18 at 07:50
  • @albert, no, I don’t think I did mean PDF? I’m wondering if any of the aux files (that I’m not very familiar with) have information about the size of graphics or document elements etc. – Joe Healey May 16 '18 at 07:53
  • I was thinking of the .tex file, so a better formulation is probably 'another TeX file, like aux, idx, log, ...'. – albert May 16 '18 at 07:59
  • Oh sorry I follow you now, yes I did mean the PDF! – Joe Healey May 16 '18 at 08:01
  • If you want to keep your file size low, don't crop pictures using graphicx cropping methods but crop the files themselves (it might be good to keep the originals as backups). Also scale their resolution down to a resolution appropriate for their printing size (no need to include an image at 900 dpi if it gets printed at half of its natural size; 150-300 dpi seems enough here). The human eye has its limitations on resolution. – Skillmon May 16 '18 at 11:26
  • Ah interesting. graphicx is 'keeping' the cropped areas? This is definitely something that will be costing me in the document at the moment. Rather than have to go back and redo all the figures, is there an option to have those regions 'removed'? – Joe Healey May 16 '18 at 11:28
  • As far as I know it does keep those cropped areas in the PDF internals. But I'm really not an expert on graphics. I'm not aware of an already implemented method to save the cropped image files automatically, nor do I know how the graphics bundle could be configured to leave out the data. – Skillmon May 16 '18 at 11:33
  • @Skillmon it is very hard to make a smaller PDF by cropping. LaTeX certainly does not try; it just specifies a clip path so nothing is rendered outside the specified rectangle (see the sketch after these comments). – David Carlisle May 16 '18 at 12:23
  • @JoeHealey Depending on your file, post-processing the PDF with gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=foo-compressed.pdf foo.pdf can dramatically reduce the file size. – samcarter_is_at_topanswers.xyz May 16 '18 at 12:50
  • @samcarter well that's certainly worth knowing! That command alone reduced the PDF file size by around 50% (~25 MB). – Joe Healey May 16 '18 at 13:28
  • No one in the comments seems to have asked the obvious: what is the total size of the images (.jpg and .png) on disk prior to compiling into the PDF? How large is the largest of these files? – Aubrey Blumsohn May 16 '18 at 17:32
  • @AubreyBlumsohn I'll have to get back to you with a slightly more accurate total size (as some images are in the directory hierarchy but aren't incorporated into the doc yet), but crudely, there are roughly ~150 MB of images (PNG/PDF/JPG) so far (many more to come). The largest image currently is approximately a megabyte, with ~30 images over 900 kB in size, though most of these large ones are among those not yet included. I will be including them soon though, which is part of why I'm trying to head off the inflating file size early, as it will skyrocket soon. – Joe Healey May 16 '18 at 17:52
  • You need to think about each image individually for an image-intensive document. If you are printing on an average printer, a full-page image really needs no more than about 300×300 pixels per printed square inch. For on-screen viewing, 200×200 is more than enough. Use something like IrfanView with the web plugin to crank down each image's dimensions and compression ratio before it even enters your LaTeX project. For typical printing with, say, 9 images per page, little is gained by devoting more than about 0.2 MB to each image; set the compression for each image by eye. – Aubrey Blumsohn May 16 '18 at 21:18
  • @AubreyBlumsohn, the images I have are already compressed extensively (I'm still experimenting with other options I can use without trashing the quality). The largest images are 2048×2048 (as they were originally hi-res microscopy images), I've already reduced the color depth to 8 bit, and their DPI values are 72 dpi at present. There will be some images I can still improve potentially, but these microscopy images account for a significant proportion of the doc size. In case I can't improve their sizes any further, I was hoping TeX might 'tell me' where else I could stand to gain. – Joe Healey May 18 '18 at 08:48
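
To illustrate the cropping point discussed above: a minimal sketch of in-LaTeX cropping (the file name, trim values, and width are just placeholders). The trimmed margins are only hidden by a clipping path; the full image data still ends up inside the PDF, so cropping the file itself is what actually saves space:

    % trim = left bottom right top; clip hides the trimmed area but the
    % whole image is still embedded in the PDF
    \includegraphics[trim=1cm 2cm 1cm 2cm, clip, width=0.8\textwidth]{figures/micrograph}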

1 Answer


I don't know what your project is composed of, but usually the culprits are the images. As suggested in the comments, you could check how much they contribute by simply compiling the .tex file with the draft option of the graphicx package.
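
For instance, a minimal sketch of that check (assuming the images are included with graphicx's \includegraphics):

    % draft makes graphicx draw an empty box of the same size instead of
    % embedding each image, so the difference between the two PDF sizes is
    % roughly the images' contribution
    \usepackage[draft]{graphicx}
    % ... rest of the preamble and document unchanged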

The source files of TikZ pictures usually don't take much space, but the compiled results can. I always suggest separating the TikZ pictures from the main file and including them in your document through the standalone package in the parent file and the standalone class in the child files. This way you can compile each TikZ picture on its own and see how large it is when compiled (of course, this is not the only advantage of using standalone).
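
A minimal sketch of that layout (file and picture names are just placeholders):

    % figures/plot.tex (child file), can be compiled on its own
    \documentclass{standalone}
    \usepackage{tikz}
    \begin{document}
    \begin{tikzpicture}
      \draw (0,0) -- (2,1); % the actual picture goes here
    \end{tikzpicture}
    \end{document}

    % main.tex (parent file)
    \documentclass{article}
    \usepackage{tikz}
    \usepackage{standalone}
    \begin{document}
    \input{figures/plot} % standalone skips the child's preamble
    \end{document}

Compiling figures/plot.tex on its own then produces a small PDF whose size hints at how much that picture adds to the main document.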

You said in the comments that you don't use your GitHub repo as a source code repo but as a backup repo, but the purpose of a GitHub repo is exactly to be a source repo. However, even if you use it as a backup, there is no need to track the files created by the compilation: just add them to .gitignore and add a readme with the instructions for the correct compilation. The final PDF is a different matter; you can keep it to share with others.
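
A possible .gitignore along those lines (which extensions actually appear depends on the packages you use) might look like:

    # LaTeX build artefacts
    *.aux
    *.log
    *.out
    *.toc
    *.idx
    *.bbl
    *.blg
    *.synctex.gz
    # the final PDF can stay tracked if you want to share it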

gvgramazio
  • Yep, I realise that I'm 'abusing' GitHub a little, however it means I can back up my whole document with pretty much a single command (beats the hell out of copying to a backup drive or something manually). I don't actually have much in the way of TikZ pictures, but I am using the TeXShade package, which takes a while to compile but which I suspect may not be increasing the size much (though I'd like some way of knowing this for sure, if there is one). – Joe Healey May 16 '18 at 09:49
  • As I said, by simply adding the .aux, .idx, etc. files to .gitignore you can still back up with only a single command; the folder is simply cleaner. It's good practice, when your document is a book, a thesis, or even an article or a report of a few pages, to structure it as a modular document. It has several advantages, one of them being to know how much each part contributes to the total size. However, you can achieve this particular feature only with the standalone solution; with input and include you get other benefits, but not this one. – gvgramazio May 16 '18 at 09:57
  • If you are not familiar with modular documents, read about them on Wikipedia and read on tex.stackexchange about the difference between \input and \include. – gvgramazio May 16 '18 at 09:59
  • The document is already modular (it is a thesis with various \input chapters). I don't believe there is anything wrong with how I have built the document. Ignoring the auxiliary files doesn't answer the question though. I'm not concerned by the overall size of the project/repo; I'm interested in ways of keeping the final PDF size down, without having to compress after the fact for instance. I figure it might be a generally useful process if one is able to identify unambiguously the major contributors to the document size (then I could go and compress them some more individually, say). – Joe Healey May 16 '18 at 10:07
  • My answer is more focused on how to check which parts contribute most to the size of the final PDF. If you want a stupid and obvious answer to your question (I'm saying that the answer is stupid, not your question), it is simply: don't insert high-resolution images, since, for large documents, the only possible culprits are the pictures (TikZ, PNG, JPG, EPS, or whatever). – gvgramazio May 16 '18 at 10:12
  • However, since you are not interested in keeping the size of the project low but only the size of your PDF, the only option possible IMHO is to compile separate PDFs, i.e. if you are writing a book, you can produce different PDFs for the index, chapters, appendix, and bibliography. – gvgramazio May 16 '18 at 10:15
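
If the chapters were switched from \input to \include, one rough way to get those per-part PDFs is \includeonly (a sketch; the class and chapter names are placeholders), which builds a PDF containing only the listed parts so their sizes can be compared:

    % main.tex (only the relevant parts shown)
    \documentclass{book}
    \includeonly{chapters/results} % build a PDF containing just this chapter
    \begin{document}
    \include{chapters/intro}
    \include{chapters/methods}
    \include{chapters/results}
    \include{chapters/discussion}
    \end{document}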