I'm reproducing a book from its archive.org PDF using latex. For QA purposes I'd like to have a side-by-side display of the original page and the latex-typeset page, with the following requirements:

  1. synctex works, so I can jump directly from pdf to source to fix issues (I use TeXstudio).
  2. page numbers (i.e. the folio, not the pdf page number) are not changed.
  3. The typesetting is unaltered by the inclusion of the image.

However, every approach I've tried so far has one or more issues, so here I am looking for sage advice.

More info on the workflow

Generating the .tex source is largely automated using scripts. I've settled on having one pageXXX.tex per page, collected into directories by parts and chapters, which I can work on and typeset in isolation. Original line breaks are preserved by inserting \linebreak everywhere. Page breaks are implicitly handled by matching the page geometry and font sizes. Including some custom tex code on each page is not a problem.

What I've tried (in the order I've tried them)

  • Method #1: use pdftk to interleave pages from the original.
    • Advantages:
      • latex typesetting is unaffected.
      • Simple to get working, no changes to .tex files needed
    • Problems:
      • Breaks synctex (as I recall) because the pdf page numbers are changed.
      • Before correction, it's common for a single page.tex to spill over onto the next page. Every time that happens, all subsequent page images are shifted with respect to the typeset page number and appear at the wrong location.
  • Method #2: call pdfpages in every pageXXX.tex and insert the original page after the page.
    • Advantages:
      • The sync between page image and page is preserved because a specific page number is requested in every pageXXX.tex file.
    • Problems:
      • Calling pdfpages forces a new paragraph which alters the typesetting.
      • Sometimes results in ghost pages or orphans that otherwise wouldn't occur.
      • Strange interactions with lettrine: pageWithDropCap+pdfpage sometimes results in a dropcap-shaped "hole" on the following page.
      • pdfpages modifies the page counter, which has to be corrected for by more, brittle, hacks.
  • Method #3: Double the page width and use the extra space for the image, using atbegshi+[absolute]textpos+\includegraphics
    • Advantages:
      • page numbers and typography are unmodified
      • Each page asks for a specific image, no sync drift issues if a page overflows.
    • Problems:
      • I'm including an \AtBeginShipoutNext at the top of every pageXXX.tex, but the actual shipout occurs whenever latex decides to do it. Sometimes that's after the last line of a pageXXX.tex, and sometimes only after looking at the first line of the next pageXXX.tex (without including that line in the page). The result is that two \AtBeginShipoutNext hooks get squashed together and apply to a single page, with unfortunate results.

I'm open to any alternatives, but this last method should work perfectly if only I could ensure a (single) shipout at the end of each page.tex. I've tried manually including \pagebreak[4] but this sometimes results in extra blank pages depending on whether latex has already decided the page is full and shipped it or not.

I've also tried using the needspace package to improvise an "idempotent" pagebreak, but things didn't work as expected (spurious paragraph breaks, vertical spacing issues).

I've included as much detail as possible in hope that the information will be useful to others working on similar projects.

  • You want to weave pages from one document into another while preserving SyncTeX -- do I understand the whole of it? (On another note -- I'm not sure it's worthwhile to try to reproduce a work with LaTeX. It may be better to use Plain TeX (or ConTeXt, with which I've not played). The PLOS fiasco at least brought that particularity to light.) – Sean Allred Jul 07 '15 at 17:20
  • I prefer method #1: split the book into pages. If you are OK with one page, you'll be OK with the whole document. – touhami Jul 07 '15 at 17:24
  • @SeanAllred, not quite. Until everything is perfect, some pages' tex files may spill onto an extra page, which would throw off the correspondence if I just interleave the pages. – Jared Kulik Jul 07 '15 at 17:32
  • @touhami, what about paragraph breaks? Also, compilation times would probably explode. Last resort. – Jared Kulik Jul 07 '15 at 17:33
  • Maybe I don't understand, can you please explain more? Suppose you have a one-page pdf file, what exactly is the problem? – touhami Jul 07 '15 at 20:54
  • Pretend it's one big tex file. In reality it's spread over many individual files for organizational reasons and in order to make per-page tweaks easier to implement. – Jared Kulik Jul 07 '15 at 21:32
  • @SeanAllred, I looked into the idea of preserving SyncTeX across postprocessing. The format isn't very complicated, and figuring out everything else is possible, but there are no ready-to-use libraries for writing or rewriting SyncTeX. So it's a detour. I'd rather have latex do it for me. – Jared Kulik Jul 07 '15 at 21:43
  • minipage and samepage may be helpful: http://tex.stackexchange.com/questions/30734/how-make-sure-two-elements-stay-on-the-same-page – touhami Jul 07 '15 at 22:28

1 Answer

Following up on Sean's idea, I've put together a working solution that postprocesses the pdf output and then updates the .synctex file.

Using the original SyncTeX data (and the C parser provided by Jérôme Laurens), it's easy to figure out where the first line of each pageXXX.tex ended up in the pdf output. From that, the 2-up version is easily generated by merging whole pages from the latex-typeset pdf and the original page-image pdf.
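
As a rough illustration of the page-matching step, here is a minimal Python sketch. It is not the script I actually used: the function name `plan_two_up`, the `first_page_of` dictionary, and the choice of an interleaved layout (typeset page, then its scan, viewed in two-page mode) are all assumptions made for the example. It takes the typeset pdf page on which each pageXXX.tex starts, as looked up from the SyncTeX data, and produces the output page order plus the old-to-new page mapping needed later.

```python
def plan_two_up(n_tex_pages, first_page_of):
    """Plan an interleaved output pdf (hypothetical helper).

    n_tex_pages   -- number of pages in the latex-typeset pdf
    first_page_of -- dict: typeset pdf page (1-based) -> scan page whose
                     pageXXX.tex starts on that typeset page
    Returns (plan, sheet_map): plan lists the output pages in order,
    sheet_map maps each old typeset page number to its new position.
    """
    plan = []
    sheet_map = {}
    for k in range(1, n_tex_pages + 1):
        # The typeset page always comes first (left side in two-page view).
        sheet_map[k] = len(plan) + 1
        plan.append(("tex", k))
        # Its partner is the matching scan, or a blank filler when this
        # typeset page is only overflow from the previous pageXXX.tex.
        scan = first_page_of.get(k)
        plan.append(("scan", scan) if scan is not None else ("blank", None))
    return plan, sheet_map


# Three typeset pages; page001.tex spilled onto typeset page 2.
plan, sheet_map = plan_two_up(3, {1: 1, 3: 2})
```

Because an overflow page gets a blank partner instead of the next scan, a spill no longer shifts every later image. The sketch assumes at most one pageXXX.tex starts on any typeset page.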

Generating an updated .synctex file was also relatively simple. Because entire pages are preserved in the output, and only their pdf page number changes, it's simply a matter of updating sheet numbers in the file. The spec for SyncTeX was useful and so was the simplicity of the format (hurray for text-based and line-oriented formats!).
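
To give an idea of what updating sheet numbers can look like, here is a minimal Python sketch (again, not my actual code). It assumes an uncompressed .synctex file (a .synctex.gz would have to be gunzipped first) in which each sheet is opened by a line of the form `{N` and closed by `}N`. The name `renumber_sheets` and the `sheet_map` argument (old pdf page number to new pdf page number) are made up for the example.

```python
import re

# A sheet delimiter line: "{" or "}" followed only by the sheet number.
_SHEET_LINE = re.compile(r"^([{}])(\d+)$")


def renumber_sheets(synctex_text, sheet_map):
    """Return synctex_text with sheet numbers rewritten via sheet_map.

    Sheets missing from sheet_map keep their number. All other records
    (inputs, boxes, glue, ...) are copied through unchanged.
    """
    out = []
    for line in synctex_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        m = _SHEET_LINE.match(body)
        if m:
            old = int(m.group(2))
            new = sheet_map.get(old, old)
            # Keep the original line ending, whatever it was.
            line = f"{m.group(1)}{new}{line[len(body):]}"
        out.append(line)
    return "".join(out)


sample = "SyncTeX Version:1\nContent:\n{1\n[1,3:0,0:10,10,0\n}1\n{2\n}2\nPostamble:\n"
updated = renumber_sheets(sample, {1: 1, 2: 3})
```

One caveat: the sketch leaves the byte-offset records (the lines starting with `!`) alone, so they can go stale when a renumbered line changes length. Whether a particular viewer cares is something to check rather than assume.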

With that done, everything works beautifully: the latex output is untouched, pdf/source sync works perfectly, pages are matched automatically for output, and the whole thing takes less than a second to generate.

Output Example (Latex pages left, Images right):

enter image description here