
(Edit: Somehow an addition I made to this question got lost before I pressed the bounty button. Apologies to whoever already answered partially.)

I occasionally come across old(ish) papers written in LaTeX, from around 20-25 years ago, such as:

  • this one: Made into a PDF with Aladdin GhostScript; so probably it was tex->dvi->ps->pdf or something like that.
  • this one: tex->dvi->ps->pdf, using dvips and then Acrobat Distiller 3.01 for Windows

Anyway, the on-screen legibility with several PDF readers I've tried is usually poor. Is it possible to reprocess the file somehow so as to improve it?

Specifically, is it possible to...

  1. Manipulate/massage the bitmap fonts to improve their readability?
  2. Determine which fonts (families, weights, sizes) are used - assuming it's one of the more commonly-used fonts and not something esoteric - and replace the bitmap font glyphs with scalable, hinted font glyphs?
  3. Extract the text of words/lines and re-typeset it using a more legible font?

Of course, if the authors are reachable and have the sources, you could just get them and rebuild (well, sort of); but let's assume this is not an option and we only have the PDF to work with.


Here's some more information regarding the fonts in the two example files:

$ pdffonts P29.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no     173  0
[none]                               Type 3            Custom           yes no  no     166  0
Courier                              Type 1            Standard         no  no  no     471  0
Courier                              Type 1            Standard         no  no  no     470  0
Helvetica                            Type 1            Standard         no  no  no     122  0
[none]                               Type 3            Custom           yes no  no     123  0

$ pdffonts ng.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no       4  0
[none]                               Type 3            Custom           yes no  no       5  0
[none]                               Type 3            Custom           yes no  no       6  0
[none]                               Type 3            Custom           yes no  no       7  0
[none]                               Type 3            Custom           yes no  no       8  0
[none]                               Type 3            Custom           yes no  no       9  0
Helvetica-Bold                       Type 1            Standard         no  no  no      15  0
Times-Bold                           Type 1            Standard         no  no  no      16  0
Times-Italic                         Type 1            Standard         no  no  no      17  0
Times-BoldItalic                     Type 1            Standard         no  no  no      18  0
[none]                               Type 3            Custom           yes no  no      22  0
[none]                               Type 3            Custom           yes no  no      23  0
[none]                               Type 3            Custom           yes no  no      24  0
[none]                               Type 3            Custom           yes no  no      28  0
[none]                               Type 3            Custom           yes no  no      29  0
Times-Roman                          Type 1            Custom           no  no  no      52  0
Times-Italic                         Type 1            Custom           no  no  no      53  0
Times-BoldItalic                     Type 1            Custom           no  no  no      54  0
[none]                               Type 3            Custom           yes no  no      55  0
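
The two listings above can also be summarised programmatically. What follows is a minimal Python sketch, assuming the column layout shown above (a font name without spaces, a single-word encoding, then the emb/sub/uni flags and a two-number object ID). The names `parse_pdffonts` and `summarise` are hypothetical helpers made up for this illustration; they are not part of Poppler. The sketch separates the embedded Type 3 fonts (the bitmap ones, in files like these) from the non-embedded standard fonts that each viewer substitutes on its own.

```python
def parse_pdffonts(output):
    """Parse the text printed by `pdffonts` into one dict per font row."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # A data row ends with: emb sub uni <object number> <generation>.
        if len(tokens) < 8:
            continue
        flags, obj = tokens[-5:-2], tokens[-2:]
        if not all(f in ("yes", "no") for f in flags):
            continue  # header, separator, or shell prompt line
        if not all(t.isdigit() for t in obj):
            continue
        fonts.append({
            "name": tokens[0],                # assumption: no spaces in the name
            "type": " ".join(tokens[1:-6]),   # e.g. "Type 3"
            "encoding": tokens[-6],           # assumption: single word
            "embedded": flags[0] == "yes",
            "object": int(obj[0]),
        })
    return fonts


def summarise(fonts):
    """Split fonts into embedded Type 3 objects and non-embedded font names."""
    type3 = [f["object"] for f in fonts if f["type"] == "Type 3" and f["embedded"]]
    substituted = sorted({f["name"] for f in fonts if not f["embedded"]})
    return {"type3_objects": type3, "substituted_by_viewer": substituted}


# Sample input: the rows reported for P29.pdf above.
SAMPLE = """\
name      type    encoding  emb sub uni object ID
--------- ------- --------- --- --- --- ---------
[none]    Type 3  Custom    yes no  no     173  0
[none]    Type 3  Custom    yes no  no     166  0
Courier   Type 1  Standard  no  no  no     471  0
Courier   Type 1  Standard  no  no  no     470  0
Helvetica Type 1  Standard  no  no  no     122  0
[none]    Type 3  Custom    yes no  no     123  0
"""

summary = summarise(parse_pdffonts(SAMPLE))
```

Splitting on whitespace from the right, rather than on fixed column positions, keeps the sketch tolerant of differing column widths, at the cost of the two assumptions flagged in the comments.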
einpoklum
  • Maybe the fonts are not embedded and you need to get the correct fonts to display them correctly. – knut Feb 24 '18 at 19:20
  • @TeXnician: I don't, of course, otherwise I wouldn't have asked - I would just reproduce it from sources. – einpoklum Feb 24 '18 at 19:20
  • @knut: Maybe. How can I tell? ... – einpoklum Feb 24 '18 at 19:20
  • Sorry, this is not the area of my experience :) but maybe somebody else can help. Which OS do you use? Which PDF-viewers have you available? Maybe this helps to find a solution: https://tex.stackexchange.com/a/109460/6563 https://tex.stackexchange.com/questions/359288/which-fonts-is-texworks-using-to-replace-missing-system-fonts – knut Feb 24 '18 at 19:29
  • @knut: I use Linux, with xreader, evince and okular, but I don't think that matters all that much. Are you getting other results on Windows? Also, please don't send me trying to replace fonts when we don't know that's the problem. – einpoklum Feb 24 '18 at 19:59
  • @DavidCarlisle: see edit. – einpoklum Feb 24 '18 at 20:28
  • sorry yes I noticed you'd linked to the pdf and was getting the same. sort of odd collection, it's the type3 (bitmap) fonts that are the issue but they are oddly nameless subsetted ones, used in conjunction with helvetica and courier, not the expected tex bitmap fonts (for which you could have tried to substitute equivalent type1) so not sure you can do much. – David Carlisle Feb 24 '18 at 20:30
  • I also missed that you linked the pdf - my question was just to give you hints on how you can find font information. But you already did (-> +1) – knut Feb 24 '18 at 20:36
  • however possibly more useful is that the pdf also has the email of the author, which google suggests is still current; you could just ask if the document is available in a newer format. – David Carlisle Feb 24 '18 at 20:38
  • BTW if you print it out it may look fine. (Often the problem with bitmap fonts isn't so much their resolution itself, but the fact that they don't come with any hinting information for screen… on a printer there's enough resolution that things look better.) – ShreevatsaR Feb 24 '18 at 21:45
  • @ShreevatsaR: That's the thing. If it can look fine when printing, it can look fine on screen, by using the fonts intended for printing. Right? – einpoklum Feb 24 '18 at 21:56
  • no, the problem with bitmap fonts is that they work best at the exact resolution they are designed for (eg a 300dpi laser printer) but on screen the viewer has to sample the bitmaps to the resolution and zoom that you are using, so you get all sorts of artifacts (although some viewers are better at sampling than others) – David Carlisle Feb 24 '18 at 22:02
  • @DavidCarlisle: Any reasonable rendering engine would apply anti-aliasing on the fixed-resolution font and show a pleasant output on a screen. But regardless - if these bitmap fonts are positioned, and named, it should be theoretically possible to replace them with the parametric curve versions of the same fonts, either on-the-fly or by some reprocessing. – einpoklum Feb 24 '18 at 22:10
  • @einpoklum That's exactly my point (also what David said); with a bitmap font the rendering engine does not have enough information (“hinting”) to do anti-aliasing or the other tricks it likely does for vector fonts. (Spend a few hours reading this site, say: http://www.rastertragedy.com/ for a lot more on what's typically going on.) So things that look perfectly fine in print can look poor on screen. (There are no separate “fonts intended for printing” as you said: it's the same font that gets printed well and rendered poorly.) BTW the PDF in the question looks pretty decent on my screen. – ShreevatsaR Feb 24 '18 at 22:22
  • @ShreevatsaR: It doesn't need more information to do anti-aliasing. Hinting can help it, but with a 300 DPI bitmap it can anti-alias just fine: For a given pixel to be rendered, calculate its fractional coordinates in the 300 DPI bitmap and integrate over its surroundings with, say, an appropriately-scaled 2-d sinc weight function. I'm not an image processing expert but that should be good enough. – einpoklum Feb 24 '18 at 22:27
  • @einpoklum No, that's not good enough actually. You're overestimating the number of pixels available on a typical screen at a typical zoom level, the amount of allowable distortion before a typical human eye detects differences etc. If you simply interpolate each pixel using its surroundings, you just get blurry text. There are decades of work on text rendering; it's not so simple. Note BTW that on your screen you have far fewer pixels than the 300 dpi or whatever the font assumed (print resolution is higher), so the problem on screen is how to render a pixel based on many pixels of the font. – ShreevatsaR Feb 24 '18 at 22:49
  • replacing the fonts by scalable versions was what I was expecting to suggest but since pdffonts says the fonts are unnamed and in a custom encoding I have no idea how you could determine a replacement. – David Carlisle Feb 24 '18 at 22:55
  • @ShreevatsaR: I did say sinc filter rather than interpolation. On the other hand, the 300 dpi bitmaps look rather horrid when I zoom in, so - I don't know, maybe you're right. PS - I'm not totally ignorant, I know typical monitors are 96 dpi. – einpoklum Feb 24 '18 at 23:45
  • @einpoklum Not any more they aren't. Though if you are using an out-of-the-box Linux distro with X11, you probably are using 96 DPI on a screen which may be designed for something completely different. You can stop this happening, but you typically have to do this quite actively. You don't even need an especially fancy screen (e.g. real HD or quad or whatever) for 96 to be wrong - just higher HD than 96. – cfr Feb 25 '18 at 03:16
  • @ShreevatsaR It isn't necessarily true that the fonts on screen are the ones used to print. If the fonts are not embedded, this can be because they are in the 'standard' set traditionally available on printers. In that case, you might get the printer's fonts in the print-out, but OS/viewer substitutes on screen. It depends on the printer, the viewer, the printing system, the options and driver etc. and not just the PDF. At least, I don't know as much about it as you, but that was my understanding. Some Mac font formats wrapped screen and print fonts, too. – cfr Feb 25 '18 at 03:19
  • @cfr Oh good point, you're right, I wasn't thinking about the “standard” set where the viewer and printer can independently substitute different fonts. – ShreevatsaR Feb 25 '18 at 04:00
  • @DavidCarlisle: Taking you back to your comment from February - can't we just 1. guess, 2. Try to match with some known popular fonts or 3. Reconstruct curves based on the rendered bitmaps? – einpoklum Aug 29 '18 at 07:50
  • @einpoklum you could yes, but given anonymous type3 bitmap fonts in a pdf I would either give up or mail the author and ask for document source rather than try to recreate an egg from an omelette:-) – David Carlisle Aug 30 '18 at 08:35
  • @DavidCarlisle: Ok, but let's assume the author is unreachable/can't be bothered/doesn't have the sources. – einpoklum Aug 30 '18 at 17:57
  • then i would use the first option that I suggested (I don't say you should, but I would give up and accept the document as it is, just as I'd accept a scan of typewriter document, it may not be pretty or scale well, but it is what it is) – David Carlisle Aug 30 '18 at 17:58
  • One very big question: Does it have any diagrams or mathematical formulas with complex typesetting? – Davislor Aug 30 '18 at 23:29
  • @Davislor: Maybe, but you can ignore them. Or rather, for starters, you may assume there aren't, and a better solution would just leave them alone and focus on the text proper. – einpoklum Aug 31 '18 at 00:07
  • @ShreevatsaR: Just FYI - You have about 5 hours to post an answer - if even a partial one - to be eligible for the bounty. – einpoklum Sep 04 '18 at 10:05
  • I'll pass :-) Probably a project for next weekend. – ShreevatsaR Sep 04 '18 at 13:33
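
The resampling scheme proposed in the comments above (map each screen pixel to fractional coordinates in the 300 dpi bitmap, then integrate over its surroundings with a scaled sinc-like weight) can be sketched as follows. This is a minimal pure-Python illustration, assuming the glyph is given as rows of 0/1 values and using a Lanczos-windowed sinc as the weight. The function names and the window size `a=3` are assumptions chosen for the example and are not taken from any PDF viewer. As the replies point out, a filter like this gives smooth grey edges but supplies no hinting, so small text may still look blurry.

```python
import math


def lanczos(x, a=3):
    """Lanczos-windowed sinc: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def resample_row(row, scale, a=3):
    """Downsample a 1-D list of samples by `scale` (source dpi / target dpi)."""
    n_out = max(1, round(len(row) / scale))
    out = []
    for j in range(n_out):
        # Fractional coordinate of this output pixel's centre in the source.
        center = (j + 0.5) * scale - 0.5
        lo = math.floor(center - a * scale)
        hi = math.ceil(center + a * scale)
        acc = wsum = 0.0
        for i in range(lo, hi + 1):
            w = lanczos((i - center) / scale, a)  # kernel stretched by `scale`
            if w == 0.0:
                continue
            v = row[min(max(i, 0), len(row) - 1)]  # clamp at the borders
            acc += w * v
            wsum += w
        # Normalise, then clip the ringing that the negative lobes produce.
        out.append(min(1.0, max(0.0, acc / wsum)))
    return out


def resample(bitmap, src_dpi=300, dst_dpi=96):
    """Separable 2-D resampling: filter every row, then every column."""
    scale = src_dpi / dst_dpi
    rows = [resample_row(list(r), scale) for r in bitmap]
    cols = [resample_row(list(c), scale) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# A 25x25 patch at 300 dpi with a 5-pixel-wide vertical stem becomes 8x8 at 96 dpi.
stem = [[1 if 10 <= x < 15 else 0 for x in range(25)] for _ in range(25)]
grey = resample(stem)
```

A 5-pixel stem at 300 dpi covers only about 1.6 pixels at 96 dpi, so it necessarily ends up spread over two or three partially grey pixels; this is the blur that hinting of a scalable font is designed to avoid.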

1 Answer

I took the first page of the document. I opened it in Preview on macOS. I took a screen snapshot of that page without the page number. I created a new document in Preview using the screen snapshot. I saved the document as a PDF. I uploaded the PDF to GDrive. I used Google Docs to open the file. I opened the Google Docs file and printed it back to PDF. Here is an image of the results.

image of results page

This is a quick and dirty effort. The column ordering is lost. I imagine that significant improvements are possible to avoid this problem by paying closer attention to the snapshot (i.e. take one column at a time) and/or by using a professional-level application. To that end, I tried using PDFElement 6 Pro on the source PDF and its printed PDF copy. The conversion caught ONLY the page number. A fee is needed to test the OCR option of the app (and Google provided the conversion for free for a proof of concept).

I hope this demonstration of a quick and dirty effort to improve the readability of fonts in an "old" document provides enough satisfaction to suggest it as a viable answer.

In short, it is possible. Absent the .tex file, the trick is to run an OCR on an image of the document to re-convert the fonts to today's standards.

  • OCR, eh? Hmm. Interesting. It's too "costly" for me to lose the document structure and the less- or not-OCR'able material in the paper, but I suppose this could be a basis for a solution. I mean, if we were to somehow take the textual regions, OCR them (including inline LaTeX formulae), and render them back into the same location. But at the moment, what you suggest improves one aspect of readability at the expense of all others :-( Also - no need to update your answer due to the edit. – einpoklum Aug 29 '18 at 07:35
  • With a document that has two columns and math, OCR may cost more than one is willing to spend in time or expense. The fix that I see is that it erases issues of sloppy line-justification that make reading difficult. I suspect that no other approach will handle this problem. I am curious whether a commercial OCR package can render text "back to the same location" (i.e. in a two-column layout). As for math, tools exist to convert hand-written math to MathJax/LaTeX. In any case, perhaps a scan of the document at 600dpi+ can afford at least better resolution of the fonts than the raw PDF. – Jeffrey J Weimer Aug 29 '18 at 13:11
  • If memory serves me correctly, this was happening with commercial OCR (I think Adobe's) over 15 years ago, and you could get MS-Word documents with different frames on the page where the text was supposed to be. – einpoklum Aug 29 '18 at 17:47
  • So, you got the bounty. As you already know, I'm hoping for something beyond your suggestion - so I'm not accepting the answer - but it's certainly something, and you made the effort, so thanks again and enjoy your 25% reputation boost :-) – einpoklum Sep 04 '18 at 13:56
  • I'm also curious whether something else is possible. One might hope for an app that would for example swap the fonts in the PDF directly. In any case, the bounty is appreciated. :-) !! – Jeffrey J Weimer Sep 04 '18 at 14:48