
I'm more or less in the process of converting old paper documents (from the 1970s, say) into modern LaTeX versions. What I am doing is as follows:

  1. write the whole LaTeX source from scratch (very long and inefficient)
  2. scan the document, run an OCR tool on it, and translate the resulting txt file into a tex file

I'm also thinking of using voice recognition software to go a bit faster. Are some of you in the same situation, and what advice would you give to speed up the whole process? The final goal is to share the PDF documents and the related LaTeX source on an open archive and revive these interesting documents that are about to die.

Edit 1: work also has to be done on figures and diagrams. As far as I know, it is almost impossible to automate this task, so I'm currently redrawing everything with Inkscape, TikZ or pstricks.

Edit 2: the Tesseract-ocr developers are willing to help, but it is not a high priority for them. In any case, it looks like Tesseract can be trained.

pluton
  • What type of documents are these? Lots of tips on this type of work can be found at http://www.gutenberg.org/. Unfortunately some manual LaTeX work is inevitable. The positive aspect is that they will be machine readable 50 years from now. Consider also outsourcing spelling corrections to Amazon Mechanical Turk. – yannisl Apr 11 '11 at 15:25
  • @Yiannis they are mostly old and nicely done lecture notes that are worth sharing. I'm also adding a comment on figures. – pluton Apr 11 '11 at 15:31
  • You might also want to consider the path OCR -> text -> RST -> LaTeX, if the structure of the document makes this easier. – Brent.Longborough Apr 11 '11 at 16:58
  • @pluton any news about conversion from old docs to LaTeX? I'm interested in converting some of mine too. Has any "automated process" appeared since the initial post? – Mafsi Feb 18 '21 at 19:57
  • @Mafsi not really. I was optimistic about Tesseract-ocr because they explicitly stated they would have a look at this problem (OCR to LaTeX), but I am not aware of any new developments on this. – pluton Feb 19 '21 at 06:22

1 Answer


I have the following workflow:

  1. Scan the pages into a series of TIFF images.
  2. Process them with Scan Tailor to fix the orientation, split the pages, and produce black-and-white images.
  3. Join the resulting images into a multi-page TIFF with the tiffcp command from libtiff.
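
Step 3 can be scripted. The following is only a minimal sketch in Python: the function names, the `out` directory, and the `*.tif` pattern are assumptions made for illustration (adjust them to wherever Scan Tailor wrote its pages). It relies on `tiffcp` accepting a list of input files followed by the output file.

```python
import subprocess
from pathlib import Path

def tiffcp_command(page_dir, output="book.tif", pattern="*.tif"):
    """Build a tiffcp call that concatenates the pages in sorted name order.

    Assumes the page files have zero-padded names (page_001.tif, ...),
    so that sorting by name gives the reading order.
    """
    pages = sorted(str(p) for p in Path(page_dir).glob(pattern))
    if not pages:
        raise FileNotFoundError(f"no files matching {pattern} in {page_dir}")
    # tiffcp usage: tiffcp input1.tif input2.tif ... output.tif
    return ["tiffcp", *pages, str(output)]

def join_pages(page_dir, output="book.tif"):
    """Run tiffcp (must be installed and on PATH) to write the multi-page TIFF."""
    subprocess.run(tiffcp_command(page_dir, output), check=True)
```

Calling `join_pages("out")` would then write `book.tif` in the current directory, assuming the processed pages are in `./out`.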

Then I use FineReader, because the result has to be in RTF format.

However, open-source OCR engines such as Cuneiform or Tesseract have recently been giving good results, and they can export text in hOCR format. hOCR is in fact HTML with information about paragraphs, page and line breaks, and other elements of the page. It should be possible to write a script that converts this format to LaTeX.
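
As a rough illustration of what such a script could look like, here is a minimal Python sketch using only the standard library. It assumes that paragraphs are marked with the hOCR class `ocr_par` (as in Tesseract's output) and that each paragraph element does not contain a nested element of the same tag; the names `HocrToLatex`, `hocr_to_latex`, and `escape_latex` are made up for this example. It only extracts running text and escapes LaTeX special characters; math, headings, and layout are not handled.

```python
from html.parser import HTMLParser

# Characters that are special in LaTeX text mode and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def escape_latex(text):
    """Escape LaTeX special characters one by one."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)

class HocrToLatex(HTMLParser):
    """Collect the text of every element whose class list contains ocr_par."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._chunks = None    # text pieces of the paragraph being read
        self._par_tag = None   # tag name that opened the paragraph

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_par" in classes and self._chunks is None:
            self._chunks = []
            self._par_tag = tag

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._chunks is not None and tag == self._par_tag:
            # Collapse line breaks and repeated spaces into single spaces.
            text = " ".join(" ".join(self._chunks).split())
            if text:
                self.paragraphs.append(escape_latex(text))
            self._chunks = None

def hocr_to_latex(hocr):
    """Return the paragraphs of an hOCR string as LaTeX text, blank-line separated."""
    parser = HocrToLatex()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

The result is plain paragraph text that can be pasted into a document body; words hyphenated across line breaks are left as they are, so some manual cleanup is still to be expected.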

Illustrations are another problem. You can vectorize them with potrace or autotrace, and potrace can also be used from within Inkscape. The results are good for illustrations, but I don't know whether they are usable for diagrams or graphs.
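
For batch work, potrace can also be called from a script. The sketch below (Python; the function names and the file name `fig3.pbm` are hypothetical) assumes the figure has already been cropped and saved as a bitmap in a format potrace reads, such as PBM, and uses potrace's `-s` (SVG output) and `-o` (output file) options.

```python
import subprocess
from pathlib import Path

def potrace_command(bitmap, svg=None):
    """Build a potrace call that traces a bitmap into an SVG file.

    If no output name is given, the bitmap's suffix is replaced by .svg.
    """
    bitmap = Path(bitmap)
    svg = Path(svg) if svg else bitmap.with_suffix(".svg")
    return ["potrace", "-s", "-o", str(svg), str(bitmap)]

def vectorize(bitmap, svg=None):
    """Run potrace (must be installed and on PATH) on one bitmap."""
    subprocess.run(potrace_command(bitmap, svg), check=True)
```

For example, `vectorize("fig3.pbm")` would write `fig3.svg` next to the input, which can then be opened and cleaned up in Inkscape.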

michal.h21
  • @michal.h21 Readiris also does a fine job of recognition (text + pictures), but unfortunately it is not open-source software. – filokalos Apr 11 '11 at 18:09
  • @filokalos The main question is whether it can export text in a format that can be converted to LaTeX. For example, FineReader can export to HTML, but the code is ugly and would be complicated to convert to LaTeX. – michal.h21 Apr 11 '11 at 18:34
  • @michal.h21 Could you send me a scanned document as an example (kazhanchik(at)yandex.ru)? I will try it and send you the results. – filokalos Apr 11 '11 at 18:58
  • @filokalos OK, I have sent some pages to your e-mail. – michal.h21 Apr 11 '11 at 19:10
  • @michal.h21 Check your e-mail. I've sent it. It was recognized in fully automatic mode. – filokalos Apr 11 '11 at 19:46
  • @filokalos: can Readiris recognize math? Is it possible to set it up so that it converts the text into LaTeX source (so that the integral becomes \int)? – pluton Apr 13 '11 at 16:53
  • @pluton Frankly, I don't know whether it can convert the results into LaTeX source. I exported it to a web browser. It recognizes text; that is all I know. :))) Send me a file and I will try to recognize it and send it back to you. – filokalos Apr 13 '11 at 17:37