InftyReader is an OCR package from SciAccess that can be used to produce TeX output that looks like this:

\documentclass[a4paper,10pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{fancybox}
\pagestyle{plain}

\begin{document}
\begin{center}
\includegraphics[width=177.97mm,height=6.35mm]{image001.eps}
\end{center}
64
\begin{center}
\includegraphics[width=177.97mm,height=22.78mm]{image002.eps}
\end{center}
\Cyrillic_U\Cyrillic_De\Cyrillic_Ka

This appears to be an older Cyrillic encoding method using TeX commands that start with \Cyrillic_. So far I have been unable to find the package where these are defined, although the preamble I've copied suggests that they are using AMS-LaTeX (amsmath and amssymb).

I'm hoping to avoid writing a 66-pass Word macro (one pass for each character, upper and lower case) to convert this into something that's human-readable.

Does anyone have any insight here on what package is being used? The OCR looks good, with very few, if any, instances of confusion between Cyrillic and Roman-alphabet text, and the software appears to do very well at identifying inline and display math, so it would be wonderful to find an easy way to convert this into readable text.

  • Do the names correspond to those in the Wikipedia article "Letters of the Cyrillic alphabet"? It is possible to make LaTeX treat _ as a letter and to provide definitions that map each name either to Unicode Cyrillic or directly to LaTeX's input character representation (the \CYR... type of macros). (Using a delimited macro \Cyrillic_ looks more complicated, because U, De, Ka do not appear to be self-delimiting.) What is your goal: to make this compilable with LaTeX, or to convert the source to genuine Cyrillic? –  Jan 02 '19 at 09:35
  • Have you already asked at, or searched, the InftyReader site? –  Jan 02 '19 at 09:42
  • jfbu: Genuine Cyrillic would be preferred. And no, the names do not match Wikipedia's: й is rendered as \Cyrillic_ikratkoe and ь is rendered as \Cyrillic_myagkiiznak. – Paul Makinen Jan 02 '19 at 23:20
  • I did send an inquiry to SciAccess tech support. I have not heard back yet, but that's not surprising given the holidays and the time zones between here and Japan. – Paul Makinen Jan 02 '19 at 23:23
  • On their site I saw a 200MB or 300MB "TeX" bundle offered for download; it seems to contain an (old) TeX installation with various software such as dviout. I am not sure whether it contains the LaTeX support files. Anyway, if I were confronted with the problem, I would write a shell script using sed to do all 66 passes and replace every macro with its Unicode letter. Then, if you process the result with pdflatex, you will need T2A (and possibly T2B and T2C) for fontenc, and perhaps also some automatic font encoding substitution, as in this recent answer –  Jan 03 '19 at 09:34
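
The replacement suggested in the last comment does not need 66 separate passes; one regular-expression pass over the file could do it. Below is a minimal Python sketch of that idea, not a tested converter for InftyReader output. The table holds only the five names that appear in this thread (U, De, Ka, ikratkoe, myagkiiznak); every other entry would have to be added after looking at the real output. The function names `convert` and `unknown_names` are made up for illustration, and the sketch assumes each macro is followed by a non-letter (another backslash, a space, punctuation), as in the sample above.

```python
import re

# Partial table: only these five names are confirmed by the question and
# comments. Every other entry has to be added after inspecting real output.
CYRILLIC = {
    "U": "У",
    "De": "Д",
    "Ka": "К",
    "ikratkoe": "й",
    "myagkiiznak": "ь",
}

# A known name counts only when no further letter follows it, so a longer,
# still-unlisted name that starts the same way is left untouched.
_KNOWN = re.compile(
    r"\\Cyrillic_(" + "|".join(map(re.escape, CYRILLIC)) + r")(?![A-Za-z])"
)
# Any macro of this family, known or not (assumes names are ASCII letters).
_ANY = re.compile(r"\\Cyrillic_([A-Za-z]+)")


def convert(text):
    """Replace each known \\Cyrillic_<name> macro with its Unicode letter."""
    return _KNOWN.sub(lambda m: CYRILLIC[m.group(1)], text)


def unknown_names(text):
    """Return the sorted macro names in text that the table does not cover."""
    return sorted({name for name in _ANY.findall(text) if name not in CYRILLIC})
```

With this table, `convert(r"\Cyrillic_U\Cyrillic_De\Cyrillic_Ka")` returns `УДК`. Running `unknown_names` over the whole file first shows which names still need a table entry, and macros that are not in the table are left in place rather than guessed at.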
