Converting Math Symbols from PDF into LaTeX

Question

I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted fine. However, some, such as \epsilon, \Updownarrow and \simeq, use non-Unicode codes, and others, such as \neq, use a combination of non-Unicode codes.

\epsilon is written using the embedded font SCCPFS+CMMI10 and code 017 (octal, i.e. 0x0f)
\Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
\simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
\neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (ASCII 6, drawn as the \not slash) and 0x3d (=)

Before I begin writing a table to map from the glyph code(s) to the equivalent LaTeX, I wonder whether such a mapping table already exists in the reverse direction for use within LaTeX. After all, somewhere the original \epsilon, \neq etc. must get mapped to one or more glyph codes. The combination cases will also require position information, but that should be there too, in the reverse direction.

EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?

EDIT: I tried opening the mmap file in a text editor, but it is mostly hex. Is there a tool for opening it?

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>

EDIT: I looked up the character for \neq and it is composed of glyphs from two different fonts, so it is unlikely that this information is in one font. Doing a grep in the texlive directory gives some hints:

% grep -rw neq * | grep -w not
texmf-dist/tex/plain/base/plain.tex:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/generic/enctex/utf8raw.tex:\mubyte \neq ^^e2^^89^^a0\endmubyte % U+2260 not equal to
texmf-dist/tex/generic/ofs/ofs-cm.tex:  \def\neq{\not=} 
texmf-dist/tex/latex/listings/lstlang3.sty:      myfont,n,nat2string,neq,ngon,norm2,normalmap,not,nu_grid,nubspline,%
texmf-dist/tex/latex/sansmath/sansmath.sty:% two lines, but it did not work well (unbold +, bold greek, bad \neq)
texmf-dist/tex/latex/base/fontmath.ltx:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/latex/unicode-math/unicode-math-xetex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/unicode-math/unicode-math-table.tex:\UnicodeMathSymbol{"02260}{\ne                       }{\mathrel}{/ne /neq r: not equal}%
texmf-dist/tex/latex/unicode-math/unicode-math-luatex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/breqn/cmbase.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathpazo.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathptmx.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
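
Since plain.tex and fontmath.ltx define \neq as \not=, an extractor has to match a two-glyph sequence rather than a single code. The following is only a minimal sketch of that idea. It assumes the extractor yields (font name, code) pairs with the subset prefix already stripped, and that the equals sign is taken from cmr10. The table names and the to_latex helper are hypothetical, and only the symbols mentioned in this question are filled in.

```python
# Hypothetical lookup tables; only the symbols discussed in the question.
SINGLE = {
    ("CMMI10", 0x0F): r"\epsilon",
    ("CMSY10", 0x6D): r"\Updownarrow",
    ("CMSY10", 0x27): r"\simeq",
    ("CMR10", 0x3D): "=",  # assumption: "=" comes from cmr10
}

# Glyph sequences that together stand for a single command.
COMPOSITE = {
    (("CMSY10", 0x36), ("CMR10", 0x3D)): r"\neq",  # \not followed by =
}


def to_latex(glyphs):
    """Translate a list of (font, code) pairs, preferring two-glyph matches."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in COMPOSITE:
            out.append(COMPOSITE[pair])
            i += 2
        else:
            # Unknown glyphs are marked with "?" instead of being dropped.
            out.append(SINGLE.get(glyphs[i], "?"))
            i += 1
    return out


print(to_latex([("CMMI10", 0x0F), ("CMSY10", 0x36), ("CMR10", 0x3D)]))
```

A real extractor would probably also compare glyph positions before merging, because \not only overlaps the following relation when the two glyphs are actually set on top of each other.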

Font packages frequently come with a full font table. I don’t know of a program or database that maps U-encoded fonts to Unicode, but one might exist. Some inputenc mappings go the other direction. — Davislor, Feb 04 '19 at 04:51
If you have the source or can translate from codepoints to \commands, the unicode-math package has a full list of the \commands it supports for every Unicode character. — Davislor, Feb 04 '19 at 04:54
@HenriMenke yes MathPix is a great alternative but we also need a free option. — Himanshu, Feb 04 '19 at 04:55
Thanks @Davislor for sharing. For now, I am looking for the equivalent where Unicode is not output into the PDF. — Himanshu, Feb 04 '19 at 05:14
Right, so if you have a font table for the original font, you might be able to make a map for their unique “encoding,” like you can for OT1 or OML. — Davislor, Feb 04 '19 at 05:32
@Davislor Sorry, got that now. Let me check what's there in the font tables. Thanks. — Himanshu, Feb 04 '19 at 05:51
You could look in the mmap package. It has such mappings for tounicode. — Ulrike Fischer, Feb 04 '19 at 07:38
Can you not locate the TeX source of the PDF? Given that it is using a cmmi font, it almost certainly was made with TeX. The mapping is document-specific, as you are listing subsetted fonts that contain only the glyphs used in that document. — David Carlisle, Feb 04 '19 at 07:49
For users just wanting a quick two-way lookup, a general transcoder such as https://www.johndcook.com/unicode_latex.html will show \updownarrow as Unicode: U+2195 and \simeq as Unicode: U+2243, but will not show the composed \neq, which may be Unicode: U+003D & U+0338 — , Feb 05 '19 at 14:28
The hex in the cmap file you are showing simply maps glyph numbers to command names. E.g. the first (with suitable spaces) is 005C 0062 0069 0067 0063 0069 0072 0063 0020 = \bigcirc. In other cmap files you can find mappings from glyphs to Unicode positions. E.g. <41> <D835DD38>, where D835DD38 is the UTF-16 version of U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A). — Ulrike Fischer, Feb 19 '19 at 14:02
as Ulrike has pointed out the hex should be easy to process into a tabular form such as 30 30 005C 0062 0069 0067 0063 0069 0072 0063 0020 30 31 005C 006D 0064 006C 0067 0062 006C 006B 0063 0069 0072 0063 006C 0065 0020 30 32 005C 0073 0071 0075 0061 0072 0065 0020 which gives me 00\bigcirc 01\mdlgblkcircle 02\square 03\blacksquare 04\vartriangle 05\blacktriangle 06\triangledown 07\blacktriangledown 08\lozenge 09\blacklozenge 0A\mdlgblkdiamond — , Feb 19 '19 at 17:20
@UlrikeFischer that works for me. Could you please make that an answer. — Himanshu, Feb 20 '19 at 09:12
What did work for you? I only told you how to read the cmap file. — Ulrike Fischer, Feb 20 '19 at 09:21
Ok, let me go double check what is still missing. Being able to read the cmap file is helpful but I am not sure if I have all the pieces, yet. Thanks anyway. — Himanshu, Feb 20 '19 at 10:36
@Himanshu what do you mean? Mathpix is free. Harvard has also made an open source alternative, if by free you mean open source. http://lstm.seas.harvard.edu/latex/ — Mohammed Shahid, Feb 21 '19 at 02:21
@MohammedShahid Thanks for suggesting the alternative, will check. Free to use would have been fine, but the Mathpix API has a fee after 1000 calls - https://mathpix.com/ocr#pricing — Himanshu, Feb 21 '19 at 05:18
“Is the information really inside the font?” — Yes and no. The “tables” you are looking for are probably the ones in LaTeX font encoding guide. More specifically, the standard math font encodings for Computer Modern with TeX are OML, OMS and OMX (Appendix A.4, pp. 33–34 in the linked guide). Other math fonts (such as newtxmath, newpxmath and mtpro2 lite) may use entirely different encodings. So I doubt a single “map” is useful. — Ruixi Zhang, Feb 22 '19 at 16:34
For example, for Computer Modern math and plain TeX, \epsilon is mapped to '017 = "0F in OML; \Updownarrow is mapped to '155 = "6D in OMS; \int is mapped to '122 = "52 in textstyle and to '132 = "5A in displaystyle in OMX. Here ' indicates octal number and " indicates hexadecimal. The LaTeX math font encoding can be different and the mapping is (and should be) done by packages. — Ruixi Zhang, Feb 22 '19 at 16:42
Thanks @RuixiZhang. So you are saying that packages do the splitting into glyphs and that the mapping may not always be available in the pdf. This is closest to what I needed. Would you like to make this an answer? — Himanshu, Feb 23 '19 at 15:20
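
The decoding described in the comments above (each bfchar target is a UTF-16BE string) can be sketched in a few lines of Python. This is only an illustration: the excerpt holds two entries copied from the question plus the <41> example from the comments, and parse_bfchar is a hypothetical helper name. It reads only simple one-line bfchar entries, not bfrange blocks.

```python
import re

# Two entries from the CMap excerpt in the question, plus the
# Unicode-valued entry quoted in the comments.
CMAP_EXCERPT = """
<00> <005C00620069006700630069007200630020>
<02> <005C0073007100750061007200650020>
<41> <D835DD38>
"""

BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(text):
    """Return {glyph code: decoded target string} for each bfchar line."""
    table = {}
    for code_hex, target_hex in BFCHAR_LINE.findall(text):
        # Targets are UTF-16BE; a surrogate pair decodes to one character.
        target = bytes.fromhex(target_hex).decode("utf-16-be")
        table[int(code_hex, 16)] = target
    return table


table = parse_bfchar(CMAP_EXCERPT)
for code, target in sorted(table.items()):
    print(f"{code:02X} -> {target!r}")
```

Note that the command names in the mmap file carry a trailing space (0020), so a consumer may want to strip it before comparing.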

Ruixi Zhang · Accepted Answer · 2019-02-24T02:30:31.160

Let’s start with the following example:

\documentclass{article}
\newcommand*\testsqrtsign[1]{\sqrtsign{\vphantom{#1}}}
\pagestyle{empty}
\begin{document}
\[
\testsqrtsign{|}\testsqrtsign{\big|}\testsqrtsign{\Big|}\testsqrtsign{\bigg|}\testsqrtsign{\Bigg|}
\]
\end{document}

Compile the above code with pdfLaTeX and open the PDF file in Adobe Acrobat Reader DC. Press Ctrl + F, type “pqrsvuut” in the Find bar, and press the Enter key or the Next button. The search matches the row of square root signs.

How bizarre, isn’t it?

Inspecting the PDF file further, we find that a font named “cmex10” is embedded. This simple experiment gives you a taste of how mathematical symbols are encoded in default LaTeX (and, to a certain extent, in the original TeX).

To address your question

I wonder if such a mapping table already exists in the reverse direction for use within LaTeX.

The short answer is: Yes.

Part 1: The default mathematical encodings

According to the LaTeX font encoding guide, there are 3 math font encodings by default (Section 2.6 on page 10), namely, OML, OMS and OMX. In particular, Appendix A.4 (pp. 33–34) lists 3 tables showing where exactly each math letter/symbol is encoded.

For instance,

the “Greek math italic lowercase epsilon” is encoded in OML at position '017 (octal) or "0F (hexadecimal), corresponding to the font “cmmi10” (Computer Modern Math Italic 10);
the “up down double arrow” is encoded in OMS at position '155 (octal) or "6D (hexadecimal), corresponding to the font “cmsy10” (Computer Modern Math Symbols 10);
the “integral sign in \textstyle” is encoded in OMX at position '122 (octal) or "52 (hexadecimal), corresponding to the font “cmex10” (Computer Modern Math Extension 10).
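
The encoding tables mix octal ('NNN) and hexadecimal ("NN) notation, which is easy to misread when comparing them with codes from a PDF extractor. As a small sketch, a hypothetical helper tex_number (not part of any TeX tool) converts both notations to an integer:

```python
def tex_number(text):
    """Parse a TeX-style number: 'NNN is octal, "NN is hexadecimal."""
    if text.startswith("'"):
        return int(text[1:], 8)
    if text.startswith('"'):
        return int(text[1:], 16)
    return int(text)


# The three slots listed above, written in both notations.
for octal, hexadecimal in [("'017", '"0F'), ("'155", '"6D'), ("'122", '"52')]:
    assert tex_number(octal) == tex_number(hexadecimal)
    print(octal, hexadecimal, tex_number(octal))
```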

Part 2: The mapping from commands to slots

The code containing the mapping from commands \epsilon, \Updownarrow and \int to their corresponding slots can be found in fontdef.dtx. For instance, we find these declarations:

...
\DeclareSymbolFont{letters}     {OML}{cmm} {m}{it}
\DeclareSymbolFont{symbols}     {OMS}{cmsy}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{cmex}{m}{n}
...
\DeclareMathSymbol{\epsilon}{\mathord}{letters}{"0F}
...
\DeclareMathDelimiter{\Updownarrow}
   {\mathrel}{symbols}{"6D}{largesymbols}{"77}
...
\DeclareMathSymbol{\intop}{\mathop}{largesymbols}{"52}
    \def\int{\intop\nolimits}
...

This is the “reverse” table you are asking for:

\epsilon is from letters, which is OML encoded and is located at "0F.
\Updownarrow, when not acting as a delimiter, is from symbols, which is OMS encoded and is located at "6D.
\intop is from largesymbols, which is OMX encoded and, when used in \textstyle, is located at "52.
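
To turn such declarations into a lookup table keyed by (encoding, slot), one option is to parse them mechanically. The sketch below is an assumption-laden illustration, not a complete parser: it handles only \DeclareSymbolFont and \DeclareMathSymbol lines of the simple form quoted above, ignores \DeclareMathDelimiter, and reverse_table is a hypothetical name.

```python
import re

# Declarations copied from the excerpt above (the delimiter one is omitted).
DECLARATIONS = r"""
\DeclareSymbolFont{letters}     {OML}{cmm} {m}{it}
\DeclareSymbolFont{symbols}     {OMS}{cmsy}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{cmex}{m}{n}
\DeclareMathSymbol{\epsilon}{\mathord}{letters}{"0F}
\DeclareMathSymbol{\intop}{\mathop}{largesymbols}{"52}
"""

SYMBOL_FONT = re.compile(r"\\DeclareSymbolFont\{(\w+)\}\s*\{(\w+)\}")
MATH_SYMBOL = re.compile(
    r'\\DeclareMathSymbol\{(\\\w+)\}\{\\\w+\}\{(\w+)\}\{"([0-9A-F]+)\}'
)


def reverse_table(source):
    """Map (encoding, slot) -> command for every \\DeclareMathSymbol line."""
    encoding_of = dict(SYMBOL_FONT.findall(source))
    return {
        (encoding_of[font], int(slot, 16)): command
        for command, font, slot in MATH_SYMBOL.findall(source)
    }


table = reverse_table(DECLARATIONS)
print(table)
```

Keying by encoding rather than by slot alone matters because the same slot number means different things in OML, OMS and OMX.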

Part 3: Instructing LaTeX to load the actual font files

This part of the code can also be found in fontdef.dtx:

...
\input  {omlcmm.fd}
\input  {omscmsy.fd}
\input  {omxcmex.fd}
...

but seems to be irrelevant to your current question. Feel free to look at How (La)TeX makes use of font related files […] when selecting fonts? and related posts to learn more. This part is included here because…

Part 4: Other math fonts and non-standard encodings

The newtxmath package provides a complete upright Greek alphabet (\Gammaup, \alphaup, etc.). They are from lettersA, which is declared in newtxmath.sty as

...
\DeclareSymbolFont{lettersA}{U}{ntxmia}{m}{it}
...

where U stands for “Unknown”. The corresponding untxmia.fd file contains a variety of fonts: “nxlmia”, “zmnmia”, “zcochmia”, “zchmia”, “ntxstx2mia” and “ntxmia”, and their bold versions. In theory, the author can use whatever encodings they please for these fonts. For newtxmath, we see that

...
\re@DeclareMathSymbol{\Gammaup}{\mathalpha}{lettersA}{0}
...

So if you write, say, $\bm{\Gammaup}$, where \bm is provided by the bm package, then you get a bold upright Greek uppercase Gamma. In Unicode, “Mathematical Bold Capital Gamma” is encoded at U+1D6AA, while in “lettersA” of newtxmath, it is encoded at 0 (decimal, the first slot in the font) in both the regular and the bold fonts.

Now you see the problem: There cannot be a single mapping that converts extracted symbols to their corresponding Unicode characters.
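
The ambiguity can be made concrete with a small sketch. The \Gammaup entry is the one quoted above. The \Gamma entry (symbol font operators, slot "00) comes from the standard LaTeX math setup and is not quoted in this answer, so treat it as an assumption of the example. The helper name commands_by_slot is hypothetical.

```python
from collections import defaultdict
import unicodedata

# (symbol font, slot, command)
DECLARED = [
    ("operators", 0x00, r"\Gamma"),   # standard setup, not quoted above
    ("lettersA", 0x00, r"\Gammaup"),  # newtxmath.sty, as quoted above
    ("letters", 0x0F, r"\epsilon"),
]


def commands_by_slot(declared):
    """Group commands by slot number only, ignoring the font."""
    by_slot = defaultdict(set)
    for _font, slot, command in declared:
        by_slot[slot].add(command)
    return by_slot


# Slots that map to more than one command once the font is ignored.
ambiguous = {
    slot: commands
    for slot, commands in commands_by_slot(DECLARED).items()
    if len(commands) > 1
}
print(ambiguous)

# The Unicode code point of the bold Gamma is unrelated to slot 0.
print(unicodedata.name("\U0001D6AA"))
```

Any usable table therefore has to be keyed by the font (or its encoding) together with the slot.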

Due to the lack of development in math font encodings (see LaTeX font encoding guide, the last 3 paragraphs at the end of Section 1.2), math fonts can have a variety of different “in-house” encodings. Besides newtxmath’s “lettersA” (U-encoded), there are amsfonts’ “AMSa” and “AMSb”, both U-encoded; there are the LMP1, LMP2 and LMP3 encodings of mtpro2 (commercial fonts); etc.

Concluding remarks

There are many math font encodings besides the standard three, and they are tied to specific fonts. The information about the mapping between input commands and their corresponding font slots can be found in the supporting LaTeX packages.

Since there are no “universally agreed” math font encodings, a single mapping from glyphs back to commands or Unicode characters, even if one existed, could not be expected to work across documents.

If you simply want to copy-and-paste math formulas in the PDF file, then maybe give unicode-math a try:

% !TeX program = XeLaTeX or LuaLaTeX
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
\[\int_0^{\pi\pm\epsilon} \sin x \, \symup{d} x = 2 \mp \delta\]
\end{document}

Kneel before the power of unicode-math, mortals!

Okay, back to the default encodings: why can we search for “pqrsvuut” to find the square root signs? Well, the first four extended root signs are encoded in OMX at positions "70, "71, "72 and "73, respectively, while the last “vertical” root sign is pieced together from one "76, two "75’s and one "74. Guess which ASCII characters usually sit at positions "70 through "76 ;-)
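
For readers who would rather not guess, the slot numbers can be read as ASCII codes directly. A short check, with the slots listed in the order given above:

```python
# OMX slots of the five radicals: four single glyphs, then the pieces
# of the assembled one ("76, "75, "75, "74).
radical_slots = [0x70, 0x71, 0x72, 0x73, 0x76, 0x75, 0x75, 0x74]

# Without a ToUnicode map, a PDF viewer falls back to reading the
# glyph codes as ordinary character codes.
searchable_text = bytes(radical_slots).decode("ascii")
print(searchable_text)
```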

For more information on how LaTeX handles fonts, the two main references (available at https://ctan.org/pkg/latex-base) are

Font encoding guide
Font selection guide

@Dr.ManuelKuehner I’m afraid I’ve spent waaay too much time on font related stuff. A bad habit of procrastination :) — Ruixi Zhang, Feb 23 '19 at 22:36
Thanks for the details and the summary in the last paragraph! Are there any books, resources you'd like to recommend to understand font handling in Latex? — Himanshu, Feb 23 '19 at 23:42
@Himanshu The two main references on LaTeX font handling are Font encoding guide and Font selection guide, available at https://ctan.org/pkg/latex-base. I also added an example illustrating the usage of unicode-math, which allows you to, well, work with Unicode. The only drawbacks are that there are far fewer Unicode math fonts available (see Which OpenType Math fonts are available?), and that unicode-math itself is still being actively developed. — Ruixi Zhang, Feb 24 '19 at 02:34