I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted fine. However, some, such as \epsilon, \Updownarrow and \simeq, use non-Unicode codes, and others, such as \neq, use a combination of non-Unicode codes.
- \epsilon is written using the embedded font SCCPFS+CMMI10 and code 017
- \Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
- \simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
- \neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (/) and 0x3d (=)
Before I begin writing a table to map from the glyph code(s) to the equivalent LaTeX, I wonder whether such a mapping table already exists in the reverse direction for use within LaTeX. After all, somewhere the original \epsilon, \neq etc. must get mapped to one or more glyph codes. The combination cases will also require position information, but that should be available too, in the reverse direction.
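A minimal sketch of what such a glyph-to-LaTeX table might look like in Python, using only the four observations listed earlier. The names `GLYPHS_TO_LATEX`, `base_font` and `to_latex` are hypothetical. The code 017 is read as octal, and the `=` half of \neq is assumed to come from CMR10 (plain TeX and LaTeX normally take `=` from the roman font), which would need checking against the actual PDF. Position information is ignored here; combinations are matched purely by glyph order.

```python
import re

# Hypothetical lookup table: a sequence of (base font name, glyph code) pairs
# maps to one LaTeX command. CMR10 for "=" is an assumption, not an observation.
GLYPHS_TO_LATEX = {
    (("CMMI10", 0o17),): r"\epsilon",
    (("CMSY10", 0x6D),): r"\Updownarrow",
    (("CMSY10", 0x27),): r"\simeq",
    (("CMSY10", 0x36), ("CMR10", 0x3D)): r"\neq",
}

# Embedded subset fonts carry a six-letter tag followed by "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def base_font(name):
    """Strip the subset tag, e.g. 'KAXSYH+CMSY10' -> 'CMSY10'."""
    return SUBSET_PREFIX.sub("", name)


def to_latex(glyphs):
    """Translate [(font, code), ...] to LaTeX commands, longest match first.

    Glyphs with no table entry are reported as None.
    """
    norm = [(base_font(font), code) for font, code in glyphs]
    longest = max(len(key) for key in GLYPHS_TO_LATEX)
    out, i = [], 0
    while i < len(norm):
        for n in range(min(longest, len(norm) - i), 0, -1):
            key = tuple(norm[i:i + n])
            if key in GLYPHS_TO_LATEX:
                out.append(GLYPHS_TO_LATEX[key])
                i += n
                break
        else:
            out.append(None)  # unknown glyph: leave a gap rather than guess
            i += 1
    return out
```

Keying on the base font name rather than the full embedded name matters because the six-letter subset tag changes from one PDF to the next.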
EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?
EDIT: I tried looking up the mmap file in a text editor but it is mostly hex. Is there a tool for opening it?
%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>
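The right-hand hex strings in the bfchar entries above look like UTF-16BE text, so a short script can make them readable. The following Python sketch rests on that assumption and only handles `beginbfchar` ... `endbfchar` sections; the function name `parse_bfchar` and the three-entry sample are illustrative, not part of any existing tool.

```python
import re

# One bfchar entry: "<source code> <destination string>", both in hex.
BFCHAR = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>")


def parse_bfchar(text):
    """Return {source code: decoded string} for entries inside bfchar blocks."""
    table = {}
    in_block = False
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("beginbfchar"):
            in_block = True
        elif line == "endbfchar":
            in_block = False
        elif in_block:
            match = BFCHAR.fullmatch(line)
            if match:
                src, dst = match.groups()
                # Assumption: destination strings are UTF-16BE.
                table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


# Three entries copied from the excerpt, wrapped in the block markers.
sample = """
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<00> <005C00620069006700630069007200630020>
<02> <005C0073007100750061007200650020>
<08> <005C006C006F007A0065006E006700650020>
endbfchar
"""

for code, name in parse_bfchar(sample).items():
    print(f"{code:#04x} -> {name!r}")
```

With this reading, `<00>` decodes to `\bigcirc` followed by a space, `<02>` to `\square` and `<08>` to `\lozenge`.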
EDIT: I looked up the character for \neq and it is composed of glyphs from two different fonts, so it is unlikely that this information is in one font. Doing a grep in the texlive directory gives some hints:
% grep -rw neq * | grep -w not
texmf-dist/tex/plain/base/plain.tex:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/generic/enctex/utf8raw.tex:\mubyte \neq ^^e2^^89^^a0\endmubyte % U+2260 not equal to
texmf-dist/tex/generic/ofs/ofs-cm.tex: \def\neq{\not=}
texmf-dist/tex/latex/listings/lstlang3.sty: myfont,n,nat2string,neq,ngon,norm2,normalmap,not,nu_grid,nubspline,%
texmf-dist/tex/latex/sansmath/sansmath.sty:% two lines, but it did not work well (unbold +, bold greek, bad \neq)
texmf-dist/tex/latex/base/fontmath.ltx:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/latex/unicode-math/unicode-math-xetex.sty: \cs_gset:cpn { not= } { \neq }
texmf-dist/tex/latex/unicode-math/unicode-math-table.tex:\UnicodeMathSymbol{"02260}{\ne }{\mathrel}{/ne /neq r: not equal}%
texmf-dist/tex/latex/unicode-math/unicode-math-luatex.sty: \cs_gset:cpn { not= } { \neq }
texmf-dist/tex/latex/breqn/cmbase.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathpazo.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathptmx.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}



inputenc mappings go the other direction. – Davislor Feb 04 '19 at 04:51
unicode-math package has a full list of the \commands it supports for every Unicode character. – Davislor Feb 04 '19 at 04:54
005C 0062 0069 0067 0063 0069 0072 0063 0020 = \bigcirc. In other cmap-files you can find mappings from glyphs to Unicode positions. E.g. <41> <D835DD38> where D835DD38 is the UTF-16 version of U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A). – Ulrike Fischer Feb 19 '19 at 14:02
newtxmath, newpxmath and mtpro2 lite) may use entirely different encodings. So I doubt a single “map” is useful. – Ruixi Zhang Feb 22 '19 at 16:34
\epsilon is mapped to '017 = "0F in OML; \Updownarrow is mapped to '155 = "6D in OMS; \int is mapped to '122 = "52 in textstyle and to '132 = "5A in displaystyle in OMX. Here ' indicates an octal number and " indicates hexadecimal. The LaTeX math font encoding can be different and the mapping is (and should be) done by packages. – Ruixi Zhang Feb 22 '19 at 16:42
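The slot numbers and the UTF-16 value quoted in the comments can be cross-checked with a few lines of Python. This is only an illustrative sketch; `tex_number` is a hypothetical helper that reads TeX's `'` (octal) and `"` (hexadecimal) prefixes.

```python
import unicodedata


def tex_number(token):
    """Parse a TeX-style number: '017 (octal), "0F (hex) or plain decimal."""
    if token.startswith("'"):
        return int(token[1:], 8)
    if token.startswith('"'):
        return int(token[1:], 16)
    return int(token)


# Octal/hex pairs exactly as quoted in the comments.
pairs = [("'017", '"0F'), ("'155", '"6D'), ("'122", '"52'), ("'132", '"5A')]
for octal, hexa in pairs:
    assert tex_number(octal) == tex_number(hexa)

# <D835DD38> is a UTF-16 surrogate pair; decoding it yields one code point.
char = bytes.fromhex("D835DD38").decode("utf-16-be")
print(f"U+{ord(char):05X}", unicodedata.name(char))
```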