
I have an idea for a tool to conveniently skim through arXiv papers/articles. In short, it ought to filter out parts of a document based on their type (images, tables, formulas, paragraphs, etc.). Obviously this would be orders of magnitude easier to do on HTML rather than PDF. AFAIK PDF doesn't contain any information about document structure; it's basically a set of instructions like "draw this glyph there", so recovering the structure becomes an unnecessarily non-trivial OCR problem.
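
To make the filtering idea concrete, here is a minimal sketch of what I have in mind, assuming the paper is already available as HTML; the tag names are only placeholders for whatever elements a given converter actually emits:

    # Sketch only: hide certain element types in a converted HTML paper.
    # Requires beautifulsoup4; the tag names below are placeholders, since the
    # real markup depends on which converter (latexml, tex4ht, ...) produced the HTML.
    from bs4 import BeautifulSoup

    HIDE = ["figure", "table", "math"]  # element types the reader wants to skip

    def skim(html: str, hide=HIDE) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for name in hide:
            for node in soup.find_all(name):
                node.decompose()  # drop the whole subtree
        return str(soup)

    if __name__ == "__main__":
        with open("paper.html", encoding="utf-8") as f:
            print(skim(f.read()))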

The straightforward solution, then, is to convert the paper's .tex source to HTML. The available options are:

  • lwarp: seems to be intended for writing with an HTML target in mind from the outset; requires a lot of source patching
  • latex2html: pretty robust, but some content is converted to images, which is not great for responsiveness/a11y
  • pandoc: ditched by the arxiv-vanity/engrafo devs; I couldn't find any comparisons with other tools
  • make4ht/tex4ht: I'm not sure I understand its internal translation process (.tex -> DVI -> HTML?); overall it performs fine, with occasional artifacts that I have no idea how to fix (hooking into LuaTeX internals?), but some papers fail to render
  • latexml: there are efforts to convert the entire arXiv to HTML, and judging by the numbers it's not bad, though I haven't tested it myself yet; it's what arxiv-vanity uses, and from a user's perspective the conversion quality is on par with tex4ht (a minimal invocation sketch follows after this list)
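
For reference, the latexml route I would be testing looks roughly like the following. This is only a sketch, wrapped in Python to keep a single language in this post, and the exact command-line options may differ between LaTeXML versions:

    # Sketch: drive the usual two-stage LaTeXML conversion (.tex -> XML -> HTML).
    # Paths and flags are illustrative; check `latexml --help` for your installed version.
    import subprocess

    def tex_to_html(tex_path: str, html_path: str = "paper.html") -> None:
        xml_path = tex_path.rsplit(".", 1)[0] + ".xml"
        subprocess.run(["latexml", "--destination=" + xml_path, tex_path], check=True)
        subprocess.run(["latexmlpost", "--destination=" + html_path, xml_path], check=True)

    tex_to_html("paper.tex")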

Most likely I will use make4ht or latexml after additional testing, but it also occurred to me that I could extract relatively "high-level" objects directly, as if any typesetting had been thrown away: roughly what the TeX engine is left with after the LaTeX content macros are processed. LuaTeX glyph nodes seem too low-level for my purpose.
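
To make "high-level objects" concrete, this is the kind of intermediate representation I am after; the sketch below builds it from converted HTML as a stand-in, and the block kinds and tag mapping are my own guesses rather than anything a particular tool emits:

    # Hypothetical intermediate representation: the document reduced to typed blocks,
    # with all typesetting thrown away. Built here from converted HTML as a stand-in.
    from dataclasses import dataclass
    from bs4 import BeautifulSoup

    @dataclass
    class Block:
        kind: str  # "paragraph", "figure", "table", "formula", ...
        text: str  # plain-text content, no formatting

    # The tag-to-kind mapping is a guess; real tag names depend on the converter.
    KIND_BY_TAG = {"p": "paragraph", "figure": "figure", "table": "table", "math": "formula"}

    def extract_blocks(html: str) -> list[Block]:
        soup = BeautifulSoup(html, "html.parser")
        # find_all with a list of names returns matches in document order;
        # nested cases (e.g. inline math inside a paragraph) would need extra care.
        return [Block(kind=KIND_BY_TAG[node.name], text=node.get_text(" ", strip=True))
                for node in soup.find_all(list(KIND_BY_TAG))]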

So my question is: does LuaTeX/TeX internally operate on something akin to an abstract syntax tree of the document anywhere?

– theuses
  • TeX doesn't process the whole document, so it doesn't really have an AST at all; it was designed for the memory constraints of 1980. The LaTeX macros are only expanded one by one as the lower-level typesetting needs more text, and earlier pages are converted to DVI or PDF and output from memory before later parts of the document are even read from the file. – David Carlisle Apr 25 '21 at 18:18
  • This is probably relevant: https://www.latex-project.org/news/2020/11/30/tagged-pdf-FS-study/ – David Carlisle Apr 25 '21 at 18:29
  • As far as I know, LaTeX is concerned with typesetting the text of a document. For this it doesn't need, or have, any concept of an abstract syntax tree, whatever that might be. If you want to "skim" a document, then are you considering the initial, the second, or the third skim? In my GOM opinion, I think it best to let the document's readers do their own skimming in their own fashion. – Peter Wilson Apr 25 '21 at 18:30
  • @PeterWilson In order to produce accessible, "tagged" PDF (a legal requirement for some uses), LaTeX will have to generate a "structure tree" annotating the tree structure of the document on top of the basic typesetting function. – David Carlisle Apr 25 '21 at 18:33
  • @PeterWilson Once a more or less consistent way is found, this might be easily automated; skims could then be configured based on user preferences. – theuses Apr 25 '21 at 18:38
  • @DavidCarlisle Well, that's both a blessing and a curse; premature optimization, as Knuth himself might have put it :) Tagged PDFs seem interesting, I'll take a look, thanks! But in my humble opinion all of that work could be avoided if HTML became a standard for distributing papers. – theuses Apr 25 '21 at 18:39
  • @theuses Even by the late 1980s TeX could only hold 20 or 30 rows of a table in memory, but I had longtable test files (and real users) of thousands of pages. Holding the document in memory wasn't an option. – David Carlisle Apr 25 '21 at 18:44
  • @theuses Generating HTML and generating tagged PDF share many of the same issues: either way you need to know where headings, lists, math, etc. start and stop, and it's getting that information from a LaTeX document full of user-defined macros that is the hard part. – David Carlisle Apr 25 '21 at 18:47
  • @DavidCarlisle Sorry for not being clear, I was just joking; surely Knuth couldn't have cared less whether, a couple of decades later, computers would be able to hold thousands of books in memory, as the task had to be solved with whatever resources he had. I agree that generating HTML and tagged PDF (from LaTeX) share issues, but what if everyone wrote in some subset of HTML with predefined components for scientific writing, just as LaTeX does on top of TeX? – theuses Apr 25 '21 at 18:54
  • Then this site would be out of a job and we could all go home? – David Carlisle Apr 25 '21 at 20:41

0 Answers