Parser for pure LaTeX

Question

Is it possible to create a parser for pure LaTeX (no plain TeX, no TeX primitives) that supports all of LaTeX without using any TeX engine? I know of an iOS app that creates and typesets LaTeX, and I don't think it uses a TeX engine. Does such a parser exist?

Could you elaborate on how this differs from your other question? At the least, that had much more detail, and several worthwhile comments that could help you better phrase your question. — Teepeemm, Oct 26 '18 at 18:43
Although this question is about parsing TeX code at its lowest level (the way I interpreted it, at least), I think that the [tag:tex-core] tag is for the inner workings of TeX itself, and you specifically ruled that out, so I would remove it... Either way, I think it's a good question but it requires more details. Do you want to parse only proper LaTeX code, or does plain TeX need to be included? Are packages considered? Other formats? Please explain your question better or it will probably be closed as "too broad". — Phelype Oleinik, Oct 26 '18 at 18:43
A clearly written Turing complete LaTeX parser in a high level language would be lovely to see. — Simd, Oct 26 '18 at 19:34
What is pure LaTeX in your view? Are commands like \def (TeX command, but used in many LaTeX documents) out of scope? And what should the output of the parser be? — TeXnician, Oct 26 '18 at 19:35
It’s way off topic but a LaTeX to speech system would seem to need to fully parse the LaTeX, as an example. — Simd, Oct 26 '18 at 19:43
Related: is there a python module for parsing LaTeX? - TeX - LaTeX Stack Exchange — user202729, Mar 09 '24 at 11:16

David Carlisle · Accepted Answer · 2018-10-27T07:21:33.070

If you write a parser, you can define the subset of LaTeX that you support. (There isn't really a useful definition of "pure LaTeX with no primitives".)

For instance, MathJax has a parser for a subset of LaTeX math markup, written in JavaScript, and LaTeXML has a parser for almost complete TeX, written in Perl, which does not include any TeX execution. LaTeXML's parser is perhaps the closest to what you ask, as far as I understand the question: https://github.com/brucemiller/LaTeXML

Here is an example that only uses commands defined in core LaTeX. (The shortvrb package is part of the base LaTeX2e release, so it is as fundamental a part of LaTeX as, say, \section, which is defined in the article class from the same base release files.)

\documentclass{article}
\usepackage{shortvrb}


\begin{document}

\MakeShortVerb\*

 {\bfseries *}{* some text}

\DeleteShortVerb\*

 {\bfseries *}{* some text}

\end{document}

Note that it is not possible to statically assign any tokenisation to *}{*: in the first case it produces the two character tokens }{, and in the second case it produces the two character tokens ** (the first one being bold).
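
To make the state dependence concrete, here is a minimal Python sketch. It is not how LaTeX or any existing tool works internally: the function name `tokenize` and the token labels `verb` and `char` are hypothetical, and it assumes short-verb delimiters are the only thing that changes how characters are read. Even this toy tokenizer has to carry state from earlier in the document before it can decide what `*}{*` means.

```python
# Toy tokenizer (illustrative assumption, not LaTeX's real algorithm).
MAKE = "\\MakeShortVerb\\"
DELETE = "\\DeleteShortVerb\\"


def tokenize(source):
    """Return a list of ('verb', text) and ('char', c) tokens."""
    short_verb = set()  # characters currently acting as verbatim delimiters
    tokens = []
    i = 0
    while i < len(source):
        if source.startswith(MAKE, i) and i + len(MAKE) < len(source):
            # \MakeShortVerb\c switches on verbatim behaviour for c
            short_verb.add(source[i + len(MAKE)])
            i += len(MAKE) + 1
        elif source.startswith(DELETE, i) and i + len(DELETE) < len(source):
            # \DeleteShortVerb\c switches it off again
            short_verb.discard(source[i + len(DELETE)])
            i += len(DELETE) + 1
        elif source[i] in short_verb:
            # Everything up to the matching delimiter is literal text
            end = source.find(source[i], i + 1)
            if end == -1:
                raise ValueError("unterminated short verbatim")
            tokens.append(("verb", source[i + 1:end]))
            i = end + 1
        else:
            tokens.append(("char", source[i]))
            i += 1
    return tokens


snippet = r"{\bfseries *}{* some text}"
with_verb = tokenize(r"\MakeShortVerb\*" + snippet)
without_verb = tokenize(snippet)

print([t for t in with_verb if t[0] == "verb"])  # [('verb', '}{')]
print(without_verb.count(("char", "*")))         # 2
```

The same eleven characters of input give a verbatim `}{` in one run and two ordinary `*` characters in the other, so no tokenisation can be assigned to the snippet without knowing what came before it.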

It would be reasonable to produce a LaTeX parser for a subset of the language that did not include this kind of construct, but you need to define the subset. It isn't enough to say "not plain TeX or primitives": there are plain TeX constructs that can be easily parsed, and there are LaTeX constructions that cannot be parsed in general without access to a full TeX typesetting system.
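
As an illustration of what "defining the subset" could look like, here is a small recursive-descent sketch in Python. The subset is an assumption made for this example only: `\`, `{` and `}` always keep their usual meaning, control words consist of ASCII letters, and nothing (verbatim, catcode changes, active characters) ever alters how characters are read. The function name `parse` and the node labels are hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)


def parse(source):
    """Parse a fixed-syntax LaTeX subset into nested (kind, value) nodes."""
    pos = 0

    def parse_items(in_group):
        nonlocal pos
        items = []
        while pos < len(source):
            c = source[pos]
            if c == "{":
                pos += 1
                items.append(("group", parse_items(True)))
            elif c == "}":
                if not in_group:
                    raise ValueError(f"unmatched }} at offset {pos}")
                pos += 1
                return items
            elif c == "\\":
                start = end = pos + 1
                while end < len(source) and source[end] in LETTERS:
                    end += 1
                if end == start:
                    # Control symbol such as \% or \\ : take one character
                    end = min(start + 1, len(source))
                items.append(("command", source[start:end]))
                pos = end
            else:
                start = pos
                while pos < len(source) and source[pos] not in "{}\\":
                    pos += 1
                items.append(("text", source[start:pos]))
        if in_group:
            raise ValueError("missing closing }")
        return items

    return parse_items(False)


tree = parse(r"\emph{text\section{text}text}")
print(tree)
```

The sketch deliberately does not attach groups to commands as arguments: how many arguments a command takes depends on its definition, which a purely syntactic pass does not know. A real subset parser would need a table of supported commands and their signatures, which is one concrete way of defining the subset.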

Thank you. This confirmed my suspicions. As you said, LaTeXML supports almost complete TeX after 14 years of development. E.g., LaTeXML can't handle the code above exactly like TeX. Thus supporting just a subset of LaTeX is the right way. — John webner, Oct 27 '18 at 08:14

score 1 · Answer 2 · answered Oct 26 '18 at 18:43

I think this is already done by document conversion software such as pandoc and similar tools. Generally speaking, these converters only parse a subset of the commands. In addition, regular expressions can be used to extract certain commands of interest.
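
A rough sketch of the regex approach, using Python's standard `re` module. The pattern name `SECTION_RE` and the helper `section_titles` are made up for this example, and the pattern assumes titles contain no nested braces and no optional `[...]` argument.

```python
import re

# Matches \section, \subsection and \subsubsection (optionally starred)
# followed by a brace group that contains no further braces.
SECTION_RE = re.compile(r"\\((?:sub){0,2}section)\*?\s*\{([^{}]*)\}")


def section_titles(source):
    """Return (command, title) pairs for simple sectioning commands."""
    return SECTION_RE.findall(source)


sample = r"\section{Intro} text \subsection*{Details} \section{A {nested} title}"
print(section_titles(sample))
# [('section', 'Intro'), ('subsection', 'Details')]
```

The third heading is silently skipped because its title contains a brace group, which shows the limit of this approach: balanced braces cannot be matched by such a pattern, so it only works for extracting simple, well-behaved markup.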

GrandFleet

I know that parsers like pandoc can only parse a subset of the commands. E.g., pandoc cannot parse \emph{text\section{text}text}. I am looking for a full pure LaTeX parser. – John webner Oct 26 '18 at 19:17
@Johnwebner Tbh, you shouldn't use such markup, but of course a parser might want to spit out something… – TeXnician Oct 26 '18 at 19:34
I think you are better off using TeX then. I doubt there can be a full parser without implementing TeX or a subset of it. – GrandFleet Oct 26 '18 at 19:45
@vy32 A programming language being Turing Complete in what it can express, is not the same as it requiring a Turing Complete parser. Think of LISP or PostScript, whose syntaxes are near-trivial to parse, but are still complete programming languages. – TextGeek Jun 10 '22 at 18:46
@TextGeek there are plenty of parsers for well-formed LISP and PostScript. I want one for well-formed LaTeX. It doesn't need to execute the code to create a parse tree. I'm okay with a parser that doesn't understand redefining the { or } or \. I don't even know if that is possible! – vy32 Jun 11 '22 at 19:35
