Is it possible to create a parser for pure LaTeX (no plain TeX, no TeX primitives) that supports all of LaTeX without using any TeX engine? I know of an iOS app for creating and typesetting LaTeX, and I don't think it uses a TeX engine. Is there such a parser?
2 Answers
If you write a parser you can define the subset of LaTeX that you support. (There isn't really a useful definition of "pure LaTeX with no primitives".)
For instance, MathJax has a parser for a subset of LaTeX math markup, written in JavaScript, and LaTeXML has a parser for almost complete TeX, written in Perl, which does not involve running a TeX engine. LaTeXML's parser is perhaps the closest to what you ask, as far as I understand the question. https://github.com/brucemiller/LaTeXML
Here is an example that uses only commands defined in core LaTeX. (The shortvrb package is part of the base LaTeX2e release, so it is as fundamental a part of LaTeX as, say, \section, which is defined in the article class from the same base release files.)
\documentclass{article}
\usepackage{shortvrb}
\begin{document}
\MakeShortVerb\*
{\bfseries *}{* some text}
\DeleteShortVerb\*
{\bfseries *}{* some text}
\end{document}
Note that it is not possible to statically assign any tokenisation to *}{*: in the first case it produces the two character tokens }{, and in the second case it produces the two character tokens * and * (the first one being bold).
It would be reasonable to produce a LaTeX parser for a subset of the language that did not include this kind of construct, but you need to define the subset. It isn't enough to say "not plain TeX or primitives": there are plain TeX constructs that can be easily parsed, and there are LaTeX constructions that cannot be parsed in general without access to a full TeX typesetting system.
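To make the point about *}{* concrete, here is a rough sketch in Python of a toy tokenizer that has to carry state in order to decide what * means. It is not how TeX, LaTeXML or MathJax work internally; the function name tokenize and the token labels are made up for illustration, and it assumes well-formed input in which every short-verb delimiter is closed.

```python
def tokenize(source):
    """Toy tokenizer (hypothetical): the meaning of a character depends on
    which characters are currently registered as short-verb delimiters."""
    short_verb = set()  # characters currently acting as verbatim delimiters
    tokens = []
    i = 0
    make, delete = "\\MakeShortVerb\\", "\\DeleteShortVerb\\"
    while i < len(source):
        ch = source[i]
        if source.startswith(make, i):
            # \MakeShortVerb\* : from here on, * starts verbatim text
            short_verb.add(source[i + len(make)])
            i += len(make) + 1
        elif source.startswith(delete, i):
            # \DeleteShortVerb\* : * is an ordinary character again
            short_verb.discard(source[i + len(delete)])
            i += len(delete) + 1
        elif ch in short_verb:
            # Everything up to the matching delimiter is taken literally
            # (assumes the delimiter is closed; otherwise index() raises).
            end = source.index(ch, i + 1)
            tokens.append(("verb", source[i + 1:end]))
            i = end + 1
        elif ch == "\\":
            j = i + 1
            while j < len(source) and source[j].isalpha():
                j += 1
            if j == i + 1:
                j += 1  # control symbol such as \% or \\
            tokens.append(("control", source[i:j]))
            while j < len(source) and source[j] == " ":
                j += 1  # spaces after a control word are skipped
            i = j
        elif ch == "{":
            tokens.append(("begin", ch))
            i += 1
        elif ch == "}":
            tokens.append(("end", ch))
            i += 1
        else:
            tokens.append(("char", ch))
            i += 1
    return tokens


# The same text, with and without the short-verb declaration in force.
with_verb = tokenize(r"\MakeShortVerb\*{\bfseries *}{* x}")
without_verb = tokenize(r"{\bfseries *}{* x}")
print(with_verb)
print(without_verb)
```

With the declaration in force the braces end up inside a single verbatim token; without it they open and close groups. Even this toy has to execute the declarations it meets, which is why the supported subset has to be stated explicitly.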
-
Thank you. This confirmed my suspicions. As you said, LaTeXML supports almost complete TeX after 14 years of development. E.g. LaTeXML can't handle the code above exactly like TeX. Thus supporting just a subset of LaTeX is the right way. – John webner Oct 27 '18 at 08:14
-
I think document conversion software such as pandoc, and other tools available online, already does this. Generally speaking, these converters parse only a subset of the commands. In addition, regular expressions can be used to extract certain tags of interest.
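As a rough illustration of the regex approach, here is a minimal Python sketch that pulls sectioning titles out of a LaTeX source string. The names HEADING and extract_headings are made up for this example, and it assumes the title contains no nested braces, so it is a sketch under that assumption and not a general parser.

```python
import re

# Matches \section{...}, \subsection{...} and \subsubsection{...},
# with an optional star. Assumption: no braces inside the title.
HEADING = re.compile(r"\\((?:sub){0,2}section)\*?\{([^{}]*)\}")


def extract_headings(source):
    """Return (command name, title) pairs for simple sectioning commands."""
    return [(m.group(1), m.group(2)) for m in HEADING.finditer(source)]


sample = r"\section{Intro} some text \subsection*{Scope} \emph{more text}"
print(extract_headings(sample))
```

A title with nested markup, such as \section{A \emph{b} c}, is simply not matched by this pattern, which shows where regex extraction stops and a real parser for a defined subset becomes necessary.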
-
I know that parsers like pandoc can only parse a subset of the commands. E.g. pandoc cannot parse \emph{text\section{text}text}. I am looking for a full pure LaTeX parser. – John webner Oct 26 '18 at 19:17
-
@Johnwebner Tbh, you shouldn't use such markup, but of course a parser might want to spit out something… – TeXnician Oct 26 '18 at 19:34
-
I think you are better off using TeX then. I doubt there can be a full parser without implementing TeX or a subset of it. – GrandFleet Oct 26 '18 at 19:45
-
@vy32 A programming language being Turing complete in what it can express is not the same as it requiring a Turing complete parser. Think of LISP or PostScript, whose syntaxes are near-trivial to parse, but which are still complete programming languages. – TextGeek Jun 10 '22 at 18:46
-
@TextGeek there are plenty of parsers for well-formed LISP and PostScript. I want one for well-formed LaTeX. It doesn't need to execute the code to create a parse tree. I'm okay with a parser that doesn't understand redefining the { or } or \. I don't even know if that is possible! – vy32 Jun 11 '22 at 19:35
\def (TeX command, but used in many LaTeX documents) out of scope? And what should the output of the parser be? – TeXnician Oct 26 '18 at 19:35