Is there a Python module for parsing LaTeX?

Question

I am looking to write Python programs that modify LaTeX source files. To do this I would like a basic parser in Python that can reliably read and write LaTeX files while maintaining the tree. I'm okay if it is not a full implementation, but I need it to handle LaTeX's odd quoting rules and {} notation. Regular expressions simply do not work for this, because braces can nest recursively.

EDIT:

The main thing I want to handle is recursive braces, which is why I need a parser, rather than a simple lexical analyzer. That is, I want to be able to register \foo{} as a command I care about and catch:

\foo{this is the foo argument}

But I also want to be able to catch:

\foo{this is \emph{really} the foo argument}

Is there any such Python module out there?

I'm not aware of a library that can directly parse LaTeX and write it back to a file. One way (admittedly not very straightforward) might be to convert LaTeX to semantics-preserving XML with LaTeXML, read the XML with Python, and regenerate LaTeX using PyLaTeX. – Marijn, Sep 27 '20 at 13:59
If you mean that the resulting .tex is the same as the original input file, then no. But depending on the actual use case this may still be a suitable approach. – Marijn, Sep 27 '20 at 14:03
You can try TexSoup. However, there is not yet a complete LaTeX parser that can fully handle macro definitions, catcodes, LaTeX3 and so on. – Alan Xiang, Sep 27 '20 at 14:14
There clearly can be no full parser: try xii.tex (although LaTeXML actually parses that with Perl, which is quite a feat). There will always be cases that require a full TeX system to parse correctly, so it's a matter of how strict you want to be. Just parsing simple "LaTeX book" document markup, not TeX macro code, should be fairly easy. When you say "maintaining the tree", note that LaTeX never parses the full file and never constructs a tree. – David Carlisle, Sep 27 '20 at 16:10
I clarified the question to show the specific thing I want to do. – vy32, Sep 27 '20 at 18:03
Thanks for the clarification. However, it's still not entirely clear to me: let's say you have successfully extracted "this is \emph{really} the foo argument", then what do you want to do with it? Modify and write back? What kind of modification? And what is the reason for doing this? Note that if it is just about nested braces then you may also be able to do what you want using general-purpose libraries, e.g., from pyparsing import nestedExpr. – Marijn, Sep 28 '20 at 19:04
The same question on another site: python - Programmatically converting/parsing LaTeX code to plain text - Stack Overflow – user202729, Mar 09 '24 at 11:21
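
For the nested-brace case raised in the question and in the comments above, plain depth counting may already be enough. The following is a minimal sketch using only the standard library; find_macro_args is a hypothetical helper name, not part of any package mentioned here. It assumes the argument directly follows the macro name (optionally after whitespace), and it does not handle comments, verbatim text, optional [..] arguments, a macro name preceded by a \\ line break, or catcode changes.

```python
def find_macro_args(source, macro):
    """Yield the braced argument of each occurrence of the given macro.

    Hypothetical helper: balanced-brace matching only. It ignores
    comments, verbatim text, optional arguments and catcode changes.
    """
    needle = "\\" + macro
    pos = 0
    while True:
        start = source.find(needle, pos)
        if start == -1:
            return
        i = start + len(needle)
        # Reject longer names such as \foobar when looking for \foo.
        if i < len(source) and source[i].isalpha():
            pos = i
            continue
        # Allow whitespace between the macro name and the opening brace.
        while i < len(source) and source[i].isspace():
            i += 1
        if i >= len(source) or source[i] != "{":
            pos = i
            continue
        depth = 0
        j = i
        while j < len(source):
            ch = source[j]
            if ch == "\\":
                j += 2  # skip the escaped character, e.g. \{ or \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Argument text without the outer pair of braces.
                    yield source[i + 1:j]
                    break
            j += 1
        pos = j + 1


text = r"\foo{this is \emph{really} the foo argument} and \foobar{x}"
for arg in find_macro_args(text, "foo"):
    print(arg)
```

Because the scanner only tracks brace depth, the inner \emph{really} is returned as part of the \foo argument unchanged, and the same function can be applied again to that string to look for nested macros.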

score 8 · Accepted Answer · answered Nov 11 '20 at 16:54

Please see if the LatexWalker class of pylatexenc can help:

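from pylatexenc.latexwalker import LatexWalker

# Parse the snippet; \foo and its braced group come back as two separate nodes.
w = LatexWalker(r"\foo{this is \emph{really} the foo argument}")
(nodelist, pos, len_) = w.get_latex_nodes(pos=0)
print(nodelist[0].macroname)         # the macro node
print(nodelist[1].latex_verbatim())  # the group node, nested braces included

Output:

foo
{this is \emph{really} the foo argument}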

Matteo Gamboz


I tried to follow your advice for this, but it is apparently quite complex. – JeT Dec 04 '20 at 11:00
Do you know how to navigate over the "sub-nodes" of nodelist? I mean, nodelist[-1] has almost the entire document, but I don't know how to go through the sections, paragraphs, equations, and so on. – user171780 Apr 04 '21 at 17:26
I have a recursive function that walks the nodes and finds the macro I'm looking for. It does not fit here. If you make a new question I can answer there. – Matteo Gamboz Apr 06 '21 at 08:50
@MatteoGamboz I cannot parse paragraphs as separate objects. Do you know how to do that? Here is my question. – user171780 Apr 22 '21 at 23:35
