25

I've read here and there that ConTeXt can produce XML output. We also have, from time to time, questions about converting LaTeX to different formats. On the basis that "The only parser for TeX is tex", if LaTeX could produce text output instead of PDF then it would be possible to write a style file to convert reasonable input to a different markup language.

Would this be possible?

Bit of background: I encounter this "can we convert from LaTeX?" question in the context of the nLab where the input format is Markdown+iTeX (iTeX not being anything to do with Knuth's proposal but a subset-of-LaTeX-to-MathML converter) but people often have snippets of LaTeX articles that they want to include. So converting all the way to XHTML+MathML via, say, tex4ht isn't the right option. I wrote a Perl script that reimplements much of TeX to do this, but after doing so realised that my style files would work in ordinary LaTeX and produce the "right" output, except that they would be embedded in a PDF. So if I could just persuade TeX to produce text, I'd be almost there. Of course, I could try to extract the text from the PDF but that "feels wrong" and I'd worry about extra stuff sneaking in by accident.

Andrew Stacey
  • What's wrong with a simple pdftotext postprocessor? Extra stuff can sneak in by accident in whatever solution you try. – Aditya May 31 '11 at 18:59
  • @Aditya: I know, but I feel that I have more understanding of what TeX can produce itself. If it is a genuine text file, then I'd know what to filter out afterwards. It's partially to do with my own lack of understanding of things like the PDF format. I guess what I want is for TeX to send all "printable" characters to a "write". – Andrew Stacey May 31 '11 at 20:23
  • @Aditya: How does ConTeXt do this? (I got the quip about parsing TeX from your blog, by the way.) – Andrew Stacey May 31 '11 at 20:23
  • I am not 100% sure how ConTeXt does it. The code is in back-exp.lua. IIUC, it builds a tree of the entire document in memory; each macro and environment defined using \define and \definestartstop hooks into that tree, as do all the core environments (itemize, enumerate, section, etc.). Then, at the end of the document, ConTeXt simply serializes the tree and writes it to a separate text file. – Aditya May 31 '11 at 21:34
  • In my experience, pdftotext is pretty reliable. Just create a pdf with teletype font, no headers and footers (AND no math and no graphics). Think of the pdf output as your text file. – Aditya May 31 '11 at 21:38
  • @Aditya: pdftotext will probably fail horribly with ligatures and anything that really uses Unicode - e.g. I doubt it would do anything reasonable with Hebrew. – Martin Schröder May 31 '11 at 22:38
  • @Aditya: Hmm. Maybe worth looking in to. As Martin says, I would also need to disable all ligatures (and hyphenation). – Andrew Stacey Jun 01 '11 at 09:07
  • @Aditya: Okay, my first experiments are promising. Stick a \ttfamily at the start and that seems to deal with hyphenation and ligatures. I shan't try Hebrew! My sticking point now is getting newlines into the output text, in particular double newlines (in place of \par). – Andrew Stacey Jun 01 '11 at 20:04
  • @Andrew: To get newlines in the output, in ConTeXt I add \setupwhitespace[line] (which, in LaTeX parlance, sets \parskip equal to \baselineskip) and then use pdftotext -nopgbrk -layout. The -layout option also does line wrapping, so you may need to play with the paper width if you want to prevent line wrapping. (The full recipe is sketched after these comments.) – Aditya Jun 03 '11 at 04:57
  • @Aditya: I've now implemented something that uses pdftotext as the last stage ... and it works! I was able to produce the source text to this page: http://ncatlab.org/nlab/show/equivariant+tubular+neighbourhoods by writing a LaTeX document with a special style file. So could you assemble your various comments into an answer which I can accept? (If it's alright by you, after you've done that then I might add some details on exactly what I did, but I'd like to give you the credit for the solution.) – Andrew Stacey Jun 21 '11 at 21:45
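Putting the recipe from these comments together, a minimal sketch of the pdftotext route looks like the following (the wide-paper geometry settings and file names are illustrative guesses, not part of Aditya's exact recipe):

% sketch.tex -- compile with pdflatex, then run:
%   pdftotext -nopgbrk -layout sketch.pdf sketch.txt
\documentclass{article}
% very wide paper so that -layout does not re-wrap lines (illustrative values)
\usepackage[paperwidth=50cm,paperheight=30cm,margin=1cm]{geometry}
\pagestyle{empty}% no headers or footers to leak into the text
\setlength{\parskip}{\baselineskip}% blank line between paragraphs in the output
\begin{document}
\ttfamily% teletype font: no ligatures, and hyphenation is disabled
First paragraph of ordinary text.

Second paragraph of ordinary text.
\end{document}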

4 Answers

7

The underlying solution is of course the same for ConTeXt and LaTeX: you need a way of changing what macros do so that they write the correct output rather than typesetting it. This is also much the same as what tex4ht does. The advantage ConTeXt has is that the macros are provided mainly by one focussed group, and they include the necessary 'back end' to make that conversion easy. To do the same for LaTeX, you need to handle all of the macros that might be present, which is a problem given the number and variety of LaTeX packages. So while in principle it's possible, the implementation is a severe challenge.

(With my 'LaTeX3 hat' on, this is an obvious area to bear in mind when defining an updated format. To do that, you need a much more 'regular' syntax and input than is often the case with LaTeX files at present. Again, I think ConTeXt shows how this can be done, as it is already good at keeping the input within its own structures.)
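As a toy sketch of the redefinition idea (an illustration only, not how tex4ht or ConTeXt actually implement their back ends; the file name converted.txt and the Markdown-ish target are arbitrary choices):

% a toy: make two macros write mark-up instead of typesetting
\documentclass{article}
\newwrite\textout
\immediate\openout\textout=converted.txt
% \string## produces a single printable # in the written line
\renewcommand{\section}[1]{\immediate\write\textout{\string## #1}}
\renewcommand{\emph}[1]{\immediate\write\textout{*#1*}}
\begin{document}
\section{A heading}
\emph{some emphasised text}
\immediate\closeout\textout
\end{document}

The catch, as the comments below make clear, is ordinary paragraph text: it never passes through a macro, so nothing in a scheme like this catches it.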

Joseph Wright
  • I guess the key point is that it partly depends on what package coverage you want. Doing the LaTeX2e kernel, with no hacks in the source, would be relatively straightforward. Throw in some low-level TeX, a few kernel hacks and a load of packages, and things are different. – Joseph Wright May 31 '11 at 17:31
  • Changing the macros is fine. I've basically already done that as part of my Perl script, and I can add more on a case-by-case basis (I'm not looking for a global solution). It's the fundamental business of getting LaTeX to produce ASCII text rather than DVI or PDF or ... that I'm interested in here. As I understand it, tex4ht works by adding hooks into the DVI or PDF and post-processing it. I'd prefer to simply have TeX output the text directly. – Andrew Stacey May 31 '11 at 17:48
  • @Andrew: Ah, I see. – Joseph Wright May 31 '11 at 17:59
  • @Andrew: Do your perl scripts take care of references and citations as well? (including styles, hyperlinks, etc.). – Aditya May 31 '11 at 20:18
  • @Aditya: My perl script is a(n imperfect) TeX implementation. It handles expansion and a few basic primitives. So if I tell it what to do with references and citations, it will do them. At the moment, it converts to Markdown(+iTeX) so it simply converts references to Markdown's reference syntax. I'm not looking for a pure or complete conversion - I don't think such is possible or desirable - but for something that does 90% of the job. As I said above, the bit I really haven't a clue about is getting the output in a text file. – Andrew Stacey May 31 '11 at 20:27
  • A blasphemous suggestion: write a few TeX (rather, ConTeXt) macros that can read LaTeX syntax and convert it to ConTeXt. Apart from the syntax for references, other things in LaTeX are easily convertible to ConTeXt syntax. There was an old project with Brooks Moses (I think) that tried to do this. If you can find that, then most of the groundwork is already done. – Aditya May 31 '11 at 21:40
  • @Aditya: Yes, that was my project. It wasn't that hard to do, just a bit tedious, and I didn't really get all that far with it -- but what I have is here: http://files.dpdx.net/tex/context/latex-compat/. I hadn't thought about it in years; I'm amazed that someone remembers it. – Brooks Moses Jun 02 '11 at 15:59
  • @Brooks: Thanks. You had started this project (including some notes on the wiki on using LaTeX math in ConTeXt) around the time when I first started experimenting with ConTeXt (and didn't really understand TeX programming). I found it very useful for reusing some of the text that was already written in LaTeX. These days I need the reverse: LaTeX macros that translate ConTeXt markup :) – Aditya Jun 03 '11 at 04:37
  • @Joseph: This is definitely a naive question, but what about moving the backend farther back? The true backend is the DVI (or PDF) output routine: presumably all it sees are boxes and glue, rather than macros, so you "just" have to redefine what a box causes to be written to the file (for a text file, you can probably ignore almost all glue subtleties). I somehow doubt this can be effected from within TeX, though. – Ryan Reich Jun 25 '11 at 23:40
5

It is possible to achieve what you want, provided you do not want TeX to act as a parser. In my opinion, part of the success of TeX is that it has managed to transform itself over the years into a language transformation tool. First it was TeX->PostScript and now it is TeX->PDF. Tralics has been fairly successful at producing TeX->XML.

But I think one needs to look at the problem from a different angle. With today's available technologies, one needs a "Universal Mark-up Language". Markdown and YAML are scaled-down tools and can never be full document-description languages, so going that route will limit one's efforts.

Some time back, I designed a CMS based on text files. All mark-up was in plain text, with fragments from Wikipedia's markup language. I would load the text file via PHP, then filter the input and produce the HTML page. For example:

<!--
{{feature-image: http://localhost/images/sample102.jpg }}
{{feature: A collection is like a puzzle...}}
-->

The feature-image was a div and the feature text was the caption. I had commands for image credits and the like.

Now this is not so difficult to produce with TeX. So my proposal is to actually use TeX to write an intermediate mark-up to a text file, then parse that with your language of choice to achieve what you wish.

Depending on the target, the workflow can be one of the following:

   TeX->Intermediate MarkUp->HTML
   TeX->pdf
   TeX->plain text
   Intermediate MarkUp->Translator (javascript, perl, python, 
                        ruby, php, your language) ->TeX

In a nutshell, retain TeX and output into a new mark-up language. Markdown and other technologies can be a subset of this.

\documentclass{article}
\usepackage[demo]{graphicx}
\usepackage{verbdef}
\begin{document}
\makeatletter
%% create the output file and open it for writing
\newwrite\file
\immediate\openout\file=wikimark.wiki
%% switches selecting the output mode
\newif\if@wikimark
\newif\if@html
\@wikimarktrue

%% user-level commands test the switch and either write
%% mark-up or typeset as usual
\def\image#1#2{%
  \if@wikimark
    \image@@{#1}{#2}%
  \else
    \includegraphics{dummy.png}%
  \fi
}

\def\Section#1{%
  \if@wikimark
    \section@@{#1}\relax
  \else
    \section{#1}%
  \fi
}

%% write {{img:...}} and {{img-caption:...}} lines to the wiki file;
%% \string turns the braces into printable characters
\def\image@@#1#2{%
  \immediate\write\file{\string{\string{img:#1\string}\string}}
  \immediate\write\file{\string{\string{img-caption:#2\string}\string}}
}

%% \hash@@ expands to two printable # characters
\edef\hash@@{\string#\string#}

\def\section@@#1{%
  \immediate\write\file{\hash@@ #1}
}

\makeatother

\Section{Test Section}

\image{http://tex.stackexchange.com/questions/15440/parsing-files-through-lua-tex}{This is the caption}

\immediate\closeout\file
\end{document}

This minimal example is just a proof of concept. The main idea here is not to redefine the LaTeX commands but rather to add new ones with switches for other mark-up.

yannisl
  • That sounds pretty similar to what I want, apart from the fact that I don't understand your statement "provided you do not want TeX to act as a parser". I want TeX to properly expand its input, acting accordingly, but I want the output file format to be text. I can do this with my Perl version of TeX, but after writing it I felt a better solution would be possible if only I could make TeX output actual text. – Andrew Stacey May 31 '11 at 18:46
  • @Andrew Stacey: What I mean is that this works as long as one does not expect TeX to read a file with alien mark-up like HTML and try to convert it. Every time TeX writes to an auxiliary file or a log, it writes text! PGF can export a table to HTML. To transform to MathML I would add the tags, include the maths verbatim, and write straight to the file. Going the LuaTeX way might make things easier with the file operations. – yannisl May 31 '11 at 18:57
  • Yiannis: Right, that's fine. All I want is for TeX to process TeX, nothing more. I feel that I can handle the actual conversion bit. It's the sheer mechanics of outputting the main stream to a text file that I don't know how to do. – Andrew Stacey May 31 '11 at 20:24
  • @Andrew OK, I will post a minimal in the morning, getting too late here! – yannisl May 31 '11 at 21:05
  • Yiannis: It looks promising, but how does the ordinary text get in to the new file? – Andrew Stacey Jun 01 '11 at 09:05
  • @Andrew: Everything should normally be in commands except paragraphs. I tried this quickly with \everypar but it did not work out. It will be much easier to make an environment to specify plain text as such, more or less as in ConTeXt (a sketch of this approach follows these comments). This will require a bit more effort from the user, but the benefits outweigh the disadvantages. What do you think? – yannisl Jun 01 '11 at 09:17
  • So do you mean wrapping a paragraph in \begin{markuptext} ... \end{markuptext}? Then it would have to read in all the text in the paragraph, process it, and write it to the file. Is there no way to just make characters write themselves to some file? – Andrew Stacey Jun 01 '11 at 10:19
  • (I'd be happy to use lualatex!) – Andrew Stacey Jun 01 '11 at 10:19
  • @Andrew yes that is my suggestion. See also this link http://tex.stackexchange.com/questions/15440/parsing-files-through-lua-tex – yannisl Jun 01 '11 at 12:36
  • @YiannisLazarides: Your suggestion is a great one; is it possible to send a normal paragraph to the output file as well? Please suggest... – MadyYuvi May 03 '21 at 14:40
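A sketch of the environment approach from these comments, assuming the environ package (the environment name and file handling are illustrative; \detokenize writes the captured body out as plain characters, and anything stripped when the body is first read, such as comments, is lost):

\documentclass{article}
\usepackage{environ}
\newwrite\file
\immediate\openout\file=wikimark.wiki
% capture the body in \BODY, write it to the file, then typeset it as usual
\NewEnviron{markuptext}{%
  \immediate\write\file{\detokenize\expandafter{\BODY}}%
  \BODY
}
\begin{document}
\begin{markuptext}
This paragraph ends up both in the PDF and in wikimark.wiki.
\end{markuptext}
\immediate\closeout\file
\end{document}
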
4

In the interests of completeness, I feel I should record my current solution (my gut instinct is that this is the best method, but the exact implementation could probably do with improvement). That is to take a leaf out of the ConTeXt book and use LuaTeX. LuaTeX provides me with some hooks to get at the processed output of TeX just before it is packed into boxes and shipped out.

Specifically, I used the hook pre_linebreak_filter to dig out the contents of each line. It's actually not a long way from the idea of StrongBad's answer, just without all the unnecessary stuff and with a bit more control over things like groupings.
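As a minimal standalone sketch of the idea (this is not the implementation in the repository linked below; the callback is real LuaTeX, but the node handling here is deliberately naive: every glue becomes a space, and kerns, discretionaries and maths are ignored):

% compile with lualatex
\documentclass{article}
\directlua{
  plaindump = io.open("plain.txt", "w")
  local GLYPH, GLUE = node.id("glyph"), node.id("glue")
  luatexbase.add_to_callback("pre_linebreak_filter",
    function(head)
      for n in node.traverse(head) do
        if n.id == GLYPH then
          plaindump:write(unicode.utf8.char(n.char))
        elseif n.id == GLUE then
          plaindump:write(" ")
        end
      end
      plaindump:write("\string\n\string\n")
      return true
    end,
    "plain-text-dump")
}
\AtEndDocument{\par\directlua{plaindump:close()}}
\begin{document}
Hello world. This is a test paragraph.
\end{document}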

My implementation can be found at my github repository for my project. I don't think it can be just cut-and-pasted into something else as it is integrated with some other parts of my project so someone wanting to use the idea would need to untangle it a bit. The crucial files are the Lua file textoutput.lua, mainly the function list_elements, and the TeX file internettext.code.tex, specifically the "true" branch of the conditional \@ifundefined{directlua} on line 53 (at time of writing).

Also, as I said at the outset, although I think this is the right strategy, it is probably not the best implementation.

Andrew Stacey
1

The wordcount package sets up LaTeX such that every character, space, line break, etc. is added to the log file. This means that

\documentclass{article}
\begin{document}
Hello World
ff and fi
\(y=\alpha x+\beta\)
\begin{tabular}{c|c|c}
a&b&c\\
\end{tabular}
\end{document}

results in a log file that includes:

...\3.08632 H
...\3.08632 e
...\3.08632 l
...\3.08632 l
...\3.08632 o
...\3.08632 W
...\3.08632 o
...\3.08632 r
...\3.08632 l
...\3.08632 d

...\3.08632 ^^[ (ligature ff)
...\3.08632 a
...\3.08632 n
...\3.08632 d
...\3.08632 ^^\ (ligature fi)

...\3.08632 y
...\3.08632 =
...\3.08632 
...\3.08632 x
...\3.08632 +
...\3.08632 ^^L

.......\3.08632 a
.......\3.08632 b
.......\3.08632 c

I am not sure whether it is possible to get LaTeX to write cleaner output to a separate file, or how to deal with the full set of Unicode characters, but the whole idea of wordcount is that you can then parse the log file for characters and spaces.

StrongBad
  • Thanks for the answer. One issue with this approach is that the log file very quickly becomes excessively large. But it's a good idea for short documents. – Andrew Stacey Aug 31 '16 at 20:17
  • @LoopSpace for a book or a thesis, the file size might be a problem, but the log file is under 2 MB for an academic article of around 10 pages. – StrongBad Aug 31 '16 at 21:16
  • It's been a while since I looked at this, but I did do some experiments with this and found that producing the log file made TeX run slower than I would have liked (particularly as I tend to recompile quite frequently). – Andrew Stacey Aug 31 '16 at 21:26