
I'm surprised/confused by LaTeXML's handling of lists. With an example like this called test.tex:

\documentclass{article}
\begin{document}

Hello itemize:
\begin{itemize}
  \item one
  \item two
\end{itemize}

\end{document}

Running the command

latexml test.tex | latexmlpost - --destination=test.html

Produces the following HTML: (snipped to the interesting bits)

<article class="ltx_document">
<div id="p1" class="ltx_para">
<p class="ltx_p">Hello itemize:</p>
<ul id="I1" class="ltx_itemize">
<li id="I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span> 
<div id="I1.i1.p1" class="ltx_para">
<p class="ltx_p">one</p>
</div>
</li>
<li id="I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span> 
<div id="I1.i2.p1" class="ltx_para">
<p class="ltx_p">two</p>
</div>
</li>
</ul>
</div>
</article>

This is strange: not only is the bullet typeset explicitly (which I wouldn't have expected), but, more importantly, the content of each item is wrapped in div and p tags. Both are block-level elements, so without extra CSS there's a line break between the bullet and the item text!


Obviously that's not a problem if I'm using the latexml CSS file, but I'd like to re-use the HTML in a context in which I can't customise the CSS used.

Anyone have any ideas on how to "improve" the output here?

I'm using version

latexml (LaTeXML version 0.8.2; revision 644644a)
  • Is there some reason not to use pandoc? It converts your MWE into a simpler HTML file without the SPAN and DIV tags. – Fran Feb 08 '17 at 18:49
  • Thanks for the suggestion, I'll look into pandoc. Can it deal with semi-arbitrary code like \newcommand? – Will Robertson Feb 09 '17 at 07:32
  • Yes, the macro \foo defined as \newcommand\foo{bah} is converted to HTML as <span>bah</span> (obviously losing the definition). – Fran Feb 09 '17 at 07:52
  • That was a bad example from me: it looks like pandoc treats \newcommand as a special case :) To be honest, I suspect that I could get pandoc to do what I needed... but I like the degree to which latexml actually understands the LaTeX source. Whether I take advantage of such things remains to be seen! – Will Robertson Feb 09 '17 at 08:44
  • But macros are special cases. There is no way (afaik) to define such macros in pure HTML, nor are there equivalent HTML tags for every macro that you can invent. So the most reasonable approach, to me, is to parse the LaTeX definitions and emit the expanded macro. Otherwise, what should be done? – Fran Feb 09 '17 at 14:00

3 Answers

2

Interesting situation; clearly what you're asking for is the "correct" result for the case you've given. LaTeXML's more general approach is attempting to model the wide variety of things that LaTeX typically throws at it. For example, any itemization can contain an arbitrary \item[$a+b$] which needs markup, rather than attribute values, hence the span for the bullet. Not to mention, the author may have redefined \labelitemi. Likewise, any item can contain multiple paragraphs, hence the div, p elements. So, at least that's the excuse for LaTeXML's internal XML. How do other processors handle such cases?

I guess a solution might be for LaTeXML's xslt to examine the itemizations for non-special cases, and dumb them down. Hmm... Really no possibility of sneaking some CSS in there?

  • Thanks for the reply Bruce! I'm guessing other tools don't do it as comprehensively as latexml :-) CSS can be placed inline in HTML elements right? I'm not saying this should happen out of the box, but maybe I can adapt the post processor to add the necessary CSS declarations in this case. (Even if it's an ugly regexp hack!) – Will Robertson Feb 10 '17 at 01:34
  • I just tried embedding the CSS into the webform in between <style> tags, with no luck: it is stripped out :) I'll try to boil down the required declarations to add in <div style="..."> next… – Will Robertson Feb 10 '17 at 07:10
  • …and either my HTML+CSS skills aren't up to scratch, or it's not really possible to do advanced CSS inline in a style= attribute. I worked around the problem as shown in my answer above/below. – Will Robertson Feb 11 '17 at 11:53
  • (Odd, my reputation is too low to comment on an answer, but I can comment on a comment?) Your solution sounds painful! OTOH, generically dumbing down the markup would lead to inconsistencies, probably should be optional. Further discussion on that is probably less of general interest, so followup on Will's github issue: [https://github.com/brucemiller/LaTeXML/issues/829] – B. Miller Feb 11 '17 at 13:00
1

I think Bruce's answer that latexml's output is correct in the general sense is persuasive. For my purposes, however, I was willing to compromise in the interests of, well, having HTML that displayed sensibly without needing the associated CSS file for correct rendering.

For me, that involved using a regular expression to strip out the extra markup that causes the issue:

# -i.bak edits test.html in place and keeps a backup as test.html.bak (accepted by both GNU and BSD sed)
sed -i.bak -e 's/<span class="ltx_tag ltx_tag_itemize">•<\/span>//g' \
           -e 's/class="ltx_item" style="list-style-type:none;"/class="ltx_item"/g' test.html

In short, this strips out the explicit bullet and removes the directive that turns off the automatic bullet. I will probably need something similar for enumerate, but description already works well as it is.
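The same two substitutions can also be written as a stdin-to-stdout filter, which sidesteps the differences in how GNU and BSD sed treat the `-i` option. This is only a sketch: the function name `simplify_items` and the output file name `test-simple.html` are made up for illustration, and the patterns assume the markup is exactly what LaTeXML 0.8.2 produced in the question.

```shell
# Hypothetical helper: reads LaTeXML HTML on stdin, writes simplified HTML to stdout.
simplify_items() {
  # 1st expression: drop the explicit bullet span.
  # 2nd expression: drop the inline style that suppresses the browser's own bullet.
  sed -e 's/<span class="ltx_tag ltx_tag_itemize">•<\/span>//g' \
      -e 's/class="ltx_item" style="list-style-type:none;"/class="ltx_item"/g'
}

# Example use (assumes test.html was produced as in the question):
#   simplify_items < test.html > test-simple.html
```

Because the filter never touches its input file, the original latexmlpost output stays available for comparison.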

1

As OP has seen on the github issue, you can now use

latexmlpost --dest=test.html --xsltparameter=SIMPLIFY_HTML:true test

to get the html

<article class="ltx_document">
<div id="p1" class="ltx_para">
<p class="ltx_p">Hello itemize:</p>
<ul id="I1" class="ltx_itemize">
<li id="I1.i1" class="ltx_item">
<div id="I1.i1.p1" class="ltx_para">
<p class="ltx_p">one</p>
</div>
</li>
<li id="I1.i2" class="ltx_item">
<div id="I1.i2.p1" class="ltx_para">
<p class="ltx_p">two</p>
</div>
</li>
</ul>
</div>
</article>

This will skip over any optional argument to \item[]. If someone wanted to mix the two, they would probably need to \usepackage{latexml}, sprinkle in some \lxAddClass{simpleList}, and create a custom xslt for latexmlpost to use.

Teepeemm