How do I convert LaTeX containing arbitrary Unicode to XHTML?

Question

I was reading this question (How to get Unicode characters into HTML output), and because the asker's intent is simply to output the text to HTML, I was wondering whether there is a command that simply does a binary dump of its content. I could see this being useful for embedding certain types of graphics and, specifically, the UTF-8 text in the question above.

I want the following LaTeX:

\documentclass{article}
\begin{document}

\specialCommand{any unicode Character 字}

\end{document}

\specialCommand is optional, but I thought it might make things easier. The result should be the following (the HTML has been simplified):

<?xml version="1.0" encoding="utf-8" ?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">  
<html xmlns="http://www.w3.org/1999/xhtml"  > 
<head><title></title> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
<link rel="stylesheet" type="text/css" href="html.css" /> 
</head><body>
<p class="noindent" >any unicode Character 字
</p>

</body></html>

I am running htlatex "html.tex" "xhtml, charset=utf-8" " -cunihtf -utf8".

I’m not clear exactly what you’re asking. A binary dump of which content, in what format? Are you thinking of something along the lines of \typeout, or something different? — Davislor, Aug 11 '18 at 05:08
@Davislor it is not clear to me what typeout does but I have added a very simple example. — William, Aug 11 '18 at 13:03
remove the \typeout (and perhaps remove \detokenize, depending on the options you are using with htlatex) — David Carlisle, Aug 11 '18 at 13:57
don't use htlatex, it just cannot do this. make4ht -ul html.tex will compile your example correctly. — michal.h21, Aug 11 '18 at 15:20
@michal.h21 thank you, you don't need to delete your answer. I will edit the title and code to make it clear that I don't need a raw data dump and that I'm okay with not using a command. I am just trying to get the text to come across. — William, Aug 11 '18 at 15:22
@William OK, I've undeleted my answer, I hope it solves your issue — michal.h21, Aug 11 '18 at 16:08

score 5 · Accepted Answer · answered Aug 11 '18 at 15:16

If you want to use arbitrary Unicode characters, you need to use a TeX engine with Unicode support, which means either LuaTeX or XeTeX. htlatex uses pdfTeX as the compilation engine. pdfTeX has only limited Unicode support, and it is quite hard to get it to support CJK characters. You cannot get your sample document to output the CJK character even if you compile it with pdflatex, so it isn't really a surprise that htlatex doesn't produce anything either.

Fortunately, you can select a different engine for tex4ht by using an alternative build script, make4ht. Specifically, it supports full Unicode with LuaTeX, so even if your sample doesn't produce the wanted output with lualatex (you would at least need to select a font with the necessary glyphs), it can produce the wanted HTML. Try the following command, where -u requests UTF-8 output and -l selects LuaLaTeX:

make4ht -ul html.tex

It produces this:

<!--l. 4--><p class="noindent" >any unicode Character 字
</p>
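
On the remark above that lualatex itself needs a font with the necessary glyphs: a minimal sketch of what that could look like for PDF output is shown here. The font name Noto Serif CJK SC is only an assumption; substitute any installed font that contains the characters you need. This concerns the lualatex PDF run only; the make4ht command above produced the HTML from the unmodified sample.

```latex
\documentclass{article}
% fontspec lets LuaLaTeX (or XeLaTeX) load system fonts by name.
\usepackage{fontspec}
% Assumption: a font covering CJK glyphs is installed under this name.
\setmainfont{Noto Serif CJK SC}
\begin{document}

any unicode Character 字

\end{document}
```

Compile it with lualatex rather than pdflatex, since fontspec requires a Unicode engine.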

This is baffling to me, but some ASCII characters, for example <, do not seem to come across with this solution. — William, Aug 13 '18 at 20:09
@William yes, you need to use \HCode for special HTML characters, like <, > or &. — michal.h21, Aug 13 '18 at 20:51
