How many LaTeX characters have Unicode equivalents, and which characters and mathematical character combinations cannot be represented by Unicode?

Question

I would like to know how many LaTeX characters, including all the special math symbols, can be represented by Unicode. Would I be right to say that nowadays LaTeX's strengths over Unicode are mainly its ability to draw diagrams, which cannot be created with Unicode alone?

Also, if for example I need to typeset something like "integral from a to b" with subscripts and superscripts, or even "e to the power of (x to the power of y)", can I do this with Unicode, or do I need LaTeX in order to generate the smaller superscripts and subscripts?

Also, to what extent can I combine LaTeX with Unicode math characters and would the typesetting be the same if I use LaTeX notation to generate math characters rather than Unicode characters serving the same purpose?

Thanks.

It's not quite clear what the context of your question is: (La)TeX is a typesetting system and Unicode is a character encoding. So whether the characters come from a collection of small 256-character fonts or from one large Unicode font, you still need a math-capable typesetting system to typeset the mathematics — David Carlisle, Feb 07 '15 at 16:14
As for using Unicode fonts and Unicode input with LuaTeX or XeTeX (Unicode TeX variants), there are good answers already; I'll find one and post it here — David Carlisle, Feb 07 '15 at 16:15
see http://tex.stackexchange.com/questions/118244/what-is-the-difference-between-unicode-math-and-mathspec — David Carlisle, Feb 07 '15 at 16:16
Thank you for your pointers, but I was really looking for a list of limitations of using Unicode to enter math as opposed to using any flavor of LaTeX. — John Sonderson, Feb 07 '15 at 16:34
But as I say, Unicode is just a list of characters; you need a typesetting system. Unicode says nothing about how to set matrices or fractions. It's like asking, for plain English text, whether you need LaTeX or ASCII: they just are not comparable. — David Carlisle, Feb 07 '15 at 16:39
To put it another way, if you don't use TeX as the typesetting engine, what would you use: Word? An HTML browser? Any of these are possible, and you can ask about the relative merits of math typesetting in such systems, but with just Unicode and no typesetter you cannot typeset "hello", never mind \sqrt{x} — David Carlisle, Feb 07 '15 at 16:45
@DavidCarlisle - The OP may (or may not...) be thinking of entering \xi versus its Unicode-encoded sibling ξ, \Gamma versus Γ, etc. — Mico, Feb 07 '15 at 17:03
@DavidCarlisle - I'm guessing the OP is thinking of entering ξ and Γ directly into a LaTeX document, rather than having to type \xi and \Gamma. At least, that's what I'm guessing the posting's third paragraph is all about. — Mico, Feb 07 '15 at 17:28
Yes, @Mico understood what I was asking about. I am also unclear at the moment about how well Unicode supports entering braces of different sizes (for example for entering matrices), multiple stacked subscripts or superscripts, \stackrel effects where you can insert something above an equal sign or other symbol, etc. I don't know very well how the Unicode support for these special math effects works. Thanks. — John Sonderson, Feb 07 '15 at 23:46
@DavidCarlisle, how sure are you that Unicode can't typeset \sqrt{x}? I thought Unicode had special Unicode formatting characters used for such purposes. Am I wrong here? If it doesn't then Unicode presents a severe disadvantage. — John Sonderson, Feb 07 '15 at 23:48
Unicode itself is just a list of characters; it can't typeset anything. It is like ASCII but with a bigger list. Even if you are writing in English, ASCII is not enough: you need LaTeX or Word or something to arrange the letters. There is Unicode Technical Note 28 (UTN28), which is essentially the linear format in Word and, as the input format to Word, needs that program to typeset the markup (which is similar to TeX with most \ omitted). Perhaps you are thinking of this, but it isn't Unicode, it is the MS Word input form. http://unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf — David Carlisle, Feb 07 '15 at 23:55
Well, I was thinking about section 2.8 of this document, but should have looked better. Apparently Unicode has separate character encodings for certain superscript and subscript characters, but that means you can't turn anything you like into a superscript or subscript, as that would triple the size of Unicode itself. Besides, you can't make superscripts of superscripts and so on, never mind how such superscripts are nested: it's a poor approach, and in fact the Unicode document I mentioned discourages such use of Unicode. Thanks. — John Sonderson, Feb 08 '15 at 00:06

score 10 · Accepted Answer · answered Feb 07 '15 at 21:45

partial, and probably unsatisfactory, answer.

unicode alone can't do everything. for example, if you want an integral from x=1 to \infty, unicode has the codes, but it isn't by itself able to position sub/superscripts or limits. so at a minimum, some sort of markup and composition facility is required.
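
As a small illustration of the point about markup, here is a minimal LaTeX sketch of the integral mentioned above, together with the nested exponent from the question. The integrand is only an assumed example; what matters is that the markup, not the character set, tells the engine where the limits and the second-level superscript go.

```latex
\documentclass{article}
\begin{document}
% Limits are attached with _ and ^; the engine chooses their size and position.
\[ \int_{1}^{\infty} \frac{dx}{x^{2}} = 1 \]
% A superscript of a superscript: y is set smaller than x, which is smaller than e.
\[ e^{x^{y}} \]
\end{document}
```

Unicode does have code points for ∫ and ∞, but nothing in a plain character sequence says that 1 and ∞ are the limits of the integral.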

markup could as well be mathml as latex, but that's up to whoever is preparing the document.

as for whether "all" latex characters/symbols are covered by unicode, the effort made for the benefit of stipub (see http://www.ams.org/STIX for the history of the stix project) attempted to get as many such symbols as possible accepted into unicode. if a symbol was requested by one of the stipub organizations, then it went onto the list, and by and large the unicode technical committee received that request as an acceptable level of documentation. for some edge cases (some symbols in the stmaryrd collection or in tipa, for example) which were not on the main stipub lists, additional documentation -- in the form of articles or books published by recognized technical publishers -- was required, and in its absence, no action was taken. (if someone can provide a suitable citation for any "missing" symbol, the effort to add new symbols is ongoing.)

what did happen is that the unicode technical committee accepted the proposition that math notation is effectively a "language", and as such, symbols in common use should be encoded just as letters for "minor" human languages, alive or dead, are encoded. this is what is required for mathematicians and other scientists to communicate on the web.

i am not aware that a complete list of symbols, with their visual representation and associated unicode (and, potentially, a "tex name") exists yet. i hope that this information can be added to the "comprehensive symbols list" (texdoc comprehensive), but that is a massive undertaking (in which i am willing to participate, but haven't yet contacted the author to that effect). and some glitches in the stix fonts, which were the outcome of the stix project, remain to be ironed out, in particular the location of quite a few "unicodes" in the private use area.

regarding direct use of unicodes or the "native" symbols vs. "tex names", ability to do so depends on the engine in use. it's probably not possible with pdflatex, but should be relatively straightforward with xelatex provided suitable fonts are available.
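
A minimal sketch of direct Unicode input, assuming the unicode-math package and its default math font (Latin Modern Math) are installed, and that the file is saved as UTF-8 and compiled with xelatex or lualatex. The formulas themselves are illustrative assumptions.

```latex
% Compile with xelatex or lualatex, not pdflatex.
\documentclass{article}
\usepackage{unicode-math} % loads fontspec and sets up an OpenType math font
\begin{document}
% Literal Unicode characters and the usual control sequences can be mixed.
\[ ξ + Γ = \xi + \Gamma \]
% The layout (limits, superscripts) still comes from TeX markup.
\[ ∫_{1}^{∞} x^{-2}\,dx = 1 \]
\end{document}
```

With the package's default settings, the literal characters and the corresponding commands should produce the same output, since both end up selecting glyphs from the same font.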

barbara beeton

For one symbol that, as far as I know, is not on the list, see the standard state symbol used in chemistry. – Joseph Wright Feb 07 '15 at 21:58
At the end of Symbol.java there are some mappings from LaTeX commands to Unicode and Ding.java has the mappings from pifont's \ding command to Unicode. I also made a start on wasysym but it's not finished. – Nicola Talbot Feb 07 '15 at 22:18
If you hover over the top of each cell in these tables it shows the unicode slot and the unicode name and any xml entity names, I could give tex names as well, if that would make sense, the tex data is in the xml source file for those tables. – David Carlisle Feb 08 '15 at 00:02
@DavidCarlisle, very useful table, but I couldn't spot the normal cartesian product symbol \times, not sure where to find it. My intuition is that there must be other Unicode tables of mathematical operators as well. Any idea where? – John Sonderson Feb 08 '15 at 14:57
@JohnSonderson For historical reasons that isn't in a math block; it is in the Latin-1 block at U+00D7. Those tables (if you navigate the prev and next links) include every character marked in Unicode for math use, plus all the common Latin, Greek, Cyrillic and Arabic characters. The appendix at the end links to unicode.xml, which is the source and lists every character in Unicode 7, along with its names in XML entity sets and common LaTeX packages, and various other bits of information – David Carlisle Feb 08 '15 at 15:03
Thank you @DavidCarlisle for pointing that out. Still a couple of things. Why don't the \mathbb{R} and \mathbb{Z} symbols (and others) show up under this link? I've tried both Firefox and Chrome, but some character cells are blank (they display an orange background and nothing else). Also, where are the Russian letters in the list? I couldn't spot them (several math books make use of Russian letters as well). Thanks. – John Sonderson Feb 08 '15 at 15:39
This up-to-date link displays the same problem. I can't believe the space for \mathbb{R} has been reserved and left unassigned. I don't understand the Unicode team's logic in leaving these letters out. Perhaps I'm missing something. – John Sonderson Feb 08 '15 at 15:46
@JohnSonderson ℝ (U+211D) and ℤ (U+2124) are in the "Letterlike Symbols" block. – Nicola Talbot Feb 08 '15 at 15:52
@JohnSonderson the common number sets NPCRQ etc were already in unicode, and when the full alphabets were added, the rule that no existing characters are duplicated won. A bit unfortunate but that's what happens when you deal with legacy data – David Carlisle Feb 08 '15 at 15:54
cyrillic is http://www.w3.org/2003/entities/2007doc/004.html – David Carlisle Feb 08 '15 at 15:55
I was also able to find a list of the aforementioned letterlike symbols Unicode block here. Thank you for pointing these out. – John Sonderson Feb 08 '15 at 16:01
@JohnSonderson -- david and nicola have already given the historical "justification" for the presence of "holes" in the unicode plane 1 "mathematical alphanumerics" block. unicode never, never inserts duplicates, although the same shape may be attached to two different unicodes if they have different meanings (and follow different spacing rules in print). the unicode technical report #25, unicode support for mathematics covers this on p.7, in the paragraph following table 2.1. – barbara beeton Feb 08 '15 at 16:01
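
To see the "holes" from the LaTeX side, here is a hedged sketch that again assumes unicode-math with its default font under xelatex or lualatex. The choice of letters is illustrative.

```latex
% Compile with xelatex or lualatex.
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
% Double-struck R and Z live in the Letterlike Symbols block (U+211D, U+2124),
% so the matching slots in the plane 1 alphanumerics block stay unassigned.
\[ \symbb{R} = ℝ \qquad \symbb{Z} = ℤ \]
% Double-struck A had no earlier encoding, so it sits in plane 1 (U+1D538).
\[ \symbb{A} = 𝔸 \]
\end{document}
```

The author types \symbb{R} or \symbb{A} in the same way; the package is expected to pick the right code point in either block.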
Thanks @barbarabeeton. I now understand why some Unicode sequences (such as those I was asking about) do not correspond to any actual character. – John Sonderson Feb 08 '15 at 16:03

score 5 · Answer 2 · answered Feb 08 '15 at 12:40

Many of the applications I write have some connection to LaTeX. Either they create files that LaTeX inputs or they parse LaTeX code, so I've had to produce some mappings between Unicode and LaTeX commands.

flowframtk is a graphical application that can export to .tex (or .sty or .cls) and Unicode characters entered into the graphical environment can be mapped to LaTeX commands. If you install flowframtk, run the application and then quit it, you should find a directory containing the application settings (~/.flowframtk/ on Unix-like systems or flowframtk-settings on Windows). This directory should include (amongst other files) the text-mode (textmappings.prop) and math-mode (mathmappings.prop) mappings. The files are tab-separated with three columns. The first has the Unicode code point, the second the closest LaTeX equivalent and the third the package(s) required. The files are too large to reproduce here. (The text mode mappings has 200 lines and the math mode mappings has 795 lines.)
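
For readers who want a similar Unicode-to-command mapping inside a pdfLaTeX document, the sketch below uses the newunicodechar package. This is not flowframtk's file format or its output; the three characters chosen and their replacements are illustrative assumptions.

```latex
% Compile with pdflatex; the source file must be saved as UTF-8.
\documentclass{article}
\usepackage[utf8]{inputenc}  % already the default in recent LaTeX releases
\usepackage{amssymb}         % provides \mathbb
\usepackage{newunicodechar}
% Each line maps one Unicode character to its closest LaTeX equivalent.
\newunicodechar{ℤ}{\ensuremath{\mathbb{Z}}}
\newunicodechar{ξ}{\ensuremath{\xi}}
\newunicodechar{≤}{\ensuremath{\leq}}
\begin{document}
Let $ξ ≤ 1$ for some $ξ$ in ℤ.
\end{document}
```

Each declaration carries the same information as a row of such a mapping table: the character, the command, and (via the preamble) the package it needs.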

The texparser library is designed to parse LaTeX files, but also contains mappings from LaTeX to Unicode, although these are contained within the Java source code.

Nicola Talbot

So, given your comments, would I be correct to say that unicode-math is not a package designed primarily for end users, but rather a package designed for tools which make use of it to more easily convert back and forth between one format and another? – John Sonderson Feb 08 '15 at 14:59
@JohnSonderson I don't know. I've never used unicode-math, but I can see that it would be useful for people who are accustomed to typing Unicode characters. (Personally, I find it easier to type, say, \mathbb{Z} rather than ℤ.) My TeXparser library is designed to work with (PDF)LaTeX rather than XeLaTeX/LuaLaTeX as that's the format required for the articles I have to process, and if flowframtk didn't have the mappings, it would put off people who specifically want to use PDFLaTeX rather than XeLaTeX/LuaLaTeX. – Nicola Talbot Feb 08 '15 at 16:05
Yes. Besides, having to memorize code points and to type more digits and fewer letters makes entering Unicode characters impractical (unless someone can answer this post with a viable answer). So, yes, AFAIK entering non-Unicode is better, also because some text editors cannot properly render all Unicode text (at least not with most fonts I'm aware of, including some default ones). – John Sonderson Feb 08 '15 at 16:12
