Why doesn't LaTeX use Unicode characters for subscripts and superscripts in the context of math?

Question

I'm very new to LaTeX but have found it enjoyable to work with. Lately I've been focusing on how to get output that will copy/paste well from PDF with special characters. Currently, XeLaTeX with the fontspec package seems to do the job rather nicely. However, I notice that subscripts and superscripts in math, when copied and pasted, come out as the Unicode character for the full-sized glyph rather than the Unicode character for the subscript or superscript. Why is that?

Aside from any other considerations (such as requiring a font that supports them), there are only a very limited number of Unicode superscripts and subscripts. You'd end up with a mixture of sub/super-script glyphs for the supported ones and full-sized glyphs for the unsupported characters. — Nicola Talbot, Apr 30 '18 at 19:34
PDF is not designed for copying from. That you get anything close at all is always a miracle. — ShreevatsaR, Apr 30 '18 at 23:42
Actually it's possible to make copied content different from displayed content; see "In which way have fake spaces made it to actual use?" and "Copyable math formulas" on TeX - LaTeX Stack Exchange. — user202729, Dec 27 '21 at 05:20
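
A minimal sketch of the idea in the last comment, assuming the accsupp package and pdfLaTeX or LuaLaTeX. Driver support under XeLaTeX and viewer support for ActualText both vary, so treat this as an illustration rather than a recommended setup. The helper name \copyas is hypothetical and not taken from the linked questions. The printed exponent is the ordinary digit 2; the text attached for copying is U+00B2 SUPERSCRIPT TWO.

```latex
\documentclass{article}
\usepackage{accsupp} % provides \BeginAccSupp / \EndAccSupp

% Hypothetical helper (not from the thread): typeset #1 as usual,
% but attach the UTF-16BE hex code #2 as its copy/paste text.
\newcommand*{\copyas}[2]{%
  \BeginAccSupp{method=hex,unicode,ActualText=#2}#1\EndAccSupp{}%
}

\begin{document}
% Displayed: x with a script-size 2 placed by the math engine.
% Copied (if the viewer honours ActualText): x followed by U+00B2.
$x^{\copyas{2}{00B2}}$
\end{document}
```

This keeps the layout entirely in the hands of the math engine and changes only what a PDF viewer reports on copy, which is why it only helps for the few exponents that have a Unicode superscript form.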

score 7 · Answer 1 · answered Apr 30 '18 at 23:16

Unicode superscripts are handy for small in-text superscripts such as x² but they mostly would get in the way in math typesetting where you need to position superscripts at different heights depending on the size of the base, or the presence of a subscript, and you want the formatting of x^2 to be consistent with that of x^{(x+\sqrt{y})} which is difficult to achieve if "simple" cases are set by the font machinery using Unicode superscript characters from the base fonts and "complicated" cases are set by the math layout engine using a script sized font.

David Carlisle
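
To make the point of the answer concrete, here is a small test file. It is a sketch, not taken from the answer, and assumes XeLaTeX or LuaLaTeX with fontspec and a default font that contains U+00B2 (Latin Modern does). The text line uses the font's fixed superscript glyph; the math lines let the layout engine choose the size and height of each exponent.

```latex
\documentclass{article}
\usepackage{fontspec} % compile with XeLaTeX or LuaLaTeX

\begin{document}
% Text mode: the font's own U+00B2 glyph, with a fixed size and height.
Area in m² and x².

% Math mode: the exponent is a script-size "2" placed by the math engine.
% Its height depends on the base and on whether a subscript is present.
$x^2 \quad X^2 \quad x_1^2 \quad \left(\frac{a}{b}\right)^2$

% A compound exponent like this is set by the math engine in script size.
$x^{(x+\sqrt{y})}$
\end{document}
```

Comparing the first math line with the text line shows the inconsistency the answer describes: only the math-mode exponents adapt to their base.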

I think the question is not exactly about typesetting using Unicode superscript characters, but about why, when one copies from the PDF, the copied text doesn't contain Unicode superscripts. Nevertheless the reasons listed in this answer still apply. – ShreevatsaR Apr 30 '18 at 23:54
can't you improve that: there is still one comma (even two! I am disappointed) in your paragraph, which allowed me to take my breath midway :) – May 01 '18 at 09:44
Aha! This is a helpful reminder to me that LaTeX is first and foremost intended for layout and one should NOT cast cares of data definition upon it. Many thanks to all. – Robert Browder May 03 '18 at 13:00
@RobertBrowder actually I would draw the opposite conclusion from this answer. If you write x^2 or \operatorname{power}(x,2) the 2 is the same thing, the fact that one notation raises it in a smaller font is purely a visual artefact. That said it is more natural to use the same (standard) 2 in the encoding of both if you want to use the same Unicode character for the same semantic data. – David Carlisle May 03 '18 at 13:12
@jfbu IchkönnteesaufDeutschschreibenundkeineKommasoderLeerzeichenhaben. [German, run together: "I could write it in German and have no commas or spaces."] – David Carlisle May 03 '18 at 13:33
@DavidCarlisle thanks for sharing this example. I see what you mean and I agree this is good sense. However, when this information is output to PDF, don't we lose some information by using the Unicode character for the standard 2 in the context of an exponent? When a machine comes along and tries to read this PDF it may be very confused ; ) – Robert Browder May 04 '18 at 20:03
@RobertBrowder machines reading math from the PDF can (and do) infer the superscript nature from the coordinates at which the characters are placed relative to the base. This is far more general (but harder, of course) than relying on superscript characters, which really only make sense for the simplest of exponents on a single-character base. – David Carlisle May 04 '18 at 21:48
@DavidCarlisle I must agree that optical character recognition is pretty good these days and getting better. But, not every user or process has the wherewithal to incorporate OCR. Suppose the case of an institutional repository (IR) with a SOLR indexer. The document life cycle might be something like this: LaTeX > PDF > IR > SOLR > researcher. So, documents that use exponents in the title may not index correctly and thus, be difficult to find later. Perhaps adding an OCR step into the ingestion process of the IR is the best we can currently do? – Robert Browder May 08 '18 at 20:40
@RobertBrowder I didn't mean OCR actually (although that's also a possibility); I meant reading the coordinates out of the PDF. The PDF "should" be tagged so the math is all marked up as MathML. One day we'll get there. – David Carlisle May 08 '18 at 20:57
