How do I paste codepoints outside of the basic multilingual plane with accsupp or \pdfglyphtounicode?

Question

As described in the answers to this question, the accsupp package can be used to make symbols paste as arbitrary Unicode codepoints. For example, you might write something like the following so that the blackboard-bold symbols denoting natural numbers, booleans, and strings paste as the corresponding Unicode characters ℕ, 𝔹, 𝕊 (U+2115, U+1D539, U+1D54A) rather than as plain "N, B, S":

\documentclass{article}
\usepackage{amsfonts}
\usepackage{accsupp}
  \newcommand*{\setNat}{\BeginAccSupp{method=hex,unicode,ActualText=2115}\mathbb{N}\EndAccSupp{}}
  \newcommand*{\setBool}{\BeginAccSupp{method=hex,unicode,ActualText=1D539}\mathbb{B}\EndAccSupp{}}
  \newcommand*{\setStr}{\BeginAccSupp{method=hex,unicode,ActualText=1D54A}\mathbb{S}\EndAccSupp{}}

\begin{document}
\(\setNat, \setBool, \setStr\)
\end{document}

However, the latter two symbols paste incorrectly as ᵓ, ᵔ (U+1D53 and U+1D54): the last hex digit of each codepoint is dropped. This problem arises whenever a Unicode codepoint above U+FFFF is used. How can one fix this?

The same issue arises with \pdfglyphtounicode lines (with glyphtounicode.tex); see the question linked above for examples involving ordinary BMP characters. So I would also like to ask: how can one fix it there?

@HeikoOberdiek Just drawing your attention to this - might be useful to add this to the accsupp documentation. (Maybe all this is obvious to an expert already, but it wasn't to me, even though I know Unicode very well.) :-) — Lover of Structure, Oct 03 '12 at 06:34
The @-notation only works if the user has participated in the thread. I have only now seen the comment, by accident, because of the recent editing. — Heiko Oberdiek, Nov 18 '12 at 03:56
The new version of accsupp (currently here, installation hints inside until the next release of my bundle) adds a new method unichar, where the text can be given as a comma-separated list of Unicode code point numbers. Numbers outside the BMP are automatically converted to surrogate pairs. — Heiko Oberdiek, Nov 18 '12 at 03:59

Lover of Structure · Accepted Answer · 2012-11-11T05:22:26.717

The accsupp package requires "16-bit" Unicode strings. That is, for any character outside the Basic Multilingual Plane (BMP, plane 0), i.e. a character with a Unicode codepoint above U+FFFF, you need to convert the raw codepoint to its UTF-16 representation. A BMP codepoint of the form U+HHHH is identical to its UTF-16 representation. Higher codepoints, though, must be converted into a so-called surrogate pair (two 16-bit values) to obtain their UTF-16 representation. FileFormat.info, for example, lets you look up such values directly.
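If you would rather compute the surrogate pair than look it up, the arithmetic is short. The following Python sketch is only an illustration: the function name utf16_hex is made up for this answer and is not part of accsupp or pdfTeX. It subtracts 0x10000 from the codepoint, puts the upper 10 bits of the result into the high surrogate (base D800) and the lower 10 bits into the low surrogate (base DC00).

```python
def utf16_hex(codepoint: int) -> str:
    """Return the UTF-16 code units of one codepoint as space-separated hex.

    Hypothetical helper for illustration; not part of accsupp or pdfTeX.
    """
    # Reject values that are not valid Unicode scalar values.
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # BMP codepoints are their own UTF-16 representation.
    if codepoint <= 0xFFFF:
        return f"{codepoint:04X}"
    # Non-BMP: split the 20-bit offset into two 10-bit halves.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return f"{high:04X} {low:04X}"


if __name__ == "__main__":
    for cp in (0x2115, 0x1D539, 0x1D54A):
        print(f"U+{cp:04X} -> {utf16_hex(cp)}")
    # U+2115 -> 2115
    # U+1D539 -> D835 DD39
    # U+1D54A -> D835 DD4A
```

The printed strings are in the form that ActualText expects with method=hex,unicode.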

The codepoints U+1D539 (𝔹) and U+1D54A (𝕊) are D835 DD39 and D835 DD4A, respectively, in UTF-16, so the following code will work:

\documentclass{article}
\usepackage{amsfonts}
\usepackage{accsupp}
  \newcommand*{\setNat}{\BeginAccSupp{method=hex,unicode,ActualText=2115}\mathbb{N}\EndAccSupp{}}
  \newcommand*{\setBool}{\BeginAccSupp{method=hex,unicode,ActualText=D835 DD39}\mathbb{B}\EndAccSupp{}}
  \newcommand*{\setStr}{\BeginAccSupp{method=hex,unicode,ActualText=D835 DD4A}\mathbb{S}\EndAccSupp{}}

\begin{document}
\(\setNat, \setBool, \setStr\)
\end{document}

The \pdfglyphtounicode primitive (as used in glyphtounicode.tex) behaves the same: it requires UTF-16. That is, one needs to write \pdfglyphtounicode{somename}{D835 DD39} instead of \pdfglyphtounicode{somename}{1D539}.

Note that of course both accsupp (with method=hex,unicode) and \pdfglyphtounicode accept an arbitrarily long sequence of UTF-16 values.
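
For a longer sequence, one way to get the whole hex string at once is to let a UTF-16 codec do the work. This is a rough Python sketch, and the helper name actualtext_hex is invented here for illustration. It encodes the text as big-endian UTF-16 and prints each 16-bit value as four hex digits, separated by spaces as in the examples above.

```python
def actualtext_hex(text: str) -> str:
    """Return the UTF-16 values of a string as space-separated hex.

    Hypothetical helper for illustration; not part of accsupp or pdfTeX.
    """
    # Big-endian UTF-16 without a byte order mark: two bytes per 16-bit value.
    data = text.encode("utf-16-be")
    units = [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
    return " ".join(units)


if __name__ == "__main__":
    # U+2115 followed by U+1D539, written as escapes to stay ASCII-safe.
    print(actualtext_hex("\u2115\U0001D539"))  # 2115 D835 DD39
```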
