
I’d like to use \DeclareUnicodeCharacter to map Unicode characters, given in decimal or hexadecimal form, to alternative expressions or graphics that should replace them. For example:

\DeclareUnicodeCharacter{014F}{\u{o}}

While this specific example works fine, a visible character need not be a single code point: it can also be a grapheme cluster, i.e. a sequence of multiple code points that forms one unit and one visible character. Example:

U+006E U+0303 = ñ (sometimes, there are equivalents like U+00F1)

It seems the command \DeclareUnicodeCharacter comes from the inputenc package and supports values between 0 and 10FFFF only, which is enough for single code points, but may not provide any means for composed grapheme clusters. But when using XeLaTeX, the implementation does not come from inputenc, right?

So with inputenc or with a “native” implementation, is there any way to map grapheme clusters instead of just single code points? For example:

\DeclareUnicodeCharacter{006E0303}{...}
% or
\DeclareUnicodeCharacter{006E,0303}{...}

Edit:

The use case is something like \DeclareUnicodeCharacter or \newunicodechar (perhaps without a complete extra package), but for units of multiple code points instead of just single code points, in order to create custom mappings.

It seems TECkit mappings, referenced in the Mapping attribute of fontspec, may provide exactly this functionality, including mapping multiple code points (Edit: but apparently only to “plain text”, not to commands). However, this is not elegant, is not contained in the same source file, and requires separate tooling.

There’s also \XeTeXinterchartoks, but this doesn’t really make definitions easy to write, especially for multiple individual grapheme clusters (as opposed to character blocks).

caw
  • Please add an example document to clarify this question. \DeclareUnicodeCharacter is not defined in xelatex, and inputenc is not usable with xelatex, both are for classic 8-bit TeX systems such as pdflatex. – David Carlisle Jan 06 '22 at 07:45
  • Unicode combining characters and other text shaping clusters should work in xelatex via the harfbuzz opentype font handling, they don't need declarations at the tex macro level. – David Carlisle Jan 06 '22 at 07:49
  • Yes it's possible, but are you sure you want to do serious programming in TeX the programming language? Check out how unicode-math scans forward for superscript characters, or https://tex.stackexchange.com/a/310035/250119 for example of similar tasks that can be done. Better to just do \def\ntilde{...} and/or do regex search/replace in your editor. – user202729 Jan 06 '22 at 10:10
  • @DavidCarlisle Thanks, and sorry for the incomplete explanations. I know that I don’t need to define individual Unicode characters (because XeLaTeX supports them natively), but the point is that I want to map them (to something else). Since \DeclareUnicodeCharacter would only lack support for pairs of code points for this use case (even though not in XeLaTeX), I thought there would be something similar that allows me to create custom mappings. – caw Jan 06 '22 at 14:13
  • @user202729 Thank you! I definitely do not want to do serious programming in TeX, but I thought once I understood the mechanism to define these mappings, it would only be a few lines that I could copy for each individual mapping. I have been looking into the source of ucharclasses, since that must be doing something similar, and it may be XeTeXinterchartoks. – caw Jan 06 '22 at 14:19
  • No, I mean you say \DeclareUnicodeCharacter works, e.g. \DeclareUnicodeCharacter{014F}{\u{o}}. Please show what you are doing, as that should give errors with xelatex that the command is undefined. That is, I stopped understanding the question at "While this specific example works fine" – David Carlisle Jan 06 '22 at 15:32
  • @DavidCarlisle I’m sorry, I was talking about LaTeX generally, not XeLaTeX specifically there, and did not expect this command to be specific to some engines. That was stupid. I only ran it in a MWE online and may have let it compile using pdfLaTeX. So it does not actually work in XeLaTeX, but all the rest remains: I would like something that works like this – and lets me map sequences of code points instead of just single code points. – caw Jan 07 '22 at 05:55
  • you may want to change the title as the answer to the question asked in the title is just "No" – David Carlisle Jan 07 '22 at 09:29
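
To make the engine distinction from the comments above concrete, here is a minimal sketch of a single-code-point mapping that should compile under both pdfLaTeX and XeLaTeX. It assumes the iftex and newunicodechar packages are installed, and the \fbox replacement is only a placeholder for whatever the character should map to. Neither branch accepts a sequence of code points, which is the limitation the question is about.

```latex
% Sketch: map the single code point U+014F under either engine.
\documentclass{article}
\usepackage{iftex}
\ifPDFTeX
  % 8-bit engine: the LaTeX kernel (or inputenc) provides this command.
  \DeclareUnicodeCharacter{014F}{\fbox{o}}
\else
  % Unicode engines (XeLaTeX, LuaLaTeX): newunicodechar makes the
  % character active and gives it the definition in the second argument.
  \usepackage{fontspec}
  \usepackage{newunicodechar}
  \newunicodechar{ŏ}{\fbox{o}}
\fi
\begin{document}
Before ŏ after.
\end{document}
```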

3 Answers


You need nothing of that kind and it works out of the box.

Here's an example with plain XeTeX, in order to show that nothing at all is needed. In the first call I use n followed by U+0303 COMBINING TILDE, in the second call I use directly ñ.

%%% print the Unicode point of the given string and the result
\def\test#1{%
  \unicodestring#1\relax
  --- #1
}

\font\testfont="[Junicode.ttf]:mapping=tex-text"

\def\unicodestring#1#2{%
  (\the\numexpr`#1\relax)\space
  \ifx#2\relax\else (\the\numexpr`#2\relax)\space \fi
}

\testfont

\test{n^^^^0303}% n followed by U+0303 COMBINING TILDE

\test{ñ}

\bye


egreg
  • Thank you! If I’m in fact interested in keeping the possibility to define mappings, and to work with the code point values, the visual output here alone suggests that this is possible, right? So can the \def\unicodestring block be used to create a replacement for \DeclareUnicodeCharacter where #1 and #2 give me access to the two individual code points, which \DeclareUnicodeCharacter cannot do? – caw Jan 06 '22 at 13:31
  • @caw The point is that you don't need to access the two characters, why would you? – egreg Jan 06 '22 at 13:50
  • Sorry, I know that Xe(La)TeX has great Unicode support and lets me use Unicode input directly. But as I wrote in the first sentence of the question, I’d like to “define mappings of Unicode characters […] [to] alternative expressions or graphics that should replace the Unicode characters”. So the sequence U+… U+… should be mapped to my company logo (as a graphic), or should be set to use a different font, or be replaced with a few LaTeX commands, etc. I know most people don’t need this, but since \DeclareUnicodeCharacter almost gets me there, I thought this was possible. – caw Jan 06 '22 at 14:09
  • @caw TeX looks at one token at a time. You might make n active, for instance, but this would break any macro name using n. – egreg Jan 06 '22 at 14:21
  • That’s a pity, thanks. But TeX does process ligatures, doesn’t it? (By the way, the package newunicodechar, which specifically works with Unicode engines as well, “proves” the use case, I guess. But that also works with single code points only, as far as I can see.) – caw Jan 06 '22 at 14:36
  • @caw Ligatures (and also combining characters in XeTeX) are dealt with much later in the processing, when the user no longer can control the process with macros (also those of \newunicodechar). You might do something with LuaTeX, possibly. – egreg Jan 06 '22 at 14:43
  • Thank you! That means I could use custom ligatures (if possible) to map Unicode sequences to other “plain text” (at a very late stage, no problem), but not to LaTeX commands (e.g. for switching fonts, including graphics), right? – caw Jan 06 '22 at 14:49
  • @caw Yes, to characters in the same font, via a mapping file with teckit. There are answers about this on the site. – egreg Jan 06 '22 at 15:08
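
As a rough illustration of the TECkit route mentioned in the last comment, the sketch below keeps the mapping source in the same file as the document by writing it out with filecontents*. The name my-clusters and the target U+00A7 are placeholders chosen for this example, and the rule syntax follows the style of the stock tex-text.map. The mapping still has to be compiled separately with teckit_compile, and the first XeLaTeX run will probably complain that the compiled mapping is missing.

```latex
% Sketch for xelatex. "my-clusters" is a hypothetical mapping name.
% Step 1: run xelatex once; this writes my-clusters.map.
% Step 2: run  teckit_compile my-clusters.map -o my-clusters.tec
% Step 3: run xelatex again so the font picks up the compiled mapping.
\begin{filecontents*}{my-clusters.map}
LHSName "my-clusters"
RHSName "UNICODE"
pass(Unicode)
U+006E U+0303 > U+00A7 ; n + combining tilde -> section sign
\end{filecontents*}
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Noto Serif}[Mapping=my-clusters]
\begin{document}
n^^^^0303 and plain n
\end{document}
```

As noted in the comments, the right-hand side of such a rule can only be characters in the same font, not macros or graphics.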

The mapping is the input that you type. (So you will need an input method. Simplest method is direct input.)

OpenType font files contain the rules for compound glyphs (ligatures). The font shaping engine (e.g., HarfBuzz when using XeLaTeX) applies the rules.

You do not need to re-invent the wheel.


MWE

\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\setmainfont{Noto Serif}
\newcommand\textnote[1]{{\color{blue}$\leftarrow$ #1}}
\begin{document}
\symbol{110} + \ \symbol{771} = \symbol{110}\symbol{771} : (U+006E U+0303)

\symbol{241} : (U+00F1)

a\symbol{771} b\symbol{771} c\symbol{771} d\symbol{771} e\symbol{771} \textnote{Combining Diacritical Mark}

ᲀᲁᲂᲃᲄᲅᲆᲇᲈ \textnote{Cyrillic Extended-C}

\end{document}


Added

For plain xetex, you can use ^^^^ notation if you have no keyboard overlay, OS language choice, character map, regex replace etc.


MWE

%xetex
\font\testfont="[NotoSerif-Regular.ttf]:mapping=tex-text"

\testfont

^^^^006e^^^^0303 using \space ^^^^0302 ^^^^0302 ^^^^0302 ^^^^0302 notation.

\bye

Cicada
  • Thank you! I know about Xe(La)TeX’s awesome Unicode support, but was specifically interested in having custom mappings, i.e. mappings of Unicode sequences to different output, e.g. simple commands, including graphics, other Unicode characters, etc. – caw Jan 06 '22 at 14:03
  • @caw Yes, you can create a teckit mapping file for xelatex to use. It can handle multi-stepped mapping passes and combinations (to output font glyphs). Alternatively, use regex find/replace expressions in expl3 syntax to output whatever you want (including code and commands). – Cicada Jan 06 '22 at 16:20
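
The expl3 regex route from the last comment could look roughly like the sketch below. It assumes a recent LaTeX kernel with expl3 and \NewDocumentCommand built in. The names \mapclusters and \l__demo_text_tl are made up for this example. The replacement only applies to text passed as the argument, not to the whole document.

```latex
% Sketch for xelatex or lualatex with a recent LaTeX kernel.
\documentclass{article}
\usepackage{fontspec}
\ExplSyntaxOn
\tl_new:N \l__demo_text_tl
% \mapclusters is a hypothetical helper: it stores its argument,
% rewrites it with a regex, then typesets the result.
\NewDocumentCommand \mapclusters { +m }
  {
    \tl_set:Nn \l__demo_text_tl {#1}
    % Replace every pair "n, U+0303" by \fbox{...} around the matched pair (\0).
    \regex_replace_all:nnN { n \x{0303} } { \c{fbox} \cB\{ \0 \cE\} } \l__demo_text_tl
    \tl_use:N \l__demo_text_tl
  }
\ExplSyntaxOff
\begin{document}
\mapclusters{noo n^^^^0303 abcn^^^^0303xyz oon}
\end{document}
```

Because the replacement text is built from \c{...} and category-code escapes, it can insert arbitrary commands, which a TECkit mapping cannot do. Further code point sequences would each need their own \regex_replace_all:nnN line.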

Your question is confusing, as \DeclareUnicodeCharacter is not defined for XeLaTeX.

However, if I understand the use case, it is possible (although I don't really recommend it) to use XeTeX character classes.

The code below detects just an n followed by a combining tilde and replaces the pair with an \fbox construct.


\documentclass{article}

\XeTeXinterchartokenstate=1

\newXeTeXintercharclass\nclass

\XeTeXcharclass `\n \nclass

\XeTeXinterchartoks 0 \nclass = {\ntest}
\XeTeXinterchartoks 4095 \nclass = {\ntest}

\def\ntest#1{\futurelet\next\ntestb}

\def\ntestb{%
  \ifx\next ^^^^0303%
    \fbox{An n-tilde combining pair}%
    \expandafter\eatnext
  \else
    {\XeTeXinterchartokenstate=0 n}%
  \fi}

\def\eatnext#1{}

\begin{document}

noo n^^^^0303 abcn^^^^0303xyz oon

\end{document}

David Carlisle
  • Thank you so much! You understood the use case perfectly right and this is exactly what I wanted – just not as elegant or concise as I might have hoped for. (And sorry for the confusion!) – caw Jan 07 '22 at 06:00