interpret \UXXXXXXXX Unicode codes in text file

Question

I have the following file:

$ cat test
Villes visit\U000000e9es

How can I interpret those \UXXXXXXXX codes, i.e. how can I get:

$ cat test | pipe into something
Villes visitées

That's not a UTF-8 code, that's a C-style representation of a character by its Unicode codepoint. Maybe you mean that you want to replace that \U000000e9 with the UTF-8 encoding of the U+00E9 character? — Stéphane Chazelas, Sep 05 '22 at 09:01
Yes, that's what I want, replace the C-style \U000000e9 by é — ChennyStar, Sep 05 '22 at 09:04
The question is whether you want é encoded as per the user's locale (for instance as 0xe9 if the locale uses ISO8859-1), or as UTF-8 (0xc3 0xa9) regardless of the user's locale. — Stéphane Chazelas, Sep 05 '22 at 09:06
I think it's by the user's locale. It's a file that's produced by someone else, thus I don't have all the details. But it's produced using C preprocessor (cpp), something like (in its simplest form) $ echo "villes visitées" | cpp -P, which prints out villes visit\U000000e9es. — ChennyStar, Sep 05 '22 at 09:20
To clarify, \U000000e9 is unambiguous. That's the LATIN SMALL LETTER E WITH ACUTE character. The question is more about how you want to represent it, so more about how the output will be used. — Stéphane Chazelas, Sep 05 '22 at 09:34

Stéphane Chazelas · Accepted Answer · 2022-09-05T19:56:10.693

With perl:

$ perl -C -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile
Villes visitées

Assuming the locale uses UTF-8 as its charmap¹, that would convert \UXXXXXXXX to the UTF-8 encoding of the U+XXXXXXXX character. To get UTF-8 output regardless of the user's locale, change the -C to -CO.
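
For readers without perl at hand, the same substitution can be sketched in Python. This is an illustrative equivalent, not part of the original answer; decode_big_u is a hypothetical helper name, and the result is encoded to UTF-8 explicitly rather than following the locale.

```python
import re

# A backslash, an uppercase U, then exactly eight hexadecimal digits.
BIG_U = re.compile(r"\\U([0-9A-Fa-f]{8})")

def decode_big_u(text):
    # Replace each \UXXXXXXXX sequence with the character at that code point.
    # chr() raises ValueError for values above 0x10FFFF.
    return BIG_U.sub(lambda m: chr(int(m.group(1), 16)), text)

if __name__ == "__main__":
    # Show the UTF-8 bytes so the result does not depend on the terminal.
    print(decode_big_u(r"Villes visit\U000000e9es").encode("utf-8"))
```

The é comes out as the two bytes 0xc3 0xa9, which is what perl -CO would write.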

To convert it to the é character in the correct encoding for the user's locale (assuming there's such a character in the user's locale charset):

perl -Mopen=locale -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile

So for instance, in a fr_CH.iso88591 locale, that would convert it to the 0xe9 byte (the encoding of é in ISO8859-1), while in a zh_HK.big5hkscs locale that would convert it to 0x88 0x6d (its encoding in BIG5-HKSCS), and in a fr_FR.UTF-8 locale to 0xc3 0xa9 (its UTF-8 encoding). In an ar_AE.iso88596 locale, as ISO8859-6 doesn't have an é character, you'd get Villes visit\x{00e9}es.
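
To see why the output bytes depend on the charmap, here is a small Python sketch that encodes the single code point U+00E9 in a few charsets. The helper name encode_hex is hypothetical, and note one difference from perl: Python's strict encoder raises an error for an unrepresentable character instead of emitting a \x{00e9} placeholder.

```python
E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

def encode_hex(char, charmap):
    # Return the encoded bytes as a hex string, or None when the
    # charmap has no representation for the character.
    try:
        return char.encode(charmap).hex()
    except UnicodeEncodeError:
        return None

for charmap in ("iso8859-1", "utf-8", "iso8859-6"):
    print(charmap, encode_hex(E_ACUTE, charmap))
```

The same code point gives one byte in ISO8859-1, two in UTF-8, and nothing at all in ISO8859-6.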

Or you could use ICU's uconv (in the icu-devtools package on Debian-based systems) to apply the Hex/C-Any transform:

uconv -x hex/c-any <your-file

It understands \uXXXX and \UXXXXXXXX sequences (more if you use hex-any) and outputs in UTF-8. Pipe to iconv -f utf-8 to get the output in the user's locale (see also iconv's -c option to skip characters that can't be encoded).

$ printf '%s\n' '&#233; &#xe9; \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex/c-any
&#233; &#xe9; \x{e9} U+00E9 é é 🐧 🐧
$ printf '%s\n' '&#233; &#xe9; \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex-any
é é é é é é 🐧 🐧

(Both also recognise Java-style surrogate pairs, though those shouldn't occur in your output if it's from cpp -P.)
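
As background on those surrogate pairs, this Python sketch shows the standard UTF-16 arithmetic that turns a high/low pair such as \ud83d\udc27 into one code point. The function name combine_surrogates is hypothetical; it illustrates the arithmetic and says nothing about how uconv implements it.

```python
def combine_surrogates(high, low):
    # High surrogates span 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each half carries 10 bits of the code point's offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

if __name__ == "__main__":
    print(hex(combine_surrogates(0xD83D, 0xDC27)))
```

For the pair in the example above this yields 0x1f427, the same code point as \U0001F427.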

For the perl one to understand both \uXXXX and \UXXXXXXXX like uconv's hex/c-any does, change the perl code to:

s/(?|\\u([[:xdigit:]]{4})|\\U([[:xdigit:]]{8}))/chr hex$1/ge
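
The (?|...) branch-reset group lets both alternatives capture into $1. Python's standard re module has no branch reset, so an equivalent sketch needs two groups and picks whichever matched. The name decode_u_escapes is hypothetical. Like the perl version, it does not join surrogate pairs.

```python
import re

# Either \u plus four hex digits, or \U plus eight hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_u_escapes(text):
    # Exactly one of the two groups is set for any given match.
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

if __name__ == "__main__":
    decoded = decode_u_escapes(r"\u00e9 \U000000e9 \U0001F427")
    print([hex(ord(c)) for c in decoded])
```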

zsh's print builtin also understands those \uXXXX and \UXXXXXXXX (does not require all the 4/8 digits) and many more, so you could also do:

print -- "$(<your-file)"

You'll get an error if there are characters not present in the locale's charmap.

Some printf implementations also support them for their %b format directive:

printf '%b\n' "$(cat <your-file)"

Like zsh's print, it supports more than just the \u/\U ones, at least the \n/\b/\r..., the \0ooo ones and often more like \xHH.
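
The side effect of such broad decoders is that every other backslash sequence in the file is expanded too. Python's unicode_escape codec behaves similarly, so it can serve as a rough sketch of the effect. The name expand_all_escapes is hypothetical, and the sketch assumes pure-ASCII input because that codec treats any non-ASCII byte as Latin-1.

```python
def expand_all_escapes(ascii_text):
    # Expands \n, \t, \xHH, \uXXXX, \UXXXXXXXX and others in one pass.
    # Raises UnicodeEncodeError if the input is not pure ASCII.
    return ascii_text.encode("ascii").decode("unicode_escape")

if __name__ == "__main__":
    # The literal backslash-t becomes a real tab, wanted or not.
    print(repr(expand_all_escapes(r"a\tb \x41 \U000000e9")))
```

If the file may contain literal backslashes that should be left alone, the targeted \u/\U substitution is the safer choice.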

¹ See output of the locale charmap command; in locales that use other charmaps, what you get is mostly useless in your case. If all the code points on a line are in the 0x0 .. 0xff range, you get the ISO8859-1 encoding (the byte value of the code point), and if not (if there's at least one code point above 0xff in the line), the UTF-8 encoding (and some warnings about it).

There are a lot of transliterations available in uconv as may be seen by executing uconv -L but there is no description or any detail in the uconv man page about such transliterations. Where are they explained ? — QuartzCristal, Sep 05 '22 at 09:53
@QuartzCristal, deep in the ICU documentation. I'm always dreading getting there. — Stéphane Chazelas, Sep 05 '22 at 09:55
This could be of help: https://util.unicode.org/UnicodeJsps/transform.jsp?a=Latin&b=%CE%B4%CE%B9%CE%B1%CF%86%CE%BF%CF%81%CE%B5%CF%84%CE%B9%CE%BA%CE%BF%CF%8D%CF%82&show=on — QuartzCristal, Sep 05 '22 at 11:46
Perhaps we should use Any-Hex/C, as the (default) Any-Hex/Java uses surrogate code points for characters outside plane 0. Yes, Hex-Any converts hex codes for other planes, but the reverse (Any-Hex) doesn't. — QuartzCristal, Sep 05 '22 at 11:49
Yes, printf %b like print will expand more than just the \u/\Us. — Stéphane Chazelas, Sep 05 '22 at 12:15
@QuartzCristal, I ended up switching to hex/c-any as hex-any handles all the escapes as generated by all the any-hex/* — Stéphane Chazelas, Sep 05 '22 at 16:28
