I have the following file:
$ cat test
Villes visit\U000000e9es
How can I interpret those \UXXXXXXXX codes? For example, how can I get:
$ cat test | pipe into something
Villes visitées
With perl:
$ perl -C -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile
Villes visitées
Assuming the locale uses UTF-8 as its charmap¹, that would convert \UXXXXXXXX to the UTF-8 encoding of the U+XXXXXXXX character. To get UTF-8 output regardless of the user's locale, change the -C to -CO.
To convert it to the é character in the correct encoding for the user's locale (assuming there's such a character in the user's locale charset):
perl -Mopen=locale -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile
So for instance, in a fr_CH.iso88591 locale, that would convert it to the 0xe9 byte (the encoding of é in ISO8859-1), while in a zh_HK.big5hkscs locale that would convert it to 0x88 0x6d (its encoding in BIG5-HKSCS), and to 0xc3 0xa9 in a fr_FR.UTF-8 locale (its UTF-8 encoding). In an ar_AE.iso88596 locale, as ISO8859-6 doesn't have an é character, you'd get Villes visit\x{00e9}es.
Or you could use ICU's uconv (in the icu-devtools package on Debian-based systems) to apply the Hex/C-Any transform:
uconv -x hex/c-any <your-file
It understands \uXXXX and \UXXXXXXXX sequences (more if you use hex-any) and outputs in UTF-8. Pipe to iconv -f utf-8 to get the output in the user's locale (see also iconv's -c option to skip characters that can't be encoded).
$ printf '%s\n' 'é é \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex/c-any
é é \x{e9} U+00E9 é é 🐧 🐧
$ printf '%s\n' 'é é \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex-any
é é é é é é 🐧 🐧
(Both also recognise Java-style surrogate pairs, though those shouldn't occur in your input if it comes from cpp -P.)
For the perl one to understand both \uXXXX and \UXXXXXXXX like uconv's hex/c-any does, change the perl code to:
s/(?|\\u([[:xdigit:]]{4})|\\U([[:xdigit:]]{8}))/chr hex$1/ge
zsh's print builtin also understands those \uXXXX and \UXXXXXXXX sequences (it does not require all 4 or 8 digits) and many more, so you could also do:
print -- "$(<your-file)"
You'll get an error if there are characters not present in the locale's charmap.
Some printf implementations also support them for their %b format directive:
printf '%b\n' "$(cat <your-file)"
Like zsh's print, it supports more than just the \u/\U ones, at least the \n/\b/\r..., the \0ooo ones and often more like \xHH.
¹ See the output of the locale charmap command; in locales that use other charmaps, what you get is mostly useless in your case. If all the code points on a line are in the 0x0 .. 0xff range, you get the ISO8859-1 encoding (the byte value of the code point), and if not (if there's at least one code point above 0xff in the line), the UTF-8 encoding (and some warnings about it).
uconv as may be seen by executing uconv -L but there is no description or any detail in the uconv man page about such transliterations. Where are they explained ?
– QuartzCristal
Sep 05 '22 at 09:53
Any-Hex/C as the (default) Any-Hex/Java use surrogate code points for characters not in the plane-0. Yes, Hex-Any converts hex codes for other planes, but the reverse (Any-Hex) doesn't.
– QuartzCristal
Sep 05 '22 at 11:49
printf %b like print will expand more than just the \u/\Us.
– Stéphane Chazelas
Sep 05 '22 at 12:15
hex/c-any as hex-any handles all the escapes as generated by all the any-hex/*
– Stéphane Chazelas
Sep 05 '22 at 16:28
\U000000e9 with it the UTF-8 encoding of the U+00E9 character?
– Stéphane Chazelas
Sep 05 '22 at 09:01
\U000000e9 by é
– ChennyStar
Sep 05 '22 at 09:04
é encoded as per the user's locale? Like as 0xe9 if the locale uses iso8859-1 or as UTF-8 (0xc3 0xa9) regardless of the user's locale?
– Stéphane Chazelas
Sep 05 '22 at 09:06
cpp), something like (in its simplest form) $ echo "villes visitées" | cpp -P, which prints out villes visit\U000000e9es.
– ChennyStar
Sep 05 '22 at 09:20
\U000000e9 is unambiguous. That's the LATIN SMALL LETTER E WITH ACUTE character. The question is more about how you want to represent it, so more about how the output will be used.
– Stéphane Chazelas
Sep 05 '22 at 09:34