latex3 — issues with cyrillic

Question

For some reason the following code

\documentclass{article}
\usepackage[english,bulgarian]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}
\begin{document}
Здравей
\ExplSyntaxOn
\tl_set:Nx \l_tmpa_tl {Здравей}
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\ExplSyntaxOff
\end{document}

leads to a bunch of errors.

The problem appears to be with \tl_map_inline:Nn, since storing and using a token list containing Cyrillic works fine.

Note that I'm using TeX Live 2022; on TeX Live 2019 the code above works perfectly.

How can I fix this right now on my TeX Live 2022 setup?

Using x-expansion on arbitrary text is never a good idea. Apart from this: UTF-8 chars are now protected. Use \tl_show:N \l_tmpa_tl to see the difference. – Ulrike Fischer, Aug 03 '22 at 14:11
@DavidCarlisle We Germans like long, posh command names without spaces ;-) – Ulrike Fischer, Aug 03 '22 at 14:15

David Carlisle · Accepted Answer · 2022-08-03T17:07:59.347

4

If you use

\documentclass{article}
\usepackage[english,bulgarian]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}
\begin{document}
Здравей
\ExplSyntaxOn
\tl_set:Nx \l_tmpa_tl {Здравей}
\show\l_tmpa_tl
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\ExplSyntaxOff
\end{document}

You will see the current release defines

> \l_tmpa_tl=macro:
->Здравей.

older releases

> \l_tmpa_tl=macro:
->\T2A\CYRZ \T2A\cyrd \T2A\cyrr \T2A\cyra \T2A\cyrv \T2A\cyre \T2A\cyrishrt .
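
The same difference should show up in the item count, which is what \tl_map_inline:Nn iterates over. The following is only a sketch built on the example above, using \int_show:n and \tl_count:N to print the count in the terminal. Under pdfTeX on a current release one would expect 14 (seven characters, two UTF-8 bytes each), and 7 on a release where the x-expansion yields one \T2A\cyr... command per character.

\documentclass{article}
\usepackage[english,bulgarian]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}
\begin{document}
\ExplSyntaxOn
\tl_set:Nx \l_tmpa_tl {Здравей}
% Show how many items \tl_map_inline:Nn would step through.
\int_show:n { \tl_count:N \l_tmpa_tl }
\ExplSyntaxOff
\end{document}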

The new behaviour is usually preferable. What is the actual use case here? There is probably a way to achieve it.

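A quick fix, valid only for text made up entirely of two-byte UTF-8 characters, is to keep each byte pair in a brace group. The resulting token list and the code that produces it: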

> \l_tmpa_tl=macro:
->{З}{д}{р}{а}{в}{е}{й}.

\documentclass{article}
\usepackage[english,bulgarian]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}
\begin{document}
Здравей
\ExplSyntaxOn
\def\uviii#1#2{\ifx\relax#1\else{#1#2}\expandafter\uviii\fi}
\tl_set:Nx \l_tmpa_tl {\uviii Здравей\relax\relax}
%\show\l_tmpa_tl
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\ExplSyntaxOff
\end{document}

The following variant also copes with ASCII mixed into the Cyrillic: bytes below 128 are passed through as single tokens, and any other byte is still assumed to start a two-byte sequence.

\documentclass{article}
\usepackage[english,bulgarian]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}

\begin{document}

Здравей

\ExplSyntaxOn
\def\uviii#1{\ifx\relax#1\else
\ifnum\expandafter`\string#1<128~
\expandafter\expandafter\expandafter\uviiia\else
\expandafter\expandafter\expandafter\uviiib
\fi
\fi
#1}
\def\uviiia#1{#1\uviii}
\def\uviiib#1#2{{#1#2}\uviii}
\tl_set:Nx \l_tmpa_tl {\uviii Здравейabc\relax}
%\show\l_tmpa_tl
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\ExplSyntaxOff

\end{document}

edited Aug 03 '22 at 17:07

answered Aug 03 '22 at 14:12

I consider this a bug. Moreover, it seems that something goes wrong with the encoding when expl3 parses Cyrillic, as if the packages that support it were ignored by latex3. Try using the str type instead of tl: it will work, but yields this: https://i.stack.imgur.com/4XWS2.png. I have seen such random sets of letters many times, and they have always been caused by encoding problems. – antshar Aug 03 '22 at 15:41
The actual use case is spread lettering, from my old question. Fortunately, another approach presented there still works, so I will go with that, but the problem with tl really has to be fixed inside latex3. – antshar Aug 03 '22 at 15:43
@antshar It is not a bug so much as a change. \tl_map_inline:Nn has always iterated over pdfTeX tokens, so it would break multi-byte encodings such as UTF-8. The fact that you previously got one token per character after the x-expansion for that specific text was luck, not design (it is not true in general, e.g. not for French or other accented Latin text, where this could never have worked). – David Carlisle Aug 03 '22 at 15:51
@antshar That said we will probably expose a new text_... version of the iterator that steps along utf-8 character groups not tokens, which would be what you want here. (Such loops exist in the l3 upper/lower case code, but are not in a form callable from elsewhere currently) – David Carlisle Aug 03 '22 at 15:51
In addition to the issues mentioned, the \w regex specifier for matching letters doesn't work either. – antshar Aug 03 '22 at 16:33
@antshar if you want utf8 chars to be "letters", use an unicode engine, with pdftex you will have to accept that they often consist of more than one byte and of commands. – Ulrike Fischer Aug 03 '22 at 16:38
Btw, your workaround with \uviii works with Cyrillic only; if the text contains Latin letters, it doesn't work as expected. – antshar Aug 03 '22 at 16:38
@UlrikeFischer Why did someone decide to give up the old construction with \cyr• in favor of actual Unicode characters? – antshar Aug 03 '22 at 16:41
@antshar yes as I said, that version is specific to two-byte text, a full version would need to detect 1,2, 3 or 4 byte ranges, but not here. Your original version in older releases would essentially just work for a limited range of cyrillic as well. – David Carlisle Aug 03 '22 at 16:44
the \cyr... commands are still used, to specify typesetting, but utf-8 is held as characters for longer, not least so you can have filenames or \labels with cyrillic where Здравей works but \T2A\CYRZ \T2A\cyrd \T2A\cyrr \T2A\cyra \T2A\cyrv \T2A\cyre \T2A\cyrishrt does not. @antshar – David Carlisle Aug 03 '22 at 16:46
Is there anything I can do temporarily, as a quick fix, to obtain the old behavior for both Latin and Cyrillic letters? – antshar Aug 03 '22 at 16:52
@antshar added. – David Carlisle Aug 03 '22 at 17:08
@antshar why the bug report???? As I explained the change is to introduce much more consistent handling of non latin letters, and your earlier code only worked by accident over a very limited set of letters. In earlier releases most characters would fail in an x argument and tl_map maps over tokens so would break any characters using more than one token. It happened to work for the text in your example, but the new utf support is far more complete and consistent. – David Carlisle Aug 03 '22 at 18:55
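
Ulrike Fischer's suggestion above to use a Unicode engine can be sketched as follows. This is a minimal example, not taken from the thread: it assumes compilation with lualatex or xelatex, and the font name is an assumption (any installed font with Cyrillic glyphs should do). In these engines each character is a single token, so \tl_map_inline:Nn steps through characters without any x-expansion or byte grouping.

% Compile with lualatex or xelatex, not pdflatex.
\documentclass{article}
\usepackage{fontspec}
% Assumption: CMU Serif is installed; substitute any font covering Cyrillic.
\setmainfont{CMU Serif}
\begin{document}
\ExplSyntaxOn
% Each character, Cyrillic or Latin, is already one token here.
\tl_set:Nn \l_tmpa_tl { Здравейabc }
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\ExplSyntaxOff
\end{document}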

score 4 · Answer 2 · answered Aug 03 '22 at 19:24

The other answers cover why this is happening. It is likely that the team will add a 'text mapping' function soon. As a model in the meantime:

\documentclass{article}
\usepackage[T2A]{fontenc}
\ExplSyntaxOn
\cs_new:Npn \text_map_function:nN #1#2
  { \exp_args:Ne \__text_map_function:nN { \text_expand:n {#1} } #2 }
\cs_new:Npn \__text_map_function:nN #1#2
  {
    \__text_map_function:Nw #2 #1
      \q__text_recursion_tail \q__text_recursion_stop
    \prg_break_point:Nn \text_map_break: { }
  }
\bool_lazy_or:nnTF
  { \sys_if_engine_luatex_p: }
  { \sys_if_engine_xetex_p: }
  {
    \cs_new:Npn \__text_map_function:Nw #1#2
      {
        \__text_if_recursion_tail_stop_do:Nn #2 { \text_map_break: }
        #1 {#2}
        \__text_map_function:Nw #1
      }
  }
  {
    \cs_new:Npn \__text_map_function:Nw #1#2
      {
        \__text_if_recursion_tail_stop_do:Nn #2 { \text_map_break: }
        \bool_lazy_and:nnTF
          { \tl_if_single_token_p:n {#2} }
          { ! \token_if_cs_p:N #2 }
          {
            \int_compare:nNnTF { `#2 } > { "80 }
              {
                \int_compare:nNnTF { `#2 } < { "E0 }
                  { \__text_map_function:NNN }
                  {
                     \int_compare:nNnTF { `#2 } < { "F0 }
                       { \__text_map_function:NNNN }
                       { \__text_map_function:NNNNN }
                  }
              }
              { \__text_map_function:Nn }
                #1 #2
          }
          { \__text_map_function:Nn #1 {#2} }
      }
    \cs_new:Npn \__text_map_function:NNN #1#2#3
      { \__text_map_function:Nn #1 {#2#3} }
    \cs_new:Npn \__text_map_function:NNNN #1#2#3#4
      { \__text_map_function:Nn #1 {#2#3#4} }
    \cs_new:Npn \__text_map_function:NNNNN #1#2#3#4#5
      { \__text_map_function:Nn #1 {#2#3#4#5} }
    \cs_new:Npn \__text_map_function:Nn #1#2
      {
        #1 {#2}
        \__text_map_function:Nw #1
      }
  }
\cs_new:Npn \text_map_break:
  { \prg_map_break:Nn \text_map_break: { } }
\cs_new:Npn \text_map_break:n
  { \prg_map_break:Nn \text_map_break: }
\cs_new_protected:Npn \text_map_inline:nn #1#2
  {
    \int_gincr:N \g__kernel_prg_map_int
    \cs_gset_protected:cpn
      { __text_map_ \int_use:N \g__kernel_prg_map_int :w } ##1 {#2}
    \exp_args:Nnc \text_map_function:nN {#1}
      { __text_map_ \int_use:N \g__kernel_prg_map_int :w }
    \prg_break_point:Nn \text_map_break:
      { \int_gdecr:N \g__kernel_prg_map_int }
  }
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\text_map_inline:nn { Здравей } { #1 ~ }
\ExplSyntaxOff
\end{document}
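
As a usage sketch for the spread-lettering use case mentioned in the comments, the mapping can insert a small fixed space after every character. This fragment assumes \text_map_inline:nn is available, either from the definitions above pasted into the preamble or from an l3kernel recent enough to ship it (in that case the \cs_new:Npn lines above would clash with the kernel's own definitions and should be dropped). The 0.15em value is arbitrary.

\documentclass{article}
\usepackage[T2A]{fontenc}
% Assumption: \text_map_inline:nn is defined, either by the code above
% or by the installed l3kernel.
\begin{document}
\ExplSyntaxOn
% Add a fixed horizontal skip after each character, Cyrillic or Latin.
\text_map_inline:nn { Здравейabc } { #1 \skip_horizontal:n { 0.15em } }
\ExplSyntaxOff
\end{document}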

Looking forward to seeing this function being shipped with latex3. Thank you for writing it down for my question. — antshar, Aug 03 '22 at 22:27

score 1 · Answer 3 · answered Aug 03 '22 at 17:04

hyperref has to unprotect the utf8 chars locally to write the bookmarks. You can copy its code:

\documentclass{article}
\usepackage[T2A]{fontenc}
\begin{document}
\makeatletter
\def\antshar@expand@utfvii{%
    \count@"C2
    \@tempcnta"F5
    \def\UTFviii@tmp{\expandafter\def\expandafter~\expandafter{~}}%
    \UTFviii@loop
  }
\ExplSyntaxOn
\group_begin:
\antshar@expand@utfvii
\tl_set:Nx \l_tmpa_tl {Здравей}
\tl_map_inline:Nn \l_tmpa_tl { #1 ~ }
\group_end:
\ExplSyntaxOff
\end{document}

Looks really neat! I keep delighting in hyperref's clever solutions that stay under the hood; some of these approaches can actually be useful as standalone solutions. – antshar, Aug 03 '22 at 22:30
