Multibyte StrLen? (StrLen for Chinese characters)

Question

I am currently using \StrLen{#1} inside my \newcommand. This works flawlessly for any common string written in the Latin alphabet.

"Hello", for example, has a string length of 5. The problem is with Chinese characters: the string length of "容容" is 8, which is technically correct, but I wasn't able to find a multibyte alternative to \StrLen that would return 2.

Note: I am using pdflatex.

Regards, Jan

Welcome to TeX.SX! Please help us help you and add a minimal working example (MWE) that illustrates your problem. Reproducing the problem and finding out what the issue is will be much easier when we see compilable code, starting with \documentclass{...} and ending with \end{document}. — Skillmon, Mar 08 '18 at 22:46
If you used LuaTeX or XeTeX, each Unicode character would be a single token; otherwise you need to count character tokens, ignoring any from hex 80 up to, but not including, hex C0 (the UTF-8 continuation bytes). — David Carlisle, Mar 08 '18 at 22:50
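
The Unicode-engine route mentioned in the comment above can be tried with a minimal sketch like the following. It assumes that the \StrLen in the question comes from the xstring package (the question does not say so) and that the file is compiled with lualatex or xelatex rather than pdflatex. Only digits are typeset, so no CJK font should be needed.

```latex
% Compile with lualatex or xelatex, not pdflatex.
% Assumption: \StrLen is the one provided by the xstring package.
\documentclass{article}
\usepackage{xstring}
\begin{document}
% Under a Unicode engine each character is one token,
% so the two results should be 5 and 2.
\StrLen{Hello}, \StrLen{容容}
\end{document}
```

This sidesteps byte counting entirely, at the cost of switching engines, which may not be an option if the rest of the document depends on pdflatex.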

score 8 · Accepted Answer · answered Mar 08 '18 at 23:19


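You can count the UTF-8 start bytes: every character has exactly one byte that is either below hex 80 (ASCII) or above hex BF (the lead byte of a multibyte sequence), and any continuation bytes fall in between. So, for example: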

\documentclass{article}
\usepackage[utf8]{inputenc}

\makeatletter
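% \zz{<text>}: expands to the number of UTF-8 characters in <text>.
% The count starts at 0 and \relax marks the end of the input.
\def\zz#1{\zzz0#1\relax}
% \zzz{<count>}<token>: add 1 for an ASCII byte (< "80) or a UTF-8
% start byte (> "BF), add 0 for a continuation byte ("80 to "BF),
% then recurse on the next token until \relax is reached.
\def\zzz#1#2{%
\ifx\relax#2 \the\numexpr#1\relax
\else
\expandafter\zzz\expandafter{%
  \the\numexpr(#1+\ifnum\expandafter`\string#2<"80 1\else \ifnum\expandafter`\string#2>"BF 1 \else 0 \fi\fi
  \expandafter)\expandafter\relax\expandafter}%
\fi}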
\begin{document}

\zz{容容}

\zz{abc}

\zz{¢Àïα}

\end{document}
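
Since the question uses the length inside a \newcommand, here is a hedged sketch of how the count might be consumed there. It relies on \zz working purely by expansion, so its result can be fed straight to \ifnum. The macro name \LongOrShort and the threshold 2 are made up for this illustration and are not part of the answer.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}

% \zz and \zzz are copied unchanged from the answer.
\def\zz#1{\zzz0#1\relax}
\def\zzz#1#2{%
\ifx\relax#2 \the\numexpr#1\relax
\else
\expandafter\zzz\expandafter{%
  \the\numexpr(#1+\ifnum\expandafter`\string#2<"80 1\else \ifnum\expandafter`\string#2>"BF 1 \else 0 \fi\fi
  \expandafter)\expandafter\relax\expandafter}%
\fi}

% Hypothetical helper: \zz expands to a number, so \ifnum can test it.
\newcommand\LongOrShort[1]{\ifnum\zz{#1}>2 long\else short\fi}

\begin{document}

\LongOrShort{容容}

\LongOrShort{abcd}

\end{document}
```

With pdflatex this should typeset "short" for the two-character string and "long" for the four-character one. Note that \zz picks up its input one token at a time as undelimited arguments, so spaces in the argument are skipped rather than counted and brace groups are not handled.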


David Carlisle


This is exactly the solution I need. Tested and it works! – Jan Vorisek Mar 09 '18 at 12:58
Note that this gives codepoints, not graphemes. These will be the same on PDFTeX (which does not support combining characters), but not necessarily in LuaTeX or XeTeX. – Davislor Dec 10 '20 at 21:29
@Davislor True, although that is what (some) definitions of string length for a Unicode string would expect. – David Carlisle Dec 10 '20 at 22:02
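
To illustrate the codepoint-versus-grapheme distinction raised in these comments, here is a small sketch for LuaLaTeX. It assumes a LuaTeX recent enough to ship Lua 5.3, which provides both utf8.len and the \u{...} string escape. It compares the precomposed letter U+00E9 with the letter e followed by the combining accent U+0301, which normally render as the same grapheme.

```latex
% Compile with lualatex (assumes Lua 5.3 or later inside LuaTeX).
\documentclass{article}
\begin{document}
% \string\u passes a literal \u through to Lua, where "\u{...}" is a
% Unicode escape. The first string holds one codepoint, the second two.
\directlua{tex.sprint(utf8.len("\string\u{E9}"))},
\directlua{tex.sprint(utf8.len("e\string\u{301}"))}
\end{document}
```

The two counts should come out as 1 and 2 even though a reader would see a single accented letter in both cases, which is the difference the comment points at.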

Mico · Answer 2 · 2020-12-10T15:20:19.407

1

Just for the sake of variety, here's a LuaLaTeX-based solution.

\documentclass{article}
\newcommand\zz[1]{\directlua{tex.sprint(utf8.len("#1"))}}
\begin{document}
\zz{Hello}, \zz{容容}, \zz{¢Àïα}
\end{document}

If your TeX distribution is quite old (say, at least four years old as of late 2020), replace utf8.len with unicode.utf8.len to get the code to run.
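
The two variants can also be folded into one definition. The following is only a sketch under stated assumptions: it picks utf8.len when the Lua utf8 library is present and otherwise falls back to unicode.utf8.len, and it additionally wraps the argument in \luaescapestring and \unexpanded so that quotes or backslashes in the text do not break the Lua string. That escaping step is an addition of this sketch, not part of the answer.

```latex
% Compile with lualatex.
\documentclass{article}
% Assumption: prefer the utf8 library, fall back to the older
% unicode.utf8 table when utf8 is not available.
\newcommand\zz[1]{\directlua{
  local len = utf8 and utf8.len or unicode.utf8.len;
  tex.sprint(len("\luaescapestring{\unexpanded{#1}}"))}}
\begin{document}
\zz{Hello}, \zz{容容}, \zz{¢Àïα}
\end{document}
```

This should give 5, 2 and 4, the same counts as the pdflatex approach in the accepted answer.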

edited Dec 10 '20 at 15:20

answered Dec 10 '20 at 15:11

Mico

