10

Is there a way to count the number of characters in a specified string?

Suppose I had the following code.

\documentclass{article}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has ? characters.\\}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

How would I make it display the correct character count (for example, 10 for "everything") without having to do a manual count?

qzx
  • 550
  • 3
  • 13

6 Answers

12

If your argument contains macros, the answer would need to change. Spaces count as characters, though that could be adjusted if you desired.

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.\\}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

Here's a version that does not count spaces.

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.\\
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

And if you wanted to count only alphabetic characters (ignoring numbers, spaces and punctuation)

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \alphabetic[q]{\thestring}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.\\
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

  • 3
    \StrLen from xstring is also an option. Although, I'm not sure which one is the better... – Pouya Mar 02 '15 at 11:41
  • @Pouya It is likely that the xstring answer is better, but since I wrote stringstrings, I go with what I know. – Steven B. Segletes Mar 02 '15 at 11:42
  • @pouya, I tried \StrLen from xstring and it also worked. Thanks. – qzx Mar 02 '15 at 12:32
  • The "basic" version of the \numchars macro, i.e., the one that returns 14 for "that's not it!", returns 5 rather than 3 when run on the UTF8-encoded string "öüß", if run under pdfLaTeX and with the inputenc package loaded with the option utf8. Oddly, 6 is returned if the inputenc package is not loaded. And, interestingly, it returns 3 when run under LuaLaTeX (and fontspec is loaded). – Mico Jul 15 '15 at 11:23
  • @Mico Absolutely true that there is no support for Unicode with this solution. – Steven B. Segletes Jul 19 '15 at 19:12
  • I wouldn't put it that negatively. There is full support for Unicode-encoded (more precisely, utf8-encoded) characters if the document is compiled with LuaLaTeX instead of pdfLaTeX. – Mico Jul 19 '15 at 23:18
3

Even though the OP has stated that he/she isn't interested in a LuaLaTeX-based solution, others may still value having such a solution. :-)

The following solution works with strings of UTF8-encoded characters. Because every ASCII-encoded string is also a valid UTF8-encoded string, the solution also works with ASCII-encoded strings.

% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for "\luastring" macro
\newcommand{\numchars}[1]{\noindent The string ``#1'' has 
    \directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))} 
    characters.\par}

\begin{document}
\numchars{everything}
\numchars{öüß}
\end{document}

Aside: If the Lua-side code inappropriately used the function string.len instead of unicode.utf8.len, the macro \numchars would report that öüß has 6 characters. This happens because each of the 3 characters in öüß is encoded using 2 bytes in the utf8 system. (The function string.len does a byte count rather than a direct character count; that's OK if each character is encoded using exactly 1 byte, which is the case for the ASCII encoding system, though not for most others.) Likewise, the string ø§¶®€œ¥√DZ would incorrectly be diagnosed as having 22 [!] rather than just 10 characters, as both € and √ are encoded using 3 bytes and the remaining 8 characters are encoded using 2 bytes each. Clearly, it's important to use the function unicode.utf8.len in the present context.
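To see the byte/character distinction directly, a minimal sketch along the following lines prints both counts side by side. The macro name \bytesandchars is made up for this illustration; everything else reuses the calls from the code above, and the document is assumed to be compiled with LuaLaTeX.

```latex
% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for "\luastring" macro
% \bytesandchars is a hypothetical name used only in this sketch
\newcommand{\bytesandchars}[1]{\noindent The string ``#1'' occupies
    \directlua{tex.sprint(string.len(\luastring{#1}))} bytes and has
    \directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))} characters.\par}

\begin{document}
\bytesandchars{weird}
\bytesandchars{öüß}
\end{document}
```

For öüß this should report 6 bytes and 3 characters, whereas for the pure-ASCII string weird both numbers are 5.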

Mico
  • 506,678
2

The command \newcommand{\numchars}[1]... works well, but I encountered some issues with \stringlength in the stringstrings package. It seems to have a limit of 500 characters, returning zero if the string is longer than that. For example, the code:

\documentclass[11pt]{amsart}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.\\}

\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tris- tique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pul- vinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}

Returns:

The command \StrLen in the xstring package seems to work better. The document:

\documentclass[11pt]{amsart}
\usepackage{xstring}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has {\StrLen{#1}} characters.\\}

\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tris- tique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pul- vinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}

Returns:
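If the length is needed for further processing rather than for immediate printing, xstring's optional trailing argument can store the result in a macro. The following is only a sketch: \lengthcheck and \mylen are names invented here, and the threshold of 10 is arbitrary.

```latex
\documentclass{article}
\usepackage{xstring}
% \lengthcheck and \mylen are hypothetical names used only in this sketch
\newcommand{\lengthcheck}[1]{%
  \StrLen{#1}[\mylen]% store the length in \mylen instead of typesetting it
  \noindent The string ``#1'' has \mylen\ characters%
  \ifnum\mylen>10 \space(more than ten)\fi.\par}
\begin{document}
\lengthcheck{everything}
\lengthcheck{that's not it!}
\end{document}
```

Because \mylen expands to plain digits, it can be used in tests such as \ifnum, which is not possible when \StrLen typesets its result directly.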

Cão
  • 129
  • 1
    Welcome to TeX.SX! Can you expand your answer with an example of how to use \StrLen? As it is now, it's more like a comment than an answer. – egreg Jul 14 '15 at 21:22
  • Welcome to TeX.SX! You can have a look at our starter guide to familiarize yourself further with our format. – Martin Schröder Jul 14 '15 at 22:24
  • If I run your \numchars macro under pdfLaTeX on the UTF8-encoded string "öüß", I get 5 (rather than 3) if the inputenc package is loaded with the option utf8, and I get 6 if the inputenc package isn't loaded. (Interestingly, I get 3 -- the correct number! -- if I run your \numchars macro under LuaLaTeX/fontspec.) Is there a way to fix up your macro so that it runs correctly under pdfLaTeX for UTF8-encoded strings? – Mico Jul 15 '15 at 11:31
2

Presumably there's an excellent reason not to use l3 syntax because egreg has not written an answer.

Also note: I do not know what I'm doing.


Caveat emptor...

The expl3 package is used to enable l3 syntax. (The LaTeX equivalent of the latest thing since sliced bread.) xparse is used to easily define a starred form of the command which excludes spaces from the character count.

Because l3 is oblivious to spaces by default, the extra work actually goes into the non-starred form of the command which converts all spaces to xs before doing the count.

Note that this solution will count accented characters as decomposed with pdfTeX. With Xe/LuaTeX, it works provided a font supporting the characters is used. Thanks to comments for discussion.

\ExplSyntaxOn  % enable l3 syntax
\tl_new:N \l_qzx_string_tl  % declare a local token list to hold qzx's string
\NewDocumentCommand \numchars { s m }{  % command optionally takes a star and requires a single argument
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 }  % set the token list to the string we've been fed
  \IfBooleanF { #1 }  % if there is no star
    {  % then replace all instances of a space (~ in l3 syntax) by instances of x
      \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
    }
  % the count of the characters in the token list goes straight into the stream to be typeset but we need to add the spaces we want here explicitly using ~
  \noindent The~string~``#2''~has~\tl_count:N \l_qzx_string_tl{}~characters.\par  % use \par rather than \\ to avoid complaints about bad boxes
  \group_end:
}
\ExplSyntaxOff% turn l3 syntax off so everything is back to normal and giraffes are giraffes once more

Complete code:

\documentclass{article}
\usepackage{expl3,xparse}
\ExplSyntaxOn
\tl_new:N \l_qzx_string_tl
\NewDocumentCommand \numchars { s m }{
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 }
  \IfBooleanF { #1 }
    {
      \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
    }
  \noindent The~string~``#2''~has~\tl_count:N \l_qzx_string_tl{}~characters.\par
  \group_end:
}
\ExplSyntaxOff
\begin{document}
  \numchars{everything}
  \numchars*{everything}
  \numchars{that's not it!}
  \numchars*{that's not it!}
  \numchars{weird}
  \numchars*{weird}
  \numchars{%
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.  Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
  \numchars*{%
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.  Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}

This will certainly break if fed wonderful things before breakfast, is likely to feel a little fragile prior to lunch and will probably need to spend the afternoon recovering before getting an early night.

As I said, caveat emptor....
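A possible variation, assuming a reasonably recent expl3 in which the string functions \str_count:n and \str_count_ignore_spaces:n are available, avoids replacing spaces by x. This is a sketch rather than a tested replacement; the name \strnumchars is invented here. Note that these functions work on the string representation of the argument, so under pdfTeX multi-byte characters are still counted byte by byte.

```latex
\documentclass{article}
\usepackage{expl3,xparse}
\ExplSyntaxOn
% \strnumchars is a hypothetical name used only in this sketch
\NewDocumentCommand \strnumchars { s m }
 {
  \noindent The~string~``#2''~has~
  \IfBooleanTF { #1 }
    { \str_count_ignore_spaces:n { #2 } } % starred form: spaces are skipped
    { \str_count:n { #2 } }               % unstarred form: spaces are counted
  \c_space_tl characters. \par
 }
\ExplSyntaxOff
\begin{document}
\strnumchars{that's not it!}
\strnumchars*{that's not it!}
\end{document}
```

The unstarred call should report 14 characters and the starred call 12, matching the counts of the token-list version above.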

cfr
  • 198,882
  • Your code, when run on the UTF8-encoded string "öüß", returns 6 rather than 3 as the character count. Is there a way to adjust the macro so that it counts characters rather than bytes in the string? – Mico Jul 15 '15 at 07:55
  • @Mico It does count 6 if one uses pdfTeX, if you use LuaTeX or XeTeX it does count three. Not the perfect solution, I know. – Manuel Jul 15 '15 at 08:10
  • @Manuel Yes, but with Xe/LuaTeX the count is right but the content of the string disappears and you end up with The string “” has 3 characters. which is not good (and makes the statement false even though the count was originally right). – cfr Jul 15 '15 at 12:43
  • @Mico I don't know. I'm very new to this syntax. Perhaps this is why egreg didn't answer? – cfr Jul 15 '15 at 12:43
  • @Mico For now, I've just explained the limitation.... – cfr Jul 15 '15 at 12:49
  • 1
    @cfr That only has to do with the font used, if you load fontspec in Xe-LuaTeX it all goes well. But, as I've shown in my answer (not complete, though), it can also be done in pdfTeX. – Manuel Jul 15 '15 at 12:57
  • 2
    @Manuel 'Characters' is poorly defined in the question: for example, what about foo{bar} (assuming normal catcodes). I think provided the details are clear any reasonable outcome is OK. (To count öüß with pdfTeX you'd need to convert to LICR, for example.) – Joseph Wright Jul 15 '15 at 12:57
  • @cfr - Just as an aside: To get your code to print The string “öüß” has (x) characters rather than The string “” has 3 characters, it suffices to load the inputenc package with the option utf8 and the fontenc package with the option T1. – Mico Jul 15 '15 at 13:29
  • @Mico I get 6 in that case. (The string is printed correctly but the count is wrong.) – cfr Jul 15 '15 at 17:31
  • Sorry, I meant to write 6, not 3. Your code, at least, gives the "correct incorrect" result of 6 when run on "öüß" under pdfLaTeX. In contrast, the code in the answers of Steven and Cão returns 5 (when run under pdfLaTeX and the inputenc package is loaded with the option utf8). I can understand where 6 comes from (viz, counting the bytes used to encode the characters), but I have no idea where the 5 result may come from. – Mico Jul 15 '15 at 23:07
  • @Mico Oh, I see. Yes, it does. The 6 made sense to me. Not sure what to say about 5 so I'm glad my code didn't produce it else I'd have to worry about it! – cfr Jul 15 '15 at 23:18
2

The problem Mico points out can be solved in @cfr's solution just by using LuaTeX or XeTeX. If one is bound to the pdfTeX engine, a possible solution is to use the amazingly powerful l3regex package.

Edit: As egreg pointed out, I didn't know that there were so many multibyte prefixes.

\documentclass{scrartcl}
\usepackage{xparse,l3regex}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\ExplSyntaxOn
\NewDocumentCommand \numchars { s m }
 {
  \group_begin:
  \tl_set:Nn \l_tmpa_tl { #2 }
  \IfBooleanF { #1 } { \tl_replace_all:Nnn \l_tmpa_tl { ~ } { x } }
  \regex_replace_all:nnN { [\x{C2}-\x{DF}].   } { x } \l_tmpa_tl
  \regex_replace_all:nnN { [\x{E0}-\x{EF}]..  } { x } \l_tmpa_tl
  \regex_replace_all:nnN { [\x{F0}-\x{F4}]... } { x } \l_tmpa_tl

  The ~ string ~ ``#2'' ~ has ~ \tl_count:N \l_tmpa_tl \space characters
  \IfBooleanT { #1 } { ~ (ignoring ~ whitespace)} .\par
  \group_end:
 }
\ExplSyntaxOff

\begin{document}
  \numchars{ßöü—} % em-dash
  \numchars{everything}
  \numchars*{everything}
  \numchars{that's not it!}
  \numchars*{that's not it!}
  \numchars{weird}
  \numchars*{weird}
\end{document}

Result

The string “ßöü—” has 4 characters.
The string “everything” has 10 characters.
The string “everything” has 10 characters (ignoring whitespace).
The string “that’s not it!” has 14 characters.
The string “that’s not it!” has 12 characters (ignoring whitespace).
The string “weird” has 5 characters.
The string “weird” has 5 characters (ignoring whitespace).
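The three replacements can also be viewed from the other side: in UTF-8, every character contributes exactly one byte that is not a continuation byte (continuation bytes lie in the range 80 to BF). The following sketch counts those bytes with \regex_count:nnN. It assumes pdfLaTeX with utf8 input and an expl3 release in which the regex module is part of the kernel; it is not suitable for XeTeX or LuaTeX, where characters are not split into bytes. The name \leadbytecount is invented here.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{expl3,xparse}

\ExplSyntaxOn
% \leadbytecount is a hypothetical name used only in this sketch
\NewDocumentCommand \leadbytecount { m }
 {
  \group_begin:
  % count every token that is not a UTF-8 continuation byte (80 to BF);
  % spaces are tokens too, so they are included in the count
  \regex_count:nnN { [^\x{80}-\x{BF}] } { #1 } \l_tmpa_int
  \noindent The ~ string ~ ``#1'' ~ has ~
  \int_use:N \l_tmpa_int \c_space_tl characters. \par
  \group_end:
 }
\ExplSyntaxOff

\begin{document}
  \leadbytecount{ßöü—} % em-dash
  \leadbytecount{that's not it!}
\end{document}
```

A single pass over the argument replaces the three range-specific substitutions, at the price of no longer checking that each prefix is followed by the right number of continuation bytes.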

Manuel
  • 27,118
  • I'm not sure if I got the active characters of multibyte unicode right: ^^c3, ^^e2, ^^f0? – Manuel Jul 15 '15 at 08:43
  • 2
    There are many more. Two byte prefixes are from C2 to DF; three byte prefixes from E0 to EF; four byte from F0 to F4. – egreg Jul 15 '15 at 10:59
  • Okay, I definitely did not get them right :) Any suggestion as of how to make it easier? Rather than having around 30 lines of code? By the way, where do you learn that? I googled and found nothing. – Manuel Jul 15 '15 at 11:21
  • Perhaps force conversion to LICR first (\protected@edef then some work to cut the result down to the essentials). – Joseph Wright Jul 15 '15 at 13:00
  • @JosephWright You better add that answer, I don't know what LICR is :) – Manuel Jul 15 '15 at 13:01
  • Not very efficient, but you can do \regex_const:Nn \c_utf_two_byte_prefix_regex { (\x{C2}|\x{C3}|\x{C4}|<etc>) . } and similarly for the other prefixes and use \regex_replace_all:NnN \c_utf_two_byte_prefix_regex { x } – egreg Jul 15 '15 at 13:07
  • @egreg Is that faster than just many many \regex_replace_all:nnN? – Manuel Jul 15 '15 at 13:20
  • I did no benchmarking. I didn't try at all, TBH. – egreg Jul 15 '15 at 13:26
  • @egreg Much easier: \regex_replace_all:nnN { [\x{C2}-\x{DF}]. } { x } \l_tmpa_tl. Yeah, efficiency won't be the best here :) – Manuel Jul 15 '15 at 13:30
0

What about the following solution?

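 \def\numchars #1{%
   \setbox0=\hbox{\tt#1}% typeset the string in the monospaced font
   \setbox1=\hbox{\tt 1}% reference box: exactly one character wide
   \count100 = \wd0 % width of the string, in scaled points
   \count101 = \wd1 % width of a single character, in scaled points
   \divide\count100 by \count101
   The string "#1" has \the\count100\ characters.
}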

\numchars{This is the character"a!}

will produce The string "This is the characterä!" has 23 characters.

The macro will work if the characters are in, or can be produced from the characters in, the monospaced font selected by \tt (for example cmtt10).
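The same width-division idea can be sketched for LaTeX as follows. This is an illustration under stated assumptions, not a robust implementation: \boxnumchars is a name invented here, the default OT1 Computer Modern typewriter font is assumed, and \frenchspacing is added as a precaution so that a space following sentence-ending punctuation is not widened.

```latex
\documentclass{article}
% \boxnumchars is a hypothetical name used only in this sketch
\newcommand{\boxnumchars}[1]{%
  \begingroup
  \setbox0=\hbox{\ttfamily\frenchspacing #1}% the string in a monospaced font
  \setbox2=\hbox{\ttfamily x}% reference box: exactly one character wide
  % e-TeX's \numexpr divides the two widths (both coerced to integers in sp)
  \noindent The string ``#1'' has \the\numexpr\wd0/\wd2\relax\ characters.\par
  \endgroup}
\begin{document}
\boxnumchars{everything}
\boxnumchars{that's not it!}
\end{document}
```

As with the plain TeX version, the result is only meaningful when every glyph in the box has the same width, so ligatures, kerning or missing glyphs in the chosen font would distort the count.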

ecki
  • 21