Parsing utf8 problems

Question

There are a number of questions about problems with the utf8 input encoding. The closest I found to my problem is, I think, "What characters in a normal text document will screw up LaTeX?".

A user of my censor package asked if I can get the package's \blackout macro to work with utf8 umlauts. So far, the answer is "no". I narrowed the problem down to the macro \bl@t which censors argument #2 and then reinvokes a recursion via argument #1. The problem, best I can tell, is that utf8 encoding requires more than 1 byte for things like umlauted characters, and so the #2 passed to \bl@t is only half a character, and so it chokes.
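
A minimal sketch of the suspected cause, assuming pdfLaTeX with [utf8]{inputenc} (on XeLaTeX or LuaLaTeX the ä is a single token and the probe would instead grab whatever follows it). The macro \probe is a hypothetical helper, not part of censor; it takes two undelimited arguments and reports the meaning of each without expanding them:

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

% Hypothetical probe (not part of censor): grab two undelimited arguments
% and write the meaning of each to the log, without expanding either one.
\def\probe#1#2{%
  \typeout{first argument:  \meaning#1}%
  \typeout{second argument: \meaning#2}}

\begin{document}
% Under pdfTeX the single character below arrives as two byte tokens.
\probe ä
\end{document}
```

If the diagnosis is right, the first line in the log should describe a macro beginning with \UTFviii@two@octets, and the second should describe the continuation byte on its own, so a one-argument grab only ever sees half of the character.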

Here is an MWE that, if you uncomment either of the two commented lines, will break the code:

\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{censor}

\makeatletter
\def\stringend{$}

\long\def\blackout#1{\def~{-}\censor@Block#1\stringend\let~\sv@tilde}
\long\def\censor@Block{\IfNextToken\stringend{\@gobble}%
  {\IfNextToken\@sptoken{ \bl@t{\censor@Block}}%
  {\bl@t{\censor@Block}}}}

\long\def\bl@t#1#2{\if\bpar#2\par\else\if.#2\censordot\else\censor{#2}\fi\fi#1}
\makeatother

\begin{document}
äöüß, \censor{äöüß}\par
\blackout{ab\par cd}\par
%\blackout{ä}\par

\makeatletter
%\bl@t xä
\end{document}

The package's \censor macro works fine on the umlauted stuff, but \blackout and more specifically, the \bl@t service routine, do not. If you want it more simplified, you can think of \bl@t as \def\bl@t#1#2{\censor{#2}#1} (but this will not work with \pars in the input stream). The #1 is always a reinvocation of \censor@Block on the remaining input string.

EDIT: It would seem that, if a multi-byte input character is next in the input stream, then this definition

\long\def\bl@t#1#2#3{\if\bpar#2#3\par\else\if.#2#3\censordot\else
       \censor{#2#3}\fi\fi#1}

can absorb it properly. Thus, reversing which invocations are commented:

%\blackout{ab\par cd}\par
\blackout{äöüß}\par

\makeatletter
\bl@t xä

works fine. So the key will be to be able to determine in advance which type of character lies next in the input stream and choose the appropriate parsing method.
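
One way to decide in advance how many byte tokens belong together, sketched under the assumption of pdfLaTeX with [utf8]{inputenc}: classify the lead byte by its numeric value. The names \utfbytecount and \utf@bytecount are hypothetical helpers, not part of censor, and the test only makes sense for character tokens (a control sequence such as \par would raise "Improper alphabetic constant" and needs to be filtered out beforehand).

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\makeatletter
% Hypothetical helper (not part of censor): expands to the number of bytes
% in the UTF-8 sequence that starts with the first token of #1.
% Assumes that token is a character token, not a control sequence.
\def\utfbytecount#1{\utf@bytecount#1\@nil}
\def\utf@bytecount#1#2\@nil{%
  \ifnum`#1<194 1\else   % ASCII (or a stray continuation byte)
  \ifnum`#1<224 2\else   % lead bytes 194..223 (0xC2..0xDF)
  \ifnum`#1<240 3\else   % lead bytes 224..239 (0xE0..0xEF)
  4\fi\fi\fi}            % lead bytes 240 and above
\makeatother

\begin{document}
\typeout{a needs \utfbytecount{a} byte(s)}
\typeout{a-umlaut needs \utfbytecount{ä} byte(s)}
\end{document}
```

The log should report 1 for the plain letter and 2 for the umlaut. A dispatcher could use such a count to choose between the two-argument and three-argument forms of \bl@t; decimal constants are used instead of hexadecimal ones so that babel's active " shorthand cannot interfere.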

Unicode characters indeed are more than one char (for pdfTeX). See @JLDiaz's nice answer here: http://tex.stackexchange.com/a/86300/5049 — cgnieder, Dec 03 '13 at 20:22
You might be able to adapt my code here: http://tex.stackexchange.com/a/34810/4427 — egreg, Dec 03 '13 at 20:27
@cgnieder So it would seem the approach should be to search somehow for the high bit set in #2. If set, don't \censor{#2}, but set aside that byte, and wait for the next one in the input stream and then recombine them. — Steven B. Segletes, Dec 03 '13 at 20:27
You don't have to look at the bit pattern, simply at the character's TeX definition, e.g. \showö will show > �=macro: ->\UTFviii@two@octets �. which tells you that the first byte of o-umlaut is the first byte of a two-octet sequence (you could get three or four, depending on the character) — David Carlisle, Dec 03 '13 at 21:09
Do you really intend \if\bpar#2\par ? \bpar seems to be \endgraf (i.e. \let to the primitive \par), so the effect is that any non-expandable token other than a character is replaced by \par. — David Carlisle, Dec 03 '13 at 22:11
@DavidCarlisle \bpar is a deprecated feature retained for backward compatibility. If it shows in the user's input stream, it should be treated as a \par. As far as \if vs. \ifx, I trust your interpretation better than my own limited understanding. The purpose of \blackout is to censor the input stream one character at a time, making allowances for space tokens, \par tokens, and periods. — Steven B. Segletes, Dec 03 '13 at 23:12
\if considers all (non-expandable) commands equal: \if\par\vskip is true. But even as edited it is not testing for \bpar in the input stream: \par, \bpar, \endgraf etc. would all test as true with \ifx\bpar#2 — David Carlisle, Dec 03 '13 at 23:14
@DavidCarlisle I see. In this case though, while converting a \vskip into a \par would not be ideal, I'm sure that \censor{\vskip{...}} would likewise not turn out well. — Steven B. Segletes, Dec 03 '13 at 23:18
@DavidCarlisle I want \bpar and \par to both test as true in that test, so what you have is correct. — Steven B. Segletes, Dec 03 '13 at 23:21
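
The difference between \if and \ifx discussed in these comments can be checked with a short sketch. It uses the primitives \hskip and \vskip in place of \par, to avoid depending on how \par is defined in a given kernel; that substitution is an assumption made for the illustration, not something taken from the thread.

```latex
\documentclass{article}
\begin{document}
% \if compares character code and category code; every non-expandable
% control sequence counts as the same (256, 16) pair, so this is true.
\typeout{if:  \if\hskip\vskip true\else false\fi}
% \ifx compares meanings, and the two primitives differ, so this is false.
\typeout{ifx: \ifx\hskip\vskip true\else false\fi}
\end{document}
```

The log shows "true" for the \if test and "false" for the \ifx test, which is why \if\bpar#2 also fires for unrelated non-expandable commands.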

score 6 · Accepted Answer · answered Dec 03 '13 at 22:28


Rather than worry about UTF-8, it's probably best to let inputenc worry about that and expand all characters to LICR (the LaTeX internal character representation); then you would also be able to support [latin1] or anything else. I also changed \if to \ifx, as \if looked wrong (although I don't really know what the code is doing :-)


\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{censor}

\makeatletter
\def\stringend{$}


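\long\def\blackout#1{%
  % Fully expand the argument (robust commands stay protected), so that
  % inputenc converts each input character to LICR before it is parsed.
  \protected@edef\tmp{#1}%
  \def~{-}\expandafter\censor@Block\tmp\stringend\let~\sv@tilde}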
\long\def\censor@Block{\IfNextToken\stringend{\@gobble}%
  {\IfNextToken\@sptoken{ \bl@t{\censor@Block}}%
  {\bl@t{\censor@Block}}}}

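\long\def\bl@t#1#2{%
  % #1 = continuation (\censor@Block), #2 = next item to censor.
  % \next is chosen inside the conditional and runs after the closing \fi.
  \ifx\bpar#2\let\next\par\else\ifx.#2\let\next\censordot\else\def\next{\censor{#2}}\fi\fi
  \next#1}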

\makeatother

\begin{document}
äöüß, \censor{äöüß}\par
abcd, \censor{abcd}\par
\blackout{ab\par cd}\par
\blackout{ä}\par


\end{document}


David Carlisle


I see that this works, but I'm trying to understand where LICR was invoked. Is that the babel package that does that? – Steven B. Segletes Dec 03 '13 at 23:14
@StevenB.Segletes LICR (LaTeX internal character representation) isn't babel; it is the basis of the LaTeX \protect mechanism and, in this case in particular, of the way inputenc works. Add \show\tmp on the line after the \protected@edef to compare \tmp (which is what gets passed to your censor code) with the initial input. – David Carlisle Dec 03 '13 at 23:19
Ah ha! It is the \protected@edef which does the conversion! Got it. Clever! – Steven B. Segletes Dec 03 '13 at 23:23
One final note: Using your \ifx version of \bl@t breaks the code at http://tex.stackexchange.com/questions/126291/list-of-underlining-packages-pros-and-cons/126358#126358, whereas using the \if version does not, so I think I will end up replacing your \ifx tests with \if. – Steven B. Segletes Dec 04 '13 at 11:36
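
The conversion performed by \protected@edef can be observed with a small sketch, assuming pdfLaTeX with [utf8]{inputenc}. The names \raw and \tmp are only illustrative, and \typeout{\meaning...} is used instead of \show so that the run does not stop for interaction:

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\makeatletter
\def\raw{ä}%             stored unexpanded: two active byte tokens
\protected@edef\tmp{ä}%  expanded by inputenc, robust commands kept
\typeout{raw: \meaning\raw}
\typeout{tmp: \meaning\tmp}
\makeatother
\end{document}
```

On the kernel of this thread's era, the second line is expected to show an LICR form built around \"a (wrapped in \IeC{...}) rather than raw bytes. The exact replacement text depends on the LaTeX release, so it is worth checking the log on the installation in use before relying on each character arriving as a single braced unit.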
