How to letter space (German) abbreviations automatically

Question

In German, abbreviations like "z. B." should be written with a space between the parts. In the good old typewriter days, the usual way to write them was without a space, and even today I often type them that way. Is there a way to automatically replace them with a correctly spaced version (e.g. with \, in between)?

I'm on LuaLaTeX, so a Lua solution would be okay.

Example document:

\documentclass{article}
\usepackage[ngerman]{babel}
\begin{document}
Dies ist z.B. ein Test.
\end{document}

This input should result in an output equivalent to z.\,B. in the document.

This is self-answered. If anyone has a better solution, I would be happy to see (and accept) it. — TeXnician, Aug 13 '17 at 18:05
Quick question: Are all patterns of the form <single letter>.<single letter>.? Please advise. — Mico, Aug 13 '17 at 18:37
@Mico There is for example etc.pp. (but I haven't seen that abbreviation in any text in my entire life). In general they might exist but are not that common. — Skillmon, Aug 13 '17 at 18:51
@Mico I agree with Skillmon: in most cases you're right about that pattern. — TeXnician, Aug 13 '17 at 19:48
Generally, I create macros like \zB (or in my case, \eg, or \ie) to give the desired spacing. \documentclass{article} \newcommand\zB{z.\,B.} \begin{document} Dies ist \zB{} ein Test. \end{document} That way, if I ever change my mind on the notion of "proper" spacing, a one line fix fixes the whole document. — Steven B. Segletes, Aug 14 '17 at 12:16

Mico · Accepted Answer · 2018-02-14T18:02:45.970

(I rewrote this answer rather significantly after discussions with OP and after receiving very significant coding help from @EgorSkriptunoff)

Here's a solution that doesn't pre-specify a list of all abbreviations for which thinspace should be inserted after interior periods (aka "full stops"). Instead, it sets up a pattern matching function to capture u.a., u.a.m., u.v.a.m., z.Zt., Bem.d.Red. and many more such cases. (See the function insert_thinspaces in the code below for the exact pattern matches that are performed.)

Observe also the use of unicode.utf8.gsub instead of string.gsub inside the Lua function insert_thinspaces. This lets the code deal correctly with non-ASCII-encoded letters, such as ä and Ä, which may occur in abbreviations.
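A minimal sketch of why the byte-oriented string library falls short here, runnable in plain Lua outside LuaTeX. It assumes UTF-8 input and the default C locale; the umlaut is written as byte escapes so that the snippet does not depend on the file encoding. The variable names a_umlaut and abbr are illustrative only.

```lua
-- "ä" in UTF-8 is the two bytes 0xC3 0xA4 (decimal 195, 164).
local a_umlaut = "\195\164"
local abbr = "u." .. a_umlaut .. "."   -- the abbreviation "u.ä."

-- string.* works on bytes: the single letter counts as two characters ...
print(#a_umlaut)                        -- 2

-- ... and, in the default C locale, neither byte is matched by %l,
-- so a byte-wise pattern cannot see "u.ä." as letter-dot-letter-dot.
print(abbr:find("%l%.%l%."))            -- nil

-- The pure-ASCII abbreviation "u.a." is matched as expected.
print(("u.a."):find("%l%.%l%."))        -- 1    4
```

Inside LuaLaTeX, unicode.utf8.gsub treats the umlaut as one letter, which is presumably why the answer prefers it.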

On the downside (potentially), this solution method doesn't capture abbreviations if they occur at the start of a sentence, e.g., Z.T. or U.U.; for what it's worth, your parallel answer currently doesn't catch such cases either, right?
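To make the reach of the patterns concrete, here is a small sketch that applies only the two-chunk pattern from insert_thinspaces. It uses plain string.gsub so that it runs in a standalone Lua interpreter (ASCII input only); the helper name space_two_chunks is made up for this illustration.

```lua
-- Two-chunk pattern from insert_thinspaces, applied with plain string.gsub
-- (ASCII only; the answer itself uses unicode.utf8.gsub inside LuaLaTeX).
local function space_two_chunks(s)
  -- gsub returns the new string and a match count; the outer
  -- parentheses keep only the string.
  return (s:gsub("(%l%.)(%a%l?%l?%.)", "%1\\,%2"))
end

print(space_two_chunks("Dies ist z.B. ein Test."))  -- Dies ist z.\,B. ein Test.
print(space_two_chunks("z.Zt. und v.Chr."))         -- z.\,Zt. und v.\,Chr.
print(space_two_chunks("Z.T. ist das so."))         -- unchanged: first chunk is uppercase
print(space_two_chunks("U.S.A. U.K."))              -- unchanged: no lowercase first chunk
```

Requiring a lowercase first chunk is what keeps U.S.A. untouched, and it is also the reason sentence-initial Z.T. slips through.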

The Lua function is assigned to the process_input_buffer callback via a LaTeX macro called \ExpandAbbrOn. If, for any reason, you need to suspend operation of the Lua function, simply execute the instruction \ExpandAbbrOff.

The code checks if the string to be processed lies inside a verbatim-like environment such as verbatim, Verbatim, and lstlisting; if that's the case, no processing is performed. And, with the latest iteration, the code now also ignores material that's in the arguments of inline-verbatim-like macros, such as \verb, \Verb, \lstinline, and \url. For sure, the contents of URL strings should never be processed by the Lua function, right?
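The idea of leaving macro arguments alone can be sketched in a much reduced form. The following is not the code used in this answer: it handles only brace-delimited \url{...}, with no other delimiters and no nesting of verbatim macros, and the helper names outside_url and thin are hypothetical.

```lua
-- Hypothetical helper: apply `fn` only to the text outside \url{...}.
local function outside_url(s, fn)
  local out, pos = {}, 1
  while true do
    -- %b{} matches a balanced pair of braces, so inner braces are skipped too.
    local from, to = s:find("\\url%b{}", pos)
    if not from then break end
    out[#out + 1] = fn(s:sub(pos, from - 1))  -- process the text before the macro
    out[#out + 1] = s:sub(from, to)           -- copy \url{...} unchanged
    pos = to + 1
  end
  out[#out + 1] = fn(s:sub(pos))              -- process the rest of the line
  return table.concat(out)
end

-- Stand-in for insert_thinspaces: the two-chunk pattern only, ASCII only.
local function thin(s)
  return (s:gsub("(%l%.)(%a%l?%l?%.)", "%1\\,%2"))
end

print(outside_url("u.U. \\url{u.U.aaaa.z.T.bbb} u.U.", thin))
-- u.\,U. \url{u.U.aaaa.z.T.bbb} u.\,U.
```

Delimiters such as |...| or +...+ cannot be matched with %b{}, which is where the marker-based approach in the full code comes in.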

% !TeX program = lualatex
\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage{fancyvrb}        % for "Verbatim" env.
\usepackage[obeyspaces]{url} % for "\url" macro
\usepackage{listings}        % for "\lstinline" macro

%% Lua-side code:
\usepackage{luacode} % for 'luacode*' environment
\begin{luacode*}

-- Names of verbatim-like environments
local verbatim_env = { "[vV]erbatim" , "lstlisting" }

-- By default, we're *not* in a verbatim-like env.:
local in_verbatim_env = false 

-- Specify number of parameters for every macro; use neg. numbers 
-- for macros that support matching pair of curly braces {} 
local all_macros = {
          verb = 1,
          Verb = 1,
          lstinline = -1,
          url = -1
}

-- List all poss. delimiters
local all_delimiters = [[!"#$%&'*+,-./:;<=>?^_`|~()[]{}0123456789]]

-- Quick check if "s" contains an inline-verbatim-like macro:
function quick_check ( s )
  if s:find("\\[vV]erb") or s:find("\\url") or s:find("\\lstinline") then
    return true
  else
    return false
  end
end

-- Function to process the part of string "s" that 
-- does *not* contain inline-verbatim-like macros
local function insert_thinspaces ( s ) 
  s = unicode.utf8.gsub ( s , 
      "(%l%.)(%a%l?%l?%.)(%a%l?%l?%.)(%a%l?%l?%.)", 
      "%1\\,%2\\,%3\\,%4" ) -- e.g. "u.v.a.m.", "w.z.b.w."
  s = unicode.utf8.gsub ( s , 
      "(%l%.)(%a%l?%l?%.)(%a%l?%l?%.)", 
      "%1\\,%2\\,%3" ) -- e.g., "a.d.Gr." 
  s = unicode.utf8.gsub ( s , 
      "(%u%l%l?%.)(%a%l?%l?%.)(%a%l?%l?%.)", 
      "%1\\,%2\\,%3" ) -- e.g., "Anm.d.Red."
  s = unicode.utf8.gsub ( s , 
      "(%l%.)(%a%l?%l?%.)", 
      "%1\\,%2" ) -- e.g., "z.T.", "z.Zt.", "v.Chr."
  return s
end

-- Finally, the main Lua function:
function expand_abbr ( s )
  -- Check if we're entering or exiting a verbatim-like env.;
  -- if so, reset the 'in_verbatim_env' "flag" and break.
  for i,p in ipairs ( verbatim_env ) do
    if s:find( "\\begin{" .. p .. "}" ) then
      in_verbatim_env = true
      break
    elseif s:find( "\\end{" .. p .. "}" ) then
      in_verbatim_env = false
      break
    end
  end
  -- Potentially modify "s" only if *not* in a verbatim-like env.:
  if not in_verbatim_env then
    -- Quick check if "s" contains one or more inline-verbatim-like macros:
    if quick_check ( s ) then
      -- See https://stackoverflow.com/a/45688711/1014365 for the source
      -- of the following code. Many many thanks, @EgorSkriptunoff!!
      s = s:gsub("\\([%a@]+)",
        function(macro_name)
          if all_macros[macro_name] then
            return
              "\1\\"..macro_name
              ..(all_macros[macro_name] < 0 and "\2" or "\3")
              :rep(math.abs(all_macros[macro_name]) + 1)
          end
        end
        )
      repeat
        local old_length = #s
        repeat
          local old_length = #s
          s = s:gsub("\2(\2+)(%b{})", "%2%1")
        until old_length == #s
        s = s:gsub("[\2\3]([\2\3]+)((["..all_delimiters:gsub("%p", "%%%0").."])(.-)%3)", "%2%1")
      until old_length == #s
      s = ("\2"..s.."\1"):gsub("[\2\3]+([^\2\3]-)\1", insert_thinspaces):gsub("[\1\2\3]", "")
    else
      -- Since no inline-verbatim-like macro found in "s", invoke 
      -- the Lua function 'insert_thinspaces' directly.
      s = insert_thinspaces ( s )
    end
  end
  return(s)
end

\end{luacode*}

%% LaTeX-side code: Macros to assign 'expand_abbr' 
%% to LuaTeX's 'process_input_buffer' callback.
\newcommand\ExpandAbbrOn{\directlua{%
  luatexbase.add_to_callback("process_input_buffer", 
  expand_abbr, "expand_abbreviations")}}
\newcommand\ExpandAbbrOff{\directlua{%
  luatexbase.remove_from_callback("process_input_buffer", 
  "expand_abbreviations")}}
\AtBeginDocument{\ExpandAbbrOn} % enabled by default

%% Just for this example:
\setlength\parindent{0pt}
\obeylines

\begin{document}

Dies ist u.U. ein Test.
\begin{Verbatim}
Dies ist u.U. ein Test.
\end{Verbatim}

z.B. u.a. u.Ä. u.ä. u.U. a.a.O. d.h. i.e. v.a.
i.e.S. z.T. m.E. i.d.F. z.Z. u.v.m. z.Zt.
u.v.a.m. b.z.b.w. v.Chr. a.d.Gr. Anm.d.Red.

\begin{verbatim}
z.B. u.a. u.Ä. u.ä. u.U. a.a.O. d.h. i.e. v.a.
i.e.S. z.T. m.E. i.d.F. z.Z. u.v.m. z.Zt.
u.v.a.m. b.z.b.w. v.Chr. a.d.Gr. Anm.d.Red.
\end{verbatim}

U.S.A. U.K. % should *not* be processed

\lstinline|u.a. u.Ä. u.ä.|; \Verb$u.a. u.Ä. u.ä.$

% nested verbatim-like macros
\Verb+u.U.\lstinline|u.U.|u.U.+  \lstinline+u.U.\Verb@u.U.@u.U.+

% 2 URL strings
u.U. \url{u.U.aaaa.z.T.bbb_u.v.a.m.com} u.U.
u.U. \url?u.U.aaaa.z.T.bbb_u.v.a.m.com? u.U.
\end{document}

Great additions and changes. Do you think it's even possible to capture the sentence starters without knowing the exact abbreviations? — TeXnician, Aug 13 '17 at 19:49
Well, If you'd expand your example that it also checks for some given inline verbatim commands (\lstinline!z.B.!, \verb+z.B.+), I'll accept it. — TeXnician, Aug 13 '17 at 19:54
@TeXnician - I've added code for LaTeX macros named \ExpandAbbrOn and \ExpandAbbrOff, resp. (The default state is "\ExpandAbbrOn".) This is useful not only for dealing with inline verbatim material -- such as \lstinline!z.B.! and \verb+u.U.+ -- but also for dealing with URL strings which happen to contain substrings of the form z.B. and u.U. Automating the exclusion of inline verbatim material of this type is quite tricky, because the delimiters need not be a matching pair of curly braces. Instead, as your examples show, the delimiters may also be !...!, +...+, etc. Arrgh! — Mico, Aug 13 '17 at 20:08
@TeXnician - Would you be willing to accept a solution that suppresses the operation of the abbreviation expansion on a line that contains one or more instances of \lstinline, \verb, or \url? The heuristic is that input lines that contain, say, \url{...} are quite unlikely to contain abbreviations such as u.a. and u.U., right? — Mico, Aug 13 '17 at 21:51
Well, my heuristic is quite different: I sometimes write "Ein einfaches LaTeX-Makro ist z.B. \lstinline!z.B.!" (of course not that content), so working with whole lines is quite problematic. One option for implementing the character matching (I will try it today): search for an inline listing macro name (index), use index + length to get the delimiter character, find the next occurrence of that character, and do not replace anything in between. — TeXnician, Aug 14 '17 at 05:39
@TeXnician - Fair enough! I'll await your implementation of the idea you laid out in the preceding comment. — Mico, Aug 14 '17 at 05:44
Just found another abbreviation that doesn't match the pattern (which isn't very common either, but I think more likely to appear than etc.pp.): z.Hd. for "zu Händen" — Skillmon, Aug 14 '17 at 09:09
@Skillmon - I suppose it's not particularly burdensome to add a third search pattern, viz., <lowercase letter>.<uppercase letter><lowercase letter>. Can you think of other examples (besides "z.Hd.") that might fit this specific pattern? — Mico, Aug 14 '17 at 09:38
@Skillmon - In the meantime, I've also come up with the abbreviation "u.v.a.m." ("und viele andere mehr"). Can you think of additional examples that would fit this four-letter-four-dots pattern? — Mico, Aug 14 '17 at 09:40
@Mico "o.B.d.A." ("ohne Beschränkung der Allgemeinheit") is quite common in mathematical theses, I guess. — Skillmon, Aug 14 '17 at 09:47
@Mico "w.z.b.w." ("was zu beweisen war") and "u.d.Nb." ("unter der Nebenbedingung") might occur from time to time, too. And there are "n.Chr." and "v.Chr." ("nach/vor Christi (Geburt)"). "z.Zt." ("zur Zeit"), "a.d.Gr." ("aus diesem Grund"), "a.d.Bs." ("aus dem Besitz"), "Anm. d. Red." ("Anmerkung der Redaktion" -- I have no idea how this should be formatted). There might be some names of cities with stuff like "a.Rh." ("am Rhein") or similar... — Skillmon, Aug 14 '17 at 10:28
@Mico I've tried, but my attempts to implement a working inline verbatim approach were either too slow for productive use or did not work altogether. But Skillmon has a nice approach. — TeXnician, Aug 14 '17 at 16:36

score 13 · Answer 2 · edited Feb 14 '18 at 18:11

Here's a simple solution to the problem. It uses the Lua callback process_input_buffer to scan each input line for one of the given abbreviations and insert a small space in there.

For that action you only need to specify the abbreviation (table key) you want to replace with a spaced version (table value). That mechanism can, of course, be used to simply replace any content (table key) with some other (table value).
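The key/value mechanism can be tried on its own in plain Lua. The sketch below is ASCII-only and uses string.gsub instead of unicode.utf8.gsub so that it runs outside LuaTeX; the helper names spaced and as_pattern are made up for this illustration.

```lua
-- Build the spaced form of an abbreviation: every dot except the last
-- is followed by a thin space (\,).
local function spaced(abbr)
  return (abbr:sub(1, -2):gsub("%.", ".\\,")) .. "."
end

-- A dot is magic in Lua patterns, so escape it before using
-- the abbreviation itself as a search pattern.
local function as_pattern(abbr)
  return (abbr:gsub("%.", "%%."))
end

local line = "Dies ist z.B. ein Test, d.h. ein Beispiel."
for _, abbr in ipairs({ "z.B.", "d.h." }) do
  line = line:gsub(as_pattern(abbr), spaced(abbr))
end
print(line)  -- Dies ist z.\,B. ein Test, d.\,h. ein Beispiel.
```

Without the escaping step, the key z.B. would also match strings such as zxBy, because an unescaped dot matches any character.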

This solution also lets you use verbatim input, but you have to make the verbatim environment known to the script. If the script didn't check for verbatim environments, the substitutions would show up in the typeset verbatim text.

Note that the function works but may not be the fastest: for every input line, it checks against every verbatim environment known to the script whether the line starts or ends that environment.

Update: I have simplified the user input (the dictionary) by generating the required spaces automatically. The most problematic cases mentioned in the discussions are abbreviations at the start of a sentence and in inline verbatim. The first may be handled with this version (check for a dot, a space, and then the abbreviation with a capitalized letter), but the second may be very hard to detect and change.

\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage{luacode}
\begin{luacode}

local tabbr = {"z.B.","u.a.","u.Ä.","u.ä.","u.U.","a.a.O.","d.h.","i.e.",
  "i.e.S.","v.a.","z.T.","m.E.","i.d.F.","z.Z.","u.v.m.","u.v.a.m.",
  "z.Hd."}
local verbenv = {"[vV]erbatim","lstlisting"}
local tsub = {}
local inverb = false

function createsubstitutes()
  for i,p in ipairs(tabbr) do
    tsub[p] = unicode.utf8.gsub(p:sub(1,p:len()-1), "%.", ".\\,") .. "."
  end
end

function expandabbr(s)
    for i,p in ipairs(verbenv) do
      if s:find("\\begin{" .. p .. "}") then
        inverb = true
        break
      end
      if s:find("\\end{" .. p .. "}") then
        inverb = false
        break
      end
    end
    if not inverb then
      for k,v in pairs(tsub) do
        s = unicode.utf8.gsub(s, k:gsub("(%.)","%%%1"), v)
      end
    end
  return(s)
end

\end{luacode}
\AtBeginDocument{%
  \luaexec{createsubstitutes()}
  \luaexec{luatexbase.add_to_callback("process_input_buffer", expandabbr, "expand_abbreviations")}%
}
\begin{document}
Dies ist z.B. ein Test.\\
In dieser Zeile gibt es z.B. \verb+z.B.+
\begin{verbatim}
Test z.B.
\end{verbatim}
\end{document}

It might enhance the performance to run a script that searches and replaces the faulty abbreviations from the command line instead of inside TeX, this way the script only has to be run once and doesn't execute on every TeX run. And you might want to check for inline-verbatim, too. And I guess that rare cases where you use something like \begin{lstlisting}\begin{verbatim}foo\end{verbatim}z.B.\end{lstlisting} could slip through your check. — Skillmon, Aug 13 '17 at 18:17
+1. To capture the Verbatim environment (provided by the fancyvrb package) as well, just change local verbenv = {"verbatim","lstlisting"} to local verbenv = {"[vV]erbatim","lstlisting"}. — Mico, Aug 13 '17 at 18:35
@Skillmon - In addition to thinking about inline-verbatim material, one might also want to think about URL strings that just happen to contain substrings of the form u.a., u.U., etc. — Mico, Aug 13 '17 at 19:40
@Skillmon I do not like the idea of having an external program. That's why I've asked the TeX question. But the check for inline verbatim is a bit difficult to deal with. — TeXnician, Aug 13 '17 at 19:52

Skillmon · Answer 3 · 2017-08-14T16:10:46.697

Just a minor tweak to @Mico's answer. I put the replacement in a loop to match abbreviations of unknown length. The drawback is that each chunk of an abbreviation must meet one of the defined patterns.

For example, o.B.d.A. would get evaluated by the second gsub pattern "(%l%.)(%a)" and replaced by o.\,B.d.\,A.. In the next loop iteration the pattern wouldn't match, because the chunk B.d. doesn't fit it. I guess the combination of the two patterns should match almost every abbreviation without creating too many false positives.
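The behaviour described above can be checked in a standalone Lua interpreter. The sketch below uses plain string.gsub (ASCII only) rather than unicode.utf8.gsub, and the function names two_chunk_loop and both_patterns_loop are invented for this illustration. The first function applies only the two-chunk pattern; the second applies both patterns in the order used in the code below.

```lua
-- Only the two-chunk pattern, applied repeatedly until nothing changes.
local function two_chunk_loop(s)
  local n
  repeat
    s, n = s:gsub("(%l%.)(%a)", "%1\\,%2")  -- n is the number of replacements
  until n == 0
  return s
end
print(two_chunk_loop("o.B.d.A."))      -- o.\,B.d.\,A.

-- Both patterns, three-chunk first, until neither of them matches any more.
local function both_patterns_loop(s)
  local n1, n2
  repeat
    s, n1 = s:gsub("(%l%.)(%a%.)(%a)", "%1\\,%2\\,%3")
    s, n2 = s:gsub("(%l%.)(%a)", "%1\\,%2")
  until n1 == 0 and n2 == 0
  return s
end
print(both_patterns_loop("o.B.d.A."))  -- o.\,B.\,d.\,A.
```

With the two-chunk pattern alone, the chunk B.d. stays unspaced because its first letter is uppercase; the three-chunk pattern covers it, since only its first chunk has to be lowercase.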

Another tweak I made was to check only for the closing verbatim name that matches the first opening one. With this, a construct of several nested verbatim environments is evaluated correctly. Inline verbatim is still missing, though, as are other exceptions like \url.
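A stripped-down sketch of that bookkeeping, runnable in plain Lua: it remembers the name of the environment that was opened first and ignores every other \begin or \end until the matching \end shows up. The names env_names, current and in_verbatim are illustrative, and unlike the code below this sketch also treats the closing line itself as verbatim.

```lua
local env_names = { "verbatim", "Verbatim", "lstlisting" }
local current = nil  -- name of the environment we are inside, if any

-- Returns true if `line` belongs to a verbatim-like environment.
local function in_verbatim(line)
  if current == nil then
    for _, name in ipairs(env_names) do
      -- plain find (4th argument true): no pattern magic, case-sensitive
      if line:find("\\begin{" .. name .. "}", 1, true) then
        current = name
        return true
      end
    end
    return false
  end
  -- Inside an environment: only its own \end{...} closes it.
  if line:find("\\end{" .. current .. "}", 1, true) then
    current = nil
  end
  return true
end

for _, line in ipairs({
  "\\begin{Verbatim}", "\\begin{verbatim}", "\\end{verbatim}",
  "u.U. bleibt unangetastet", "\\end{Verbatim}", "u.U. wird ersetzt",
}) do
  print(in_verbatim(line), line)  -- true for the first five lines, false for the last
end
```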

EDIT: I also wrote a function that parses inline verbatim correctly, but only if there is just one kind of inline verbatim command per line and if it ends on the same line.

\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage{fancyvrb} % for "Verbatim" environment
\usepackage{luacode}
\usepackage{url}

%% Lua-side code:
\begin{luacode}
local verbatim_env = { "verbatim" , "Verbatim" , "lstlisting" }
local verbatim_inl = { "verb" , "lstinline" }
-- by default, *not* in a verbatim-like env.:
local cur_verbatim_env = nil
local cur_verbatim_inl = nil
function replace_abbrs ( s )
    local rep_rep1 = 1
    local rep_rep2 = 1
    while rep_rep1 ~= 0 or rep_rep2 ~= 0 do
        s,rep_rep1 = unicode.utf8.gsub ( s , "(%l%.)(%a%.)(%a)", "%1\\,%2\\,%3" )
        s,rep_rep2 = unicode.utf8.gsub ( s , "(%l%.)(%a)",     "%1\\,%2" )
    end
    return(s)
end
function expand_inline_verb ( s , p )
    local r = ""
    while string.len(s) > 0 do
        local spos,epos = s:find( p.."%A" )
        if spos ~= nil then
            r = r .. replace_abbrs(s:sub(0,spos-1))
            r = r .. s:sub(spos,epos)
            local delim = s:sub(epos,epos)
            s  = s:sub(epos+1 , string.len(s))
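            -- plain find: the delimiter may be a pattern-magic character such as "+"
            local verb_end = s:find( delim , 1 , true )
            -- no closing delimiter on this line: copy the rest unchanged
            if verb_end == nil then verb_end = string.len(s) end
            r = r .. s:sub(1,verb_end)
            s = s:sub(verb_end+1 , string.len(s))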
        else
            r = r .. replace_abbrs(s)
            break
        end
    end
    return(r)
end
function expandabbr ( s )
    if cur_verbatim_env == nil then
        for i,p in ipairs ( verbatim_env ) do
            if s:find( "\\begin{" .. p .. "}" ) then
                cur_verbatim_env = verbatim_env[i]
                break
            end
        end
    elseif s:find( "\\end{" .. cur_verbatim_env .. "}" ) then
        cur_verbatim_env = nil
    end
    if cur_verbatim_env == nil and cur_verbatim_inl == nil then
        for i,p in ipairs ( verbatim_inl ) do
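            local pos = s:find( "\\" .. p , 1 , true )
            if pos ~= nil then
                -- the delimiter is the character right after the macro name
                local dpos = pos + string.len(p) + 1
                cur_verbatim_inl = s:sub( dpos , dpos )
                break
            end
        end
    elseif cur_verbatim_inl ~= nil then
        if s:find( cur_verbatim_inl , 1 , true ) then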
            cur_verbatim_inl = nil
        end
    end
    if cur_verbatim_env == nil then
        for i,p in ipairs ( verbatim_inl ) do
            if s:find( p ) then
                return(expand_inline_verb( s , p ))
            end
        end
        s = replace_abbrs ( s )
    end
    return(s)
end
\end{luacode}

%% LaTeX-side code:
\newcommand\ExpandAbbrOn{\directlua{%
  luatexbase.add_to_callback("process_input_buffer", 
  expandabbr, "expand_abbreviations")}}
\newcommand\ExpandAbbrOff{\directlua{%
  luatexbase.remove_from_callback("process_input_buffer", 
  "expand_abbreviations")}}
\AtBeginDocument{\ExpandAbbrOn} % enabled by default

%% Just for this example:
\setlength\parindent{0pt}
\obeylines

\begin{document}
Dies ist u.U. ein Test.
\begin{Verbatim}
Dies ist u.U. ein Test.
\end{Verbatim}

\begin{Verbatim}
\begin{verbatim}
\end{verbatim}
Dies ist u.U. ein schwierigerer Test.
\end{Verbatim}

z.B. u.a. u.Ä. u.ä. u.U. a.a.O. d.h. i.e. 
i.e.S. v.a. z.T. m.E. i.d.F. z.Z. u.v.m.
z.Zt. o.B.d.A. a.d.Gr. n.Chr. Anm.d.Red.
\verb|z.Zt. o.B.d.A. a.d.Gr. n.Chr. Anm.d.Red.|
\begin{verbatim}
z.B. u.a. u.Ä. u.ä. u.U. a.a.O. d.h. i.e. 
i.e.S. v.a. z.T. m.E. i.d.F. z.Z. u.v.m.
\end{verbatim}
U.S.A. U.K.
\ExpandAbbrOff % turn off abbreviation expansion
A tricky URL: \url{u.U.aaaa.z.T.bbb}
\end{document}

I like the concept. As far as I can tell, one would have to iterate char-by-char to correctly handle inline verbatim. But maybe I'll find another way. — TeXnician, Aug 14 '17 at 12:17
@TeXnician done the inline verbatim problem (but not perfectly). — Skillmon, Aug 14 '17 at 15:54
Very nice. Seems like a working version (in contrast to my attempts). — TeXnician, Aug 14 '17 at 16:33
@TeXnician it does work, but as I said only on verbatim commands that have the closing bit in the same line and only on the first type of verbatim inline command (so \verb|z.Z.|\lstinline|z.Z.| would only work on \verb). And any kind of space character as a verb-delimiter wouldn't work while theoretically it is correct syntax (e.g., \verb z.Z. | would fail though it's no problem for LaTeX) — Skillmon, Aug 14 '17 at 16:47
That's what I consider working. Extreme cases are never covered perfectly. But your version does what's intended for: replacing abbreviations in a "normal" document. — TeXnician, Aug 14 '17 at 16:51
@TeXnician and the ending of inline verbatim in another line is prohibited at least for \verb and \lstinline. So the only "true" issue are the space characters. Oh and \url and the like in addition to occurrence at the beginning of a sentence. — Skillmon, Aug 14 '17 at 16:52
Just realized that space characters aren't supported by LaTeX for \verb (at least not space or ^I) — Skillmon, Aug 14 '17 at 16:58
@TeXnician - I'm afraid I've hit a dead-end -- my Lua programming skills just aren't that solid. :-( I'm going to post a query to stackoverflow.com, asking for some help on how to go about solving this. — Mico, Aug 14 '17 at 18:56
@Mico how did you try to solve it? My Lua skills are close to nonexistent (and the fact that Lua doesn't have a string.split() function as far as I know added to it). I guess with some reworking and testing I could get it to work for several verbatim inline functions in the same row; the problem I do have to solve is recognizing nested use (and preferably without character-wise parsing). For me Lua coding feels like working with an ugly version of C. — Skillmon, Aug 14 '17 at 20:31
@Skillmon - Thanks. I posted a query to stackoverflow last night; I haven't had a chance yet to look at the answers and decide how I might use them. About Lua: it's designed to be extremely "light-weight", and fast. Its string functions provide pattern matching, not regular expression machinery. (I read somewhere that somebody has written an external library for Lua to provide full regex; however, this external library is larger than the entire (basic) Lua program.) I hope to have a working solution by the end of the day. (I do have a day job.) — Mico, Aug 15 '17 at 05:24
@Mico I know about the design choices of Lua to be lightweight. My window manager (Awesome) makes use of it and is fully configurable in it. But I don't like much of its syntax and often think doing the stuff in C would be easier (for me). I'll take a look at your query myself :) — Skillmon, Aug 15 '17 at 07:59
@Skillmon - I've updated my answer, thanks to some excellent code obtained in response to my query over on stackoverflow. — Mico, Aug 15 '17 at 19:33
