Detect beginning of a sentence in a macro for capitalization

Question

Is it possible to make macros give different results at the beginning of a new sentence? Suppose that I want the macro "\secname" to write "Section" at the beginning of a new sentence, and "section" anywhere else. How can I do so?

If you would like to say "\secname~2 is about blah." and also "In \secname~2 we talk about blah", then "Section" should be capitalized in both cases. At least according to Strunk & White. — Matthew Leingang, Nov 02 '10 at 15:37
@Matthew: This is common in styles, but not universal. E.g., the CUP style guide expresses a preference for not capitalising these (see Butcher's Copy-editing, 2006, p.129). — Charles Stewart, Nov 10 '10 at 09:05
The capitalisation difference is a red herring: there are other ways in which the output could be different. For example several journals require "Section" at the start of a sentence but "Sec." otherwise (both capitalised). — andybuckley, May 06 '14 at 15:09

score 23 · Accepted Answer · edited Nov 03 '10 at 14:12

You could set the \sfcode of the "end of sentence" chars to something different and test for it:

 \documentclass[10pt]{report}
 \sfcode`\.=1001
 \sfcode`\?=1001
 \sfcode`\!=1001
 \sfcode`\:=1001
 \newcommand\secname{\ifnum\spacefactor=1001 Secname\else secname\fi}
 \begin{document}
 abc. \secname\ is \secname.

 e.g.\@ \secname
 \end{document}
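
To see what the test compares against, it can help to print the space factor directly. The following is a minimal sketch, assuming the same \sfcode setting for the period as above and LaTeX's default codes otherwise; \showsf and \sfnow are hypothetical helper names, not part of LaTeX:

 \documentclass{article}
 \sfcode`\.=1001
 % \showsf (hypothetical helper): store the current space factor first,
 % because typesetting "[" would itself reset it to 1000, then print it
 \newcommand\showsf{\edef\sfnow{\the\spacefactor}[\sfnow]}
 \begin{document}
 % should print [1001] [1250] [1000] [1000]
 abc.\showsf\ abc,\showsf\ ABC.\showsf\ e.g.\@\showsf
 \end{document}

The third value shows why a period after a capital letter is not treated as a sentence end: an uppercase letter sets the space factor to 999, and TeX then caps the value after the following period at 1000.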

\nonfrenchspacing also sets the \sfcodes. In that case you could use something like this:

 \documentclass[10pt]{report}
 \nonfrenchspacing
 \newcommand\secname{\ifnum\spacefactor>1900 Secname\else secname\fi}
 \begin{document}
 abc. \secname\ is \secname.
 abc: \secname, \secname.
 e.g.\@ \secname
 \end{document}
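
Two problems are reported in the comments on this answer: \spacefactor may only be read in horizontal mode, so \secname fails at the very start of a paragraph, and the macro needs a \protect in moving arguments such as captions. A possible workaround, offered as a sketch rather than a definitive fix, is to guard the test with \ifhmode and to define the macro with \DeclareRobustCommand. Treating anything outside horizontal mode as a sentence start is an assumption of this sketch:

 \documentclass[10pt]{report}
 \nonfrenchspacing
 % robust, so the test is not evaluated prematurely in moving arguments
 \DeclareRobustCommand\secname{%
   \ifhmode
     \ifnum\spacefactor>1900 Secname\else secname\fi
   \else
     Secname% assumption: outside horizontal mode a new sentence starts
   \fi}
 \begin{document}
 \secname\ is \secname. \secname\ again.
 \end{document}

The abbreviation caveat is unchanged: after "e.g." you still need \@ (or "\ " before the period's space factor matters) to suppress the capital.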

answered Nov 03 '10 at 13:36 by Ulrike Fischer · edited Nov 03 '10 at 14:12 by Hendrik Vogt

Wow, using \sfcode is a cool idea. The only drawback is that you have to use \@ instead of \ if you want to say that no sentence is ended here. (I don't fully understand what's happening here.) By the way, I slightly modified your code so that \secname does not automatically print a space; hope that's OK. – Hendrik Vogt Nov 03 '10 at 14:11
Using the spacefactor is a good attempt. However, it will still not do proper sentence disambiguation. For example, try \newcommand\secname{\ifnum\spacefactor=1001 (See Section) \else (see Section)\fi} with the text: Corporation ABC \secname issued a paper written by John et.al. \secname which was well received by the press. \secname. Since TeX assigns capitals an \sfcode of 999, it will work well with most abbreviations (except lower-case or mixed-case abbreviations, such as Pty. Ltd. etc.). – yannisl Nov 03 '10 at 14:27
\@ simply sets \spacefactor=1000, so it overwrites the \spacefactor set by the period. – Ulrike Fischer Nov 03 '10 at 14:29
@Ulrike: Thanks for adding this. (I had already looked that up, but probably not everybody else knows.) – Hendrik Vogt Nov 03 '10 at 15:23
Inserting \secname\ is \secname right after \begin{document} produces errors. – kiss my armpit Sep 13 '13 at 03:39
Cool: very useful for automatically implementing Elsevier's Sec./Section formatting rule. The only gotcha is that it requires a \protect when called from a float caption (or similar), since \spacefactor is only available in horizontal mode. (Maybe declaring \secname via \DeclareRobustCommand would automatically solve that issue, but I've not yet tried.) – andybuckley May 06 '14 at 16:04
Are the authors of the cleveref package aware of these tricks? I would love to see an option to automatically switch between \cref and \Cref. Is any of them a user here who could be tagged? Does anybody know whether the package avoids these techniques for some particular reason? – Nicola Gigante Jun 15 '15 at 12:33
@gigabytes (being late on this -- and I'm not an author of cleveref) too many exceptions to be a reliable test. Typing \Cref instead of \cref makes no real difference during typing. You uppercase your own sentences anyway and don't expect TeX to do so for you. The possible number of cases where users would have to proofread their entire document to look for all those \cref's and whether they are in the right case just doesn't seem worth it. – Skillmon Dec 03 '18 at 08:03

score 15 · Answer 2 · edited Nov 03 '10 at 10:03

The easiest way is to define two macros:

\def\secname{section}
\def\Secname{Section}

Determining sentence boundaries is a very difficult task and one of the hottest topics in computational linguistics. To do it properly you would need to recognise that in "Dr. Who", for example, the period after "Dr" does not end a sentence, so you would have to parse for all abbreviations. And if you think you can simply test whether the next word starts with a capital letter, think of "e.g." and all the other Latin abbreviations we use.
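
As a usage sketch (the sentences are borrowed from the comments on the question, and \newcommand replaces \def only so that LaTeX reports a clash with an existing name), the author picks the variant by hand, just as with any other capitalised word:

\documentclass{article}
\newcommand\secname{section}
\newcommand\Secname{Section}
\begin{document}
% the capitalised macro is chosen by the author at the sentence start
\Secname~2 is about blah. In \secname~2 we talk about blah.
\end{document}

For a journal style that wants "Sec." in mid-sentence and "Section" at the start, only the definition of \secname would need to change.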

answered Nov 02 '10 at 19:23 by yannisl · edited Nov 03 '10 at 10:03 by Hendrik Vogt

LaTeX has a simple end-of-sentence detection: period, space (but not '\ ') and a capital letter. When this sequence appears, the spacing between the period and the next word is wider than usual. I'll settle for matching that. Btw, the proper LaTeXing is "Dr.\ Who", to avoid that behavior exactly. – Little Bobby Tables Nov 03 '10 at 08:54
+1 for the example! And also for the answer since it makes for cleaner source code. Compare "And so to bed. \day~34; got up and had breakfast." with "And so to bed. \Day~34; got up and had breakfast." In the second, all the visual clues are correct in the source for delimiting the sentences. – Andrew Stacey Nov 03 '10 at 12:30
@Little Bobby Tables: I would use "Dr.~Who" instead of "Dr.\ Who". IMO it's easier to read and type. The former will also not break a line between "Dr." and "Who". – Matthew Leingang Nov 03 '10 at 14:08
@Little: It's a bit more complicated than that. TeX doesn't do any end of sentence detection. The spacing is completely dependent on the space factor, which is set by the \sfcode of a particular character (as well as in a few other ways). I explained this elsewhere. – TH. Nov 03 '10 at 20:02
I don't think special cases really matter for capitalizing cross references. What would Dr. \secname mean? – Nicola Gigante Jun 15 '15 at 12:35
@gigabytes what about "see e.g. section 3"? – Chris H Jun 15 '15 at 13:36
Yes, that's a case. But e.g. should always be followed by \ anyway – Nicola Gigante Jun 15 '15 at 13:38
@LittleBobbyTables that is incorrect. First: LaTeX doesn't do any end-of-sentence detection, second the space after the period won't be stretched if the period follows an upper case character (so "Henry VIII. was king" wouldn't be stretched) and the stretching doesn't rely on the next character being upper case (so "e.g. this will be stretched" will be stretched even if "this" doesn't start with an upper case character). Check e.g. \makebox[7cm][s]{Henry VIII. was king back then.}\par\makebox[7cm][s]{e.g. this will be stretched}. – Skillmon Dec 03 '18 at 08:46

score 3 · Answer 3 · answered Nov 03 '10 at 15:31

Before Ulrike posted her nice answer that uses \spacefactor, I had thought this would be impossible in TeX without redefining the period. Just for completeness: here's my answer that does redefine the period (after making it active, which is probably not such a good idea). Note that you do not have to use \@ as in Ulrike's solution.

\documentclass{article}
\let\period.
\catcode`.=\active
\let\qwe\relax
\futurelet\myspace{ }
\newcommand.{\period\futurelet\nextchar\testspace}
\newcommand\testspace{\ifx\nextchar\myspace\expandafter\eatspace\expandafter.\fi}
\def\eatspace. { \futurelet\nextchar\testsec}
\newcommand\testsec{\ifx\nextchar\secname\def\qwe{ }\fi}
\newcommand\secname{\ifx\qwe\relax section\else Section\let\qwe\relax\fi}
\begin{document}
abc. \secname\ is \secname.
abc: \secname, e.g.\ \secname.
\end{document}

Yes, this looks as if I had tried to make it as obscure as possible. Two interesting points: 1. Note the definition of \myspace (\space does not work!). 2. I didn't manage to use LaTeX's \ifnextchar to test if the next character is a space, so I used \futurelet.

I would certainly not redefine the period. Units will explode. Try e.g. \fontsize{1.2cm}{2cm}\selectfont abc. Regarding \@: \<space> is a primitive command, and I don't think that it would be a good idea to redefine it. But you could redefine \<tab>: \def\^^I{\spacefactor=1000\ }, then e.g.\<tab>\secname would work too. But one must be careful that the editor doesn't convert the tabs to spaces when saving the file. — Ulrike Fischer, Nov 03 '10 at 16:21
@Ulrike: I could say: No problem, use "," instead of "." ;-) But seriously, that's very interesting; I didn't know that dimensions need an explicit period or comma for the decimals. (Anyway my answer shouldn't be taken too seriously). — Hendrik Vogt, Nov 03 '10 at 17:17
Try this \begin{document} \secname\ is \secname. \end{document}. You will see it no longer works as expected. :-) — kiss my armpit, Sep 13 '13 at 03:42

score 2 · Answer 4 · edited Apr 13 '17 at 12:34

As an extension of @YiannisLazarides's answer, you can use mfirstuc, which is included with TeX Live 2014.

This will take care of the capitalisation for you, and you won't have to repeat yourself when defining the upper-case version of your command.

\documentclass{article}
\usepackage{mfirstuc}
\def\secname{fancy section}
\def\Secname{\expandafter\makefirstuc\expandafter{\secname}}

\begin{document}
Here's a new \secname{}. \Secname{} titles are good.
\end{document}

Which results in:

[Image: fancy section and its capitalised version]
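
If several such pairs are needed, the two definitions can be wrapped in one helper. This is only a sketch: \defname is a hypothetical name chosen for illustration and is not provided by mfirstuc, whereas \makefirstuc is the package's own command:

\documentclass{article}
\usepackage{mfirstuc}
% \defname (hypothetical helper, not part of mfirstuc):
% #1 = lower-case macro, #2 = capitalised macro, #3 = the text
\newcommand\defname[3]{%
  \newcommand#1{#3}%
  \newcommand#2{\makefirstuc{#3}}}
\defname\secname\Secname{fancy section}
\defname\agt\Agt{the agent}
\begin{document}
Here's a new \secname{}. \Secname{} titles are good. \Agt{} agrees.
\end{document}

The text is typed once per pair, so the two variants cannot drift apart when it is edited later.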

score 2 · Answer 5 · edited Jun 10 '20 at 12:32

Here's a new answer, which is inspired in part by the recent query "A macro that behaves differently at start of sentence", which got closed as a duplicate of the present query (which I hadn't been aware of until today).

Suppose that the author has defined two macros called \agt, which outputs the string "the agent", and \Agt, which outputs "The agent" (with a capital "T"). In addition, let's suppose that the author would like \agt to behave like \Agt automatically if -- and only if -- \agt occurs at the start of a sentence. (For some reason, the author cannot manually replace \agt with \Agt as needed...)

Assuming we may safely dismiss cases such as Mr. \agt or Mrs. \agt as not likely to ever occur in a real document, the following LuaLaTeX-based solution may be of interest. It makes no assumptions about the \sfcode-status of various potential sentence-ending punctuation marks, and it works regardless of whether \frenchspacing or \nonfrenchspacing is in effect. Its working assumption is that \agt should be changed to \Agt under either of the following two circumstances:

\agt occurs at the start of an input line. Whitespace between the start of the line and \agt is OK. Or,
\agt is preceded by ?, !, or . (the three main sentence-ending punctuation characters in the English language), followed by one or more whitespace characters. (Thus, cases such as .\agt and ?\agt are not modified.)

Naturally, if the language your document is written in features different sentence-ending punctuation characters, feel free to modify the list of characters enclosed by [...] in the second gsub statement shown below.

Aside: I wrote this answer assuming that people use an editor that "reflows" paragraphs by inserting "soft" line breaks. (That's the behavior of all editing software I employ...) However, as @jfbu has pointed out in a comment, other editors (including Emacs, apparently -- sorry, I don't use Emacs personally) use a different approach, viz., they reflow paragraphs by inserting "hard" line breaks, in effect storing a paragraph as a sequence of separate strings. Such an approach clearly raises the risk that some instances of \agt will occur at the start of a line which, in turn, makes the proposed method much less attractive. Sigh. I'm afraid I don't know what advice to give.

Here, then, is an MWE.

% !TEX TS-program = lualatex
\documentclass{article}
\def\agt{the agent}
\def\Agt{The agent}
\usepackage{luacode}
\begin{luacode}
function agt_to_Agt ( s )
  s = s:gsub ( "^%s-\\agt", "\\Agt" )
  return s:gsub ( "([?!.])%s+\\agt" , "%1 \\Agt" )
end
\end{luacode}
\AtBeginDocument{\directlua{luatexbase.add_to_callback 
  ( "process_input_buffer" , agt_to_Agt , "agt_to_Agt" )}}
\begin{document}
\noindent
 \agt\ says x. \agt\ shouts Y!  \agt\ asks z? \agt\ whispers something 
to \agt.  \agt\ says the phrase,  ``\agt''.
\end{document}
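
For sources with hard line breaks, one conceivable variant is to remember whether the previous input line ended a sentence and to capitalise a line-initial \agt only in that case. The following is a sketch under stated assumptions, not a robust solution: the function name agt_to_Agt_hardwrap and the variable sentence_start are made up for illustration, a blank line is taken to start a new sentence, and lines that end in a comment, a closing brace, or a command are not handled.

% !TEX TS-program = lualatex
\documentclass{article}
\def\agt{the agent}
\def\Agt{The agent}
\usepackage{luacode}
\begin{luacode*}
-- hypothetical state: does the next input line begin a new sentence?
local sentence_start = true
function agt_to_Agt_hardwrap ( s )
  -- a line-initial \agt is capitalised only after a sentence end
  if sentence_start then
    s = s:gsub ( "^(%s*)\\agt", "%1\\Agt" )
  end
  s = s:gsub ( "([?!.])%s+\\agt", "%1 \\Agt" )
  -- a blank line or a trailing ?, ! or . means the next line starts a sentence
  sentence_start = ( s:find ( "^%s*$" ) ~= nil )
                or ( s:find ( "[?!.]%s*$" ) ~= nil )
  return s
end
\end{luacode*}
\AtBeginDocument{\directlua{luatexbase.add_to_callback
  ( "process_input_buffer" , agt_to_Agt_hardwrap , "agt_to_Agt_hardwrap" )}}
\begin{document}

\agt\ says x. Then
\agt\ shouts Y!
\agt\ asks z? Quietly,
\agt\ whispers.
\end{document}

In this input the second and fourth \agt begin a line but continue a sentence, so they are meant to stay in lower case, while the first and third follow a blank line or a sentence-ending mark.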

does this mean that if inside a paragraph \agt falls by chance at start of a line it will be subjected to the transform? this might be problematic when e.g. using M-q to re-flow a paragraph in an Emacs buffer? — , Nov 12 '17 at 12:08
@jfbu - I guess the answer depends on how the re-flowing is done. I must confess to being unfamiliar with Emacs; tsk, tsk. In my editor, "reflowing a paragraph" means converting it into a single, long string, while inserting "soft" line-breaks for on-screen display purposes only. That's the model I had in mind when I came up with the answer above. How common, or wide-spread, is the practice of inserting hard line-breaks (and hence creating separate strings)? — Mico, Nov 12 '17 at 12:19
I think there are three schools: 1) hard line-breaks every 80 chars or so; that's my case, hence I often use the Emacs fill-paragraph function which is bound to M-q (in AUCTeX LaTeX-fill-paragraph) 2) no hard breaks at all, typically, in entire paragraphs 3) a hard-break after each dot ending a sentence. In Emacs, a line longer than the window width will be reflowed for displaying but nothing is changed to actual source. Nevertheless I prefer by far hard wraps (M-q helps me reformat whenever I make additions) also for reasons related to usage of git. (I don't use latexdiff) — , Nov 12 '17 at 13:54
but to reply, I have no idea about respective percentages for the 1), 2), and 3) approaches from above comment. — , Nov 12 '17 at 13:57
@jfbu - Thanks. I'll edit my answer to put in a remark that the proposed approach may be problematic (unusable?!) if one uses an editor that introduces "hard" line breaks while reflowing a paragraph. — Mico, Nov 12 '17 at 14:24
just to clarify: it is not Emacs itself which introduces hard line breaks; it is the Emacs user like me. Emacs does soft reflowing of overlong lines (without however introducing explicit zerowidth linebreak joiners), in its display. Simply, some people don't like that, and prefer to have hardcoded short lines. For once, I consider it more useful for git although because I do M-q on each edit to a paragraph, typically git will find two or three lines have changed. But they are shorter and I can compare them better. Again, Emacs does not hard format, it does only on user's demand. — , Nov 12 '17 at 15:46
@jfbu - Many thanks for this clarification. As one can tell rather easily, I am not an Emacs expert (nor would I ever dream to claim to be one...). — Mico, Nov 12 '17 at 18:41
