21

Is there a way to tell TeX to avoid breaking the page after the first word of a sentence?

...
...
He was quite dead. Apparently his neck had been
broken. The lightning flashed for a third time, and
his face leaped upon me. I sprang to my feet. It
(text continues on next page)

And then you have to turn the page for the rest of the sentence. It's not on a line of its own, so it can't be penalized like an orphan line.

Can TeX be told to resolve these cases by, say, breaking the page before that first word?

Note:

Although the answers given below are very informative, the general consensus has been that the best practice is to leave this for the proofreaders to spot and then fix manually.

Gambhiro
    Despite the fact that TeX can be unhelpful at times, when I find myself asking how to make it do something non-default with respect to page/line breaking, I also ask if that is something I really want (and if so, how badly do I want it). Imagine if in your example "it" were instead a long word. TeX would then be left to distribute a large empty space left by that word among the rest of the line. A computer algorithm can't hope to satisfy something so subjective as an aesthetic judgment. But it gets close, and sometimes a human has to correct the preprint, e.g. by redistributing page breaks. – Aaron Mar 24 '11 at 05:23
  • @Aaron: Yes, I tend to think that this is not TeX's job, but a proofreader's. – Gambhiro Mar 24 '11 at 08:18

4 Answers

12

EDIT: I forgot to mention that although this whole answer works in simple cases, it is a bad idea to rely on it for anything serious, since it can break in many different ways. Typically, catcode changes are a bad idea...

EDIT: Lev Bishop pointed out that inserting \nopagebreak after each first word of a sentence is too much, because it forbids a page break after every line that contains the first word of a sentence. Here, I fixed this problem by using the auxiliary file and checking the page number on both sides of the space that follows the first word of the sentence.

It is also possible to make ., !, ? active, let them read the next word and place \nopagebreak after the first word of each sentence (except the first one of a paragraph).

Things are more complicated if we still want to use . in dimensions (e.g., width=3.4cm in \includegraphics). Also, the last punctuation mark of a paragraph needs special treatment, in particular when the paragraph does not quite end with that punctuation mark (e.g., when it is followed by closing quotes).

Hopefully, the code below works. Currently, I've inserted * after \nopagebreak, just to visualize the places where a \nopagebreak is inserted. Of course, remove it.

\documentclass[a5paper]{article}


\makeatletter
% \begin{macro}{\eos@text}
% The code below inserts "\eos@text" each time a space following the
% first word of the sentence falls on the separation between two pages.
%    \begin{macrocode}
\newcommand{\eos@text}{\nopagebreak[4]*}
%    \end{macrocode}
% \end{macro}
% 
% \begin{macro}{\eos@active,\eos@active@text}
%   
%   "#1" is the character (".", "!", "?") that ended the sentence.
%   We distinguish various cases depending on the following non-space 
%   character, "#2". In every case, we start by putting the
%   punctuation "#1" back.
%   
%   If "#2" is a digit, we assume that we are in the middle of a
%   number such as "width=5.3em" in, say, "\includegraphics".
%   (This is only relevant for ".", though.)
%   
%   If "#2" is "\par", that means that the punctuation is the last 
%   one in the paragraph, so we can safely do nothing.
%   
%   If "#2" is a quote, we need to treat things differently. (Here
%   we actually pretend that the quote is in fact the end-of-sentence.)
%   
%   Finally, in every other case, we grab the first word and place
%   a non-breakable space afterwards.
%   
%   In each case, we put back what directly followed the punctuation
%   right after our test.
% 
%   \begin{macrocode}
\newcommand{\eos@active}[2]{%
  #1%
  \ifnum9<1#2\space 
  \else
    \ifx\par#2%
    \else
      \ifx'#2%
        \expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\expandafter\eos@active
      \else
        \expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\expandafter\eos@active@text
      \fi
    \fi
  \fi
  #2%
}
%    \end{macrocode}
%    Grabbing the following word: the first "\newcommand" checks that
%    the command is not already defined. Then we define it through "\def"
%    because its argument is a bit more complicated than usual, delimited
%    by a space. Also to note is the initial space (before "#1"): that 
%    was lost in our test, and we put it back.
%    
%    Earlier, we were putting a "\nopagebreak" after that first word,
%    but now, we do something more tricky, only putting a "\nopagebreak"
%    if at the previous run of LaTeX there was a page break there.
%    
%    \begin{macrocode}
\newcommand{\eos@active@text}{}
\def\eos@active@text#1 { #1\eos@space}
%    \end{macrocode}
% \end{macro}

% \begin{macro}{\eos@space}
%   As Lev Bishop mentions, putting "\nopagebreak" forbids a page break
%   after the current line. So we don't want to always insert a
%   "\nopagebreak"! The test \emph{is} crazy\ldots Too lazy to explain the 
%   details. "\count0" is the page number, "\write" rather than 
%   "\immediate\write" in order to get the page number when typeset
%   rather than when read. "\csname eos@mark@\the\count0\endcsname"
%   creates a control sequence (equal to relax) corresponding to the
%   page number. And the test "\eos@pagetest" checks whether the
%   control sequence corresponding to the page \emph{after} the space
%   is already defined. If it is, we write something to the aux file.
%   
%   
%   \begin{macrocode}
\newcount\eos@current
\newcount\eos@pageno
\newcommand{\eos@space}{%
  \advance\eos@current by\@ne
  \write\@mainaux{\relax
    \expandafter\@gobble\csname eos@mark@\the\count0\endcsname}%
  \csname eos@\romannumeral\eos@current\endcsname
  \space
  \write\expandafter\@mainaux\expandafter{%
    \expandafter\eos@pagetest\expandafter{\romannumeral\eos@current}\relax}%
}
%   \end{macrocode}
% \end{macro}
% 
% \begin{macro}{\eos@pagetest}
%   If the page number is a brand new page number (i.e. if 
%   "\csname eos@mark@\the\count0\endcsname" is not yet defined),
%   we write something to the aux file. Otherwise, we don't do anything.
%    \begin{macrocode}
\newcommand{\eos@pagetest}[1]{%
  \unless\ifcsname eos@mark@\the\count0\endcsname
  \noexpand\eos@rewrite{\gdef\csname eos@#1\endcsname{\noexpand\eos@text}}%
  \fi
}%
\newcommand{\eos@rewrite}[1]{#1%
  \ifx\usepackage\documentclass
  \expandafter\@gobble
  \else
  \expandafter\AtBeginDocument
  \fi
  {\immediate\write\@mainaux{\unexpanded{\eos@rewrite{#1}}}}%
}
%    \end{macrocode}
%    "\eos@rewrite" is meant for use in the aux file, and 
%    rewrites itself to the aux file. The test is very bad, 
%    checks whether we are reading the aux file at the start
%    or the end of the document (any better test?).
%
%    If we didn't rewrite, a space that changes page would lead
%    to inserting "\nopagebreak[4]", but at the next run that would
%    prevent a page break. Then the space would no longer be at the
%    change of a page. So it would not insert "\nopagebreak[4]" for
%    the next run. Thus, in the next run, the space would (probably)
%    be at the change of pages again, etc. 
%    
%    So we make that "\nopagebreak" resilient. If you need to reset 
%    all of this, just delete the .aux file.
% \end{macro}
%  

% \begin{macro}{\activate@eos}
%     It's better to make ".", "!", "?" active at "\begin{document}".
%     For that we define "\activate@eos" which makes its 
%     argument active, and defines it to be an end-of-sentence ("eos").
%     \begin{macrocode}
\newcommand{\activate@eos}[1]{%
  \begingroup
  \lccode`\~`#1\space
  \lowercase{%
    \endgroup
    \catcode`#1=13\relax
    \newcommand{~}{\eos@active{#1}}%
  }%
}
\AtBeginDocument{%
  \activate@eos{.}%
  \activate@eos{!}%
  \activate@eos{?}%
}
%     \end{macrocode}
%   Am I missing a possible end-of-sentence marker?
% \end{macro}
\makeatother


% ==========================================================
% Just for demonstration
\usepackage[text={5cm,36pt}]{geometry}


\begin{document}

% We repeat the text until it fills 10 pages.
\loop\ifnum\count0<10\relax 
Greetings. He will. Will he? No, he won't. Maybe not, anyways. Although, perhaps. And that changes. Constantly. Is it worth? It really is not. Short sentences, why? To test better. Make sure it works!

I'm lazy. So copy. And paste. Repeating the same. Many times. Of course! Just a bit more.

\repeat

\end{document}
  • This is scarily brilliant (and well-commented to boot). I'd hesitate to rely on it myself because of the dangers inherent in fiddling with catcodes (especially as one adds more foreign code to a document), but chapeau all the same. – Aaron Mar 24 '11 at 05:29
  • This is really neat, thanks for the extensive comments as well! – Gambhiro Mar 24 '11 at 07:59
  • @Aaron: thank you. I agree that I wouldn't use it myself. See the edit to answer Lev's answer. – Bruno Le Floch Mar 24 '11 at 18:51
  • I do agree with other posters: leave this to proof-readers. – Bruno Le Floch Mar 24 '11 at 18:51
  • You should wrap the aux-writing in \if@filesw and write to \@auxout. The \eos@rewrite stuff seems clumsy. Try copying how latex itself does \labels (you could even simply use two \label and check if the \pageref is unchanged). But otherwise, this is certainly a lot closer to what is needed. (Still can fail if TeX inserts one of its other penalties, though). – Lev Bishop Mar 24 '11 at 19:29
  • OK, having thought about it some more and reading your commented code, I understand the need for the rewrite stuff. – Lev Bishop Mar 24 '11 at 19:33
  • @Lev Bishop: Unfortunately, the issue of other penalties cannot be avoided. Anyone wants to re-code TeX ;-)? Thanks for your other comments, I'll try to take some time later this week. – Bruno Le Floch Mar 24 '11 at 20:22
  • @Bruno: Your \lowercase approach looks ingenious, but can't one simply do it along the lines of \def\period{.}\catcode`\.=\active\newcommand.[1]{\period...? – Hendrik Vogt Mar 25 '11 at 15:06
  • @Hendrik: FYI, that \lowercase trick is quite standard. The point is that I want to make . etc active at \begin{document} (otherwise some packages break). And as you know, catcodes cannot be changed inside an argument. – Bruno Le Floch Mar 25 '11 at 15:13
  • @Bruno: I know that it's a usual trick, but I'm still impressed whenever I see it. It's just not something I've used myself. Thanks for the explanation about \begin{document}. I read about that in your code, but forgot again when thinking about the short version I proposed above. – Hendrik Vogt Mar 25 '11 at 15:30
  • @Hendrik: a variant closer to what you say is to make . active and define it as you do, then restore it, and \AtBeginDocument{\catcode`\.=12\relax}. – Bruno Le Floch Mar 25 '11 at 21:41
  • @Bruno: You mean 13, right? But whatever method, maybe one just shouldn't do it. It does still have the chance of breaking other packages, doesn't it? (I can only think of delimited macros at the moment; is there more?) – Hendrik Vogt Mar 25 '11 at 22:12
  • @Hendrik: 13, yes. Even without packages, that code will break if the last sentence of a paragraph is one word long, or if you try to do \hbox{Some text.}, because . will look ahead to grab the first word... I'll add "don't do it" to the answer. – Bruno Le Floch Mar 25 '11 at 23:18
  • @Bruno: The "one word long" problem should have an easy solution, but I didn't think of \hbox & friends - thanks. And one can't solve this with \futurelet either as I slowly understood when seeing your answer. I first wondered why you make it fully expandable: It's for making it work in dimensions, right? I still find that aspect of your code very neat because I encountered the same problem in this answer. – Hendrik Vogt Mar 26 '11 at 08:19
  • @Hendrik: indeed, to work inside dimensions, . has to be expandable. So no \futurelet. But to test whether we are in a dimension or not, the only way I see is to grab an undelimited argument. That loses a leading space. And I put it back by hand. But if quotes closed at that place, there should be no space, hence a special treatment is needed. I guess I probably missed quite a few other cases. Besides, all this should only be put near the end, because the non-page-breakable spaces accumulate in the auxiliary file. – Bruno Le Floch Mar 26 '11 at 11:14
  • cont. On this last issue, perhaps it would be possible to test if a space breaks to the next line when TeX has broken the paragraph into lines, but not yet the text into pages? Something close to when the lineno package acts. – Bruno Le Floch Mar 26 '11 at 11:23
  • @Bruno: Thanks for the confirmation about \futurelet. Sorry, the other things are becoming too difficult for me now, but I enjoyed the enlightening discussion. – Hendrik Vogt Mar 26 '11 at 16:48
9

(This is not really an answer, but it is much too long for a comment).

There are several answers already given that use various methods to insert \nopagebreak after the first word of a sentence. Unfortunately, none of these will work very well, because they will both inhibit too many pagebreaks and simultaneously not inhibit all the desired pagebreaks.

To see that the former is true, we need to realize that, inside a paragraph, \nopagebreak is effectively equivalent to \vadjust{\penalty10000}. In other words, it will attempt to prohibit any page break following a line containing the first word of a sentence (i.e., not just after those lines where the first word of a sentence is the last word of the line).
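As a rough sketch of that equivalence (the sentence is borrowed from the question; the two paragraphs are only meant to be compared, not used as a fix):

```latex
\documentclass{article}
\begin{document}
% Inside a paragraph, \nopagebreak does not act at the point where it is typed.
% LaTeX wraps the penalty in \vadjust, so it is placed after whichever line
% ends up containing this point once the paragraph has been broken into lines.
He was quite dead. Apparently\nopagebreak[4] his neck had been broken.

% Roughly the same thing written with primitives:
He was quite dead. Apparently\vadjust{\penalty10000} his neck had been broken.
\end{document}
```

In both paragraphs the penalty discourages a page break after the line holding "Apparently", wherever in that line the word happens to fall.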

To see that the latter is true, we need to realize that TeX will insert various penalties into the vertical list by itself under specific circumstances. A probably incomplete list: \interlinepenalty, \clubpenalty, \widowpenalty, \brokenpenalty, \displaywidowpenalty, \predisplaypenalty, \postdisplaypenalty. If one of these circumstances comes up, and the corresponding penalty is less than 10000, then there still might be a break (when there are 2 or more consecutive penalties, TeX is allowed to break at any one of them that it finds convenient).
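To see which of these penalties are below 10000 in a given setup, one can simply print their current values. This is only a diagnostic sketch; the values depend on the class and packages in use, so none are asserted here:

```latex
\documentclass{article}
\begin{document}
% Print the penalties that TeX inserts into the vertical list by itself.
% Any value below 10000 leaves a legal breakpoint next to a \nopagebreak.
\noindent
interlinepenalty: \the\interlinepenalty\\
clubpenalty: \the\clubpenalty\\
widowpenalty: \the\widowpenalty\\
brokenpenalty: \the\brokenpenalty\\
displaywidowpenalty: \the\displaywidowpenalty\\
predisplaypenalty: \the\predisplaypenalty\\
postdisplaypenalty: \the\postdisplaypenalty
\end{document}
```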

It is likely possible to achieve what you want using a special marker value for the penalty that is then treated specially by the output routine (similar to how plain TeX's \supereject inserts a penalty of -20000 that is treated specially by its output routine). It can probably also be done using LuaTeX, but the solutions given so far will not do what you want (at least, not reliably).

Lev Bishop
  • Yes, Bruno's answer for example is impressive, but I keep feeling that in the end this issue should be left for the proofreaders to spot, and then manually fix. – Gambhiro Mar 24 '11 at 08:15
  • I partially fixed that: now the \nopagebreak is inserted if at any previous run the two sides of the space are on different pages. – Bruno Le Floch Mar 24 '11 at 18:53
6

A possible, though non-TeX, solution is to run sed on the source text like this:

sed 's/\([\!?\."'"'"'] \+\)\([^ ]\+\) \+/\1\2\\nopagebreak\\ /g' file.tex > file.sed.tex

It's a bit of an insecure trick, but it will turn

"I was walking through the roads to clear my brain," he said. "And suddenly--fire, earthquake, death!" He relapsed into silence, with his chin now sunken almost to his knees. Presently he began waving his hand. "All the work--all the Sunday schools--What have we done--what has Weybridge done?

Into:

"I was walking through the roads to clear my brain," he\nopagebreak\ said. "And\nopagebreak\ suddenly--fire, earthquake, death!" He\nopagebreak\ relapsed into silence, with his chin now sunken almost to his knees. Presently\nopagebreak\ he began waving his hand. "All\nopagebreak\ the work--all the Sunday schools--What have we done--what has Weybridge done?

(It doesn't change the first "I was..." because the line (and the paragraph) begins there.)

Gambhiro
5

I have never heard of an automatic feature that does this. You can of course use ~ (an unbreakable space) between the first and second words of a particular sentence, or of every sentence. Then (La)TeX will never break the line at this point.

It would be better to add \nopagebreak between these words. However, doing so in every sentence is much more obstructive than ~.
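A minimal sketch of both manual variants, applied to a single sentence (the text is taken from the sample in the sed answer; which sentence to patch is for the proofreader to decide):

```latex
\documentclass{article}
\begin{document}
% Variant 1: a tie. "He" and "relapsed" always stay on the same line,
% so "He" can never be stranded at the end of a line or page.
"And suddenly--fire, earthquake, death!" He~relapsed into silence,
with his chin now sunken almost to his knees.

% Variant 2: \nopagebreak. A line break after "He" is still allowed, but a
% page break after the line containing "He" is discouraged. The "\ " puts
% back the space that would otherwise be swallowed after the control word.
"And suddenly--fire, earthquake, death!" He\nopagebreak\ relapsed into
silence, with his chin now sunken almost to his knees.
\end{document}
```

The tie changes line breaking in every paragraph where it is used, while \nopagebreak only affects page breaking, which is why it is better kept for the few spots a proofreader has flagged.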

Martin Scharrer