
I have a thesis of approximately 100,000 words, typeset in LaTeX. I have hyphenated some of the words rather inconsistently, for example "spider-fear" and "spider fear".

I would like to get a list of all hyphenated words in the tex files, each with a count, and also a count of how many times the unhyphenated version of each one appears.

Presumably using a tool like awk, grep or sed?

1 Answer


You can do this by means of a spiffy Perl program, texcount.pl, which you can download from this Web page. This program counts words in TeX documents (or letters, or mathematical formulas, ...), a non-trivial task given the presence of TeX-specific keywords which are to be excluded from the count. The program has a number of features and options (which I have never used, however), but the one you need is:

   texcount.pl -freq myfile.tex

which will print the full list of words used (to standard output) with their frequency of appearance. You can then easily parse this to see where you have used hyphenated or non-hyphenated combinations. Note that the program can easily handle multi-file projects, where sections, appendices, bibliography and so on are stored in different files. It will not, however (or at least, AFAIK), point to the precise location of the words: you will have to hunt them down one by one.
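
If you would rather get both counts in one go, something along these lines might do. This is only a minimal Python sketch, not a feature of texcount: hyphenation_report is a name I made up, it works on the raw text, and unlike texcount it does not strip TeX commands, so hyphenated \cite keys and labels will show up in the list as well. It lower-cases everything and treats a line break between two words as an ordinary space.

```python
import re
from collections import Counter

# Two or more alphabetic parts joined by single hyphens, e.g. "spider-fear".
# A TeX en dash ("--") is not matched, because a letter must follow each hyphen.
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def hyphenation_report(text):
    """Return {hyphenated word: (hyphenated count, spaced count)}."""
    # Collapse every whitespace run (newlines included) to a single space,
    # so that "spider\nfear" is counted as the spaced variant too.
    flat = re.sub(r"\s+", " ", text.lower())
    hyphen_counts = Counter(HYPHENATED.findall(flat))
    report = {}
    for word, count in hyphen_counts.items():
        spaced = word.replace("-", " ")
        spaced_re = re.compile(r"\b" + re.escape(spaced) + r"\b")
        report[word] = (count, len(spaced_re.findall(flat)))
    return report
```

Call it on the concatenated contents of your tex files; each entry maps a hyphenated word to a pair of counts, hyphenated first and spaced second. Only words that occur hyphenated at least once are listed, which is why the plain counts for spider and fear on their own do not get mixed in.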

Edit:

A quick but partial solution to finding all occurrences of the non-hyphenated expressions is the following:

  grep -n 'spider *fear' file.tex

which searches for the two words separated by zero or more spaces (the * symbol), and prints the line number (the -n option) of each occurrence. This is fast, but it is incomplete: grep works line by line, so it cannot locate the expression spider fear whenever it is split across two or more lines. Since for arbitrary expressions a break can occur even within words, finding these occurrences will require a tad more work than I am willing to do.

Edit 2:

Another bit of the solution is the following:

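   grep -n -A 1 'spider *$' filename | grep '^[0-9]*- *fear'

This searches for all lines which end with spider followed by an unspecified number of spaces, prints each of them together with the line after it (the -A 1 option), and then keeps only those following lines which begin with an unspecified number of spaces and then the word fear. The -n option has to go on the first grep: there it prefixes each following line with its line number in the file and a dash, which is what the second pattern matches. On the second grep, -n would only number the lines of the piped output, not those of the file.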
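
The two cases (same line, and split over a line break) can also be handled in a single pass if the file is read as one string. The following is a rough Python sketch under that assumption; find_phrase is a made-up name. Note two differences from the grep commands: it requires at least one whitespace character between the words, and since \s+ also matches blank lines, it will report a pair that straddles a paragraph break.

```python
import re


def find_phrase(text, first, second):
    """Return the 1-based line numbers where `first` + whitespace + `second` starts."""
    # \s+ matches spaces, tabs and newlines alike, so a line break between
    # the two words is treated like an ordinary space.
    pattern = re.compile(
        r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b",
        re.IGNORECASE,
    )
    # The line number of a match is one more than the newlines before it.
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]
```

The reported number is the line on which the first word sits, and the search ignores case, like grep -i.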

Keep in mind that all of the previous searches are case-sensitive, so as written they find lower-case expressions only. If you wish to include capitals, just substitute grep -i for grep.

The only part that is still missing is the case where a word itself is broken across lines, as in

    spi
    der
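
For this last case, one possible approach is to build a pattern that tolerates a line break between any two letters of the word. This is a sketch only, in Python, with invented names (BREAK, broken_word_pattern, find_broken). It deliberately uses no word boundaries, so it also matches inside longer words such as spiders, and it cannot tell a genuinely broken word from two short words that happen to sit at a line end and a line start.

```python
import re

# An optional line break, with any blanks around it, between two letters.
BREAK = r"(?:[ \t]*\n[ \t]*)?"


def broken_word_pattern(word):
    """Compile a regex matching `word` even when it is split over lines."""
    return re.compile(BREAK.join(re.escape(ch) for ch in word), re.IGNORECASE)


def find_broken(text, word):
    """Return (line number, matched text) for every occurrence of `word`."""
    pattern = broken_word_pattern(word)
    return [
        (text.count("\n", 0, m.start()) + 1, m.group(0))
        for m in pattern.finditer(text)
    ]
```

Matches whose text contains a newline are the broken ones; the others are ordinary occurrences on a single line.
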
MariusMatutiae
  • That's interesting. I already use texcount for wordcount, but didn't realise it could give me frequency information. – Frank Zafka Dec 09 '13 at 19:23
  • It certainly fulfils the first part of my requirements, as I now have frequency information for all words (hyphenated or not) in my document. It is not clear how to find the same information for the nonhyphenated versions. For example: fear = 89. spider-fear = 93. spider = 114.

    Not all occurrences of spider and fear are unhyphenated versions of spider-fear.

    – Frank Zafka Dec 09 '13 at 19:31
  • Interesting addition. – Frank Zafka Dec 10 '13 at 10:34
  • @Frank_Zafka I have added another bit. – MariusMatutiae Dec 13 '13 at 10:26
  • Thinking about it, what is a new line? Since I am writing in LaTeX, aren't all paragraphs just one long line? Paragraphs are denoted by leaving a blank line. In that case none of the words should hang over the line as in your example. But then again I am more of a LaTeX person than a grep person. I am a fellow archer though...thanks for the further input. – Frank Zafka Dec 13 '13 at 10:51
  • @Frank_Zafka Arch rules. As for the newlines, there are bound to be some. It is true some people have one only when they start a new paragraph (not my case, though), and it is true most people don't break words any longer. Still, a good programming job should take care of all such occurrences. – MariusMatutiae Dec 13 '13 at 11:02