Is there a way to make a list of all the words that are used in a LaTeX document? If someone knows another way to do it, that would also be helpful, e.g. using Python, a website, or something else.

Here is an example of what I would like:

\documentclass{article}
\begin{document}
I have a dog and a cat.
The dog and the cat are named Bob and John.
\end{document} % Should maybe be after the list

list: I have a dog and cat the are named bob john

The order of the words in the list does not matter. And thank you if you can help.

Kevin
  • 1
    does the list of words change if you add $y = \log x$ ? – David Carlisle Feb 01 '22 at 15:17
  • @DavidCarlisle I'm not totally sure what you mean. I typed the words in the list myself, but of course I want it to be done automatically. – Kevin Feb 01 '22 at 15:21
  • 3
    I mean, what is a "word" here? If, for example, you extract text from the PDF after adding that math, you would get the words x, y and log added. Is that OK? – David Carlisle Feb 01 '22 at 15:21
  • 1
    The first problem is processing everything twice; once to parse the words and a second to generate the document. \everypar comes to mind, but has the problem that it doesn't really know when the paragraph ends. Second is that words can be separated by both spaces and punctuation. See also the self balancing binary tree in https://tex.stackexchange.com/questions/273037/expandable-quick-sort-array-macro/273476?r=SearchResults&s=1|18.3083#273476 – John Kormylo Feb 01 '22 at 15:23
  • 2
    why is a not in your list? – David Carlisle Feb 01 '22 at 15:23
  • @DavidCarlisle I see what you mean. I would prefer that the list doesn't include math, but it wouldn't be much of a problem if it did. In that case, how the math looks isn't important, since I am not going to use it anyway. – Kevin Feb 01 '22 at 15:34
  • @DavidCarlisle Oh, that was just a mistake, I have added 'a' now – Kevin Feb 01 '22 at 15:37

1 Answer

For some definition of "word" and "being used", you can extract the text from the PDF and process it into a list.

pdflatex file1
pdftotext file1.pdf

will produce file1.txt, which contains the text below (the final 1 is the page number):

I have a dog and a cat. The dog and the cat are named Bob and John.

1

You can process this with standard Linux utilities (also available on Windows if needed; I am actually using the Cygwin versions on Windows):

cat file1.txt | tr '[:space:][,.]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq

Produces the list:

1
a
and
are
bob
cat
dog
have
i
john
named
the

After cat reads the file, each stage of the pipeline does the following:

  • replace white space and punctuation by newline
  • lowercase the resulting words
  • sort alphabetically
  • remove duplicates.
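
Since the question also mentions Python, the same four steps could be sketched roughly as below. This is only an illustration, not part of the pipeline above: the function name word_list and the sample string are made up for the example, and it assumes the text has already been extracted (for instance by pdftotext). Unlike the tr pipeline, it drops empty tokens instead of emitting a blank line.

```python
import re


def word_list(text):
    """Return the sorted, lowercased, de-duplicated words found in text."""
    # Split on whitespace, commas, periods and square brackets,
    # mirroring the characters the first tr stage turns into newlines.
    tokens = re.split(r"[\s,.\[\]]+", text)
    # Lowercase each token, drop empty strings, and de-duplicate via a set.
    unique_words = {token.lower() for token in tokens if token}
    # Sort alphabetically, like the sort | uniq stages.
    return sorted(unique_words)


# Hypothetical sample: the contents of file1.txt, including the page number.
sample = "I have a dog and a cat. The dog and the cat are named Bob and John.\n\n1\n"
print("\n".join(word_list(sample)))
```

For a real document you would pass in the contents of the extracted text file instead, e.g. word_list(open("file1.txt", encoding="utf-8").read()).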
David Carlisle