How to make a list of all words in LaTeX?

Question

Is there a way to make a list of all the words used in a LaTeX document? Alternatively, another way of doing it would also be helpful, e.g. using Python, a website, or something else.

Here is an example of what I would like:

\documentclass{article}
\begin{document}
I have a dog and a cat.
The dog and the cat are named Bob and John.
\end{document} % Should maybe be after the list
list:
I 
have
a 
dog 
and 
cat
the 
are 
named 
bob 
john

The order of the words in the list does not matter. And thank you if you can help.

@DavidCarlisle I'm not totally sure what you mean. I typed the words in the list myself, but of course I want it to be done automatically. — Kevin, Feb 01 '22 at 15:21
I mean what is a "word" here, if for example you extract text out of the pdf after adding that math you would get the words x, y and log added, is that OK? — David Carlisle, Feb 01 '22 at 15:21
The first problem is processing everything twice; once to parse the words and a second to generate the document. \everypar comes to mind, but has the problem that it doesn't really know when the paragraph ends. Second is that words can be separated by both spaces and punctuation. See also the self balancing binary tree in https://tex.stackexchange.com/questions/273037/expandable-quick-sort-array-macro/273476?r=SearchResults&s=1|18.3083#273476 — John Kormylo, Feb 01 '22 at 15:23
@DavidCarlisle I see what you mean. I would prefer that the list doesn't include math, but it wouldn't be much of a problem if it did, and in that case how the math looks isn't important, since I am not going to use it anyway. — Kevin, Feb 01 '22 at 15:34
@DavidCarlisle Oh, that was just a mistake, I have added 'a' now — Kevin, Feb 01 '22 at 15:37

score 10 · Accepted Answer · answered Feb 01 '22 at 15:34

For some definition of "word" and "being used", you can extract the text from the PDF and process it into a list.

pdflatex file1
pdftotext file1.pdf

will produce file1.txt (the final 1 is the page number):

I have a dog and a cat. The dog and the cat are named Bob and John.
1

You can then process this with standard Linux utilities (these are also available on Windows if needed; I am actually using the Cygwin versions on Windows):

cat file1.txt | tr '[:space:][,.]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq

Produces the list:

1
a
and
are
bob
cat
dog
have
i
john
named
the

Step by step, the pipeline does the following:

replace white space and punctuation by newline
lowercase the resulting words
sort alphabetically
remove duplicates.
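
Since the question also mentions Python, the same four steps can be sketched there. This is only a minimal sketch and not part of the original answer: the function name word_list and the sample string are assumptions for illustration, and it splits only on whitespace, commas and periods, roughly mirroring the tr call.

```python
import re


def word_list(text):
    """Return the sorted, de-duplicated, lowercased words in text.

    Hypothetical helper, named here for illustration only.
    """
    # Step 1: split on runs of whitespace, commas and periods
    tokens = re.split(r"[\s,.]+", text)
    # Step 2: lowercase each token, skipping the empty strings that
    # re.split leaves at the start or end of the input
    # Steps 3 and 4: a set removes duplicates, sorted() orders the result
    return sorted({token.lower() for token in tokens if token})


# Sample input copied from the pdftotext output shown in the answer
sample = "I have a dog and a cat. The dog and the cat are named Bob and John.\n1\n"

for word in word_list(sample):
    print(word)
```

To run it on a real document, read file1.txt (the pdftotext output) into a string and pass that string to word_list instead of the sample. As with the shell pipeline, the page number ends up in the list; filtering tokens with str.isalpha() would be one way to drop it.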
