How to generate an automatic index (concordance) in a large file?

Question

I'd like to generate an index from a large input file (1189 chapters, 31,171 verses/sentences)

and, if possible, to set

a minimum word length
specific words to exclude

input file: direct download link txt file

original download link compressed folder

1 Mose 1
1 ¶ Im Anfang schuf Gott den Himmel und die Erde.
2  Und die Erde war wüst und leer, und es lag Finsternis auf der Tiefe, und der Geist Gottes schwebte über den Wassern.
3 ¶ Und Gott sprach: Es werde Licht! Und es ward Licht.
4  Und Gott sah, daß das Licht gut war; da schied Gott das Licht von der Finsternis;
5  und Gott nannte das Licht Tag, und die Finsternis Nacht. Und es ward Abend, und es ward Morgen: der erste Tag.
6 ¶ Und Gott sprach: Es soll eine Feste entstehen inmitten der Wasser, die bilde eine Scheidewand zwischen den Gewässern!
7  Und Gott machte die Feste und schied das Wasser unter der Feste von dem Wasser über der Feste, daß es so ward.
8  Und Gott nannte die Feste Himmel. Und es ward Abend, und es ward Morgen: der zweite Tag.
9 ¶ Und Gott sprach: Es sammle sich das Wasser unter dem Himmel an einen Ort, daß man das Trockene sehe! Und es geschah also.
10  Und Gott nannte das Trockene Land; aber die Sammlung der Wasser nannte er Meer. Und Gott sah, daß es gut war.
11  Und Gott sprach: Es lasse die Erde grünes Gras sprossen und Gewächs, das Samen trägt, fruchtbare Bäume, deren jeder seine besondere Art Früchte bringt, in welcher ihr Same sei auf Erden! Und es geschah also.
12  Und die Erde brachte hervor Gras und Gewächs, das Samen trägt nach seiner Art, und Bäume, welche Früchte bringen, in welchen ihr Same ist nach ihrer Art. Und Gott sah, daß es gut war.
13  Und es ward Abend, und es ward Morgen: der dritte Tag.
14 ¶ Und Gott sprach: Es seien Lichter an der Himmelsfeste, zur Unterscheidung von Tag und Nacht, die sollen zur Bestimmung der Zeiten und der Tage und Jahre dienen,
15  und zu Leuchtern an der Himmelsfeste, daß sie die Erde beleuchten! Und es geschah also.
16  Und Gott machte die zwei großen Lichter, das große Licht zur Beherrschung des Tages und das kleinere Licht zur Beherrschung der Nacht; dazu die Sterne.
17  Und Gott setzte sie an die Himmelsfeste, damit sie die Erde beleuchteten
18  und den Tag und die Nacht beherrschten und Licht und Finsternis unterschieden. Und Gott sah, daß es gut war.
19  Und es ward Abend, und es ward Morgen: der vierte Tag.
20 ¶ Und Gott sprach: Das Wasser soll wimmeln von einer Fülle lebendiger Wesen, und es sollen Vögel fliegen über die Erde, an der Himmelsfeste dahin!
21  Und Gott schuf die großen Fische und alles, was da lebt und webt, wovon das Wasser wimmelt, nach ihren Gattungen, dazu allerlei Vögel nach ihren Gattungen. Und Gott sah, daß es gut war.
22  Und Gott segnete sie und sprach: Seid fruchtbar und mehret euch und füllet das Wasser im Meere, und das Geflügel mehre sich auf Erden!
23  Und es ward Abend, und es ward Morgen: der fünfte Tag.
24 ¶ Und Gott sprach: Die Erde bringe hervor lebendige Wesen nach ihrer Art, Vieh, Gewürm und Tiere des Feldes nach ihrer Art! Und es geschah also.
25  Und Gott machte die Tiere des Feldes nach ihrer Art und das Vieh nach seiner Art. Und Gott sah, daß es gut war.
26 ¶ Und Gott sprach: Wir wollen Menschen machen nach unserm Bild uns ähnlich; die sollen herrschen über die Fische im Meer und über die Vögel des Himmels und über das Vieh auf der ganzen Erde, auch über alles, was auf Erden kriecht!
27  Und Gott schuf den Menschen ihm zum Bilde, zum Bilde Gottes schuf er ihn; männlich und weiblich schuf er sie.
28  Und Gott segnete sie und sprach zu ihnen: Seid fruchtbar und mehret euch und füllet die Erde und machet sie euch untertan und herrschet über die Fische im Meer und über die Vögel des Himmels und über alles Lebendige, was auf Erden kriecht!
29 ¶ Und Gott sprach: Siehe, ich habe euch alles Gewächs auf Erden gegeben, das Samen trägt, auch alle Bäume, an welchen Früchte sind, die Samen tragen; sie sollen euch zur Nahrung dienen;
30  aber allen Tieren der Erde und allen Vögeln des Himmels und allem, was auf Erden kriecht, allem, was eine lebendige Seele hat, habe ich alles grüne Kraut zur Nahrung gegeben. Und es geschah also.
31 ¶ Und Gott sah an alles, was er gemacht hatte, und siehe, es war sehr gut. Und es ward Abend, und es ward Morgen: der sechste Tag.

1 Mose 2
1 ¶ Also waren Himmel und Erde vollendet samt ihrem ganzen Heer,
2  so daß Gott am siebenten Tage sein Werk vollendet hatte, das er gemacht; und er ruhte am siebenten Tage von allen seinen Werken, die er gemacht hatte.
3  Und Gott segnete den siebenten Tag und heiligte ihn, denn an demselbigen ruhte er von all seinem Werk, das Gott schuf, als er es machte.
4 ¶ Dies ist die Entstehung des Himmels und der Erde, zur Zeit, als Gott der HERR Himmel und Erde schuf.
5  Es war aber noch kein Strauch des Feldes auf Erden, noch irgend ein grünes Kraut auf dem Felde gewachsen; denn Gott der HERR hatte noch nicht regnen lassen auf Erden, und es war kein Mensch vorhanden, um das Land zu bebauen.
6  Aber ein Dunst stieg auf von der Erde und befeuchtete die ganze Erdoberfläche.

\exclude{word=Aaron}

The output file should look something like this:

Abend   1 Mose 1,5; 1 Mose 1,8; 
        1 Mose 1,13; 1 Mose 1,19; 
        1 Mose 1,23; 1 Mose 1,31; 
        1 Mose 3,8; ….; 3 Mose 24,3; 
        …; Matthäus 14,15; …
Aber    1 Mose 8,9; 1 Mose 8,20;

In the example I've limited the column width so that the result can be printed in multiple columns (I would probably prefer 4 columns).
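The hanging-indent layout in the example can be sketched in Python with the standard `textwrap` module. This is only an illustration, not part of the original question: the function name `format_entry` and the default `width` and `indent` values are assumptions chosen to resemble the sample output.

```python
import textwrap

def format_entry(word, refs, width=34, indent=8):
    """Lay out one concordance entry with a hanging indent (hypothetical helper)."""
    nbsp = "\u00a0"
    # Protect the spaces inside each reference so "1 Mose 1,5;" is never split.
    body = " ".join(ref.replace(" ", nbsp) + ";" for ref in refs)
    head = word.ljust(indent - 1) + " "  # headword padded to the indent column
    wrapped = textwrap.fill(
        body,
        width=width,
        initial_indent=head,
        subsequent_indent=" " * indent,
        break_long_words=False,
        break_on_hyphens=False,
    )
    return wrapped.replace(nbsp, " ")

print(format_entry("Abend", ["1 Mose 1,5", "1 Mose 1,8", "1 Mose 1,13"]))
```

A narrower `width` gives narrower columns, which is what a four-column page would need.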

I've consulted webpages like http://www.mrunix.de/forums/archive/index.php/t-54191.html to find a solution, but didn't get the result I am looking for.

Is there an easy way to generate this index?

If someone knows a good python or php forum within stackexchange which would be more suitable for my requirements please let me know as well.

In reply to one comment: I would be very happy if the references in the concordance were sorted in the following order.

1. Mose
2. Mose
3. Mose
4. Mose
5. Mose
Josua
Richter
Ruth
1. Samuel
2. Samuel
1. Könige
2. Könige
1. Chronik
2. Chronik
Esra
Nehemia
Esther
Hiob
Psalmen
Sprüche
Prediger
Hohelied
Jesaja
Jeremia
Klagelieder
Hesekiel
Daniel
Hosea
Joel
Amos
Obadja
Jona
Micha
Nahum
Habakuk
Zephanja
Haggai
Sacharja
Maleachi
Matthäus
Markus
Lukas
Johannes
Apostelgeschichte
Römer
1. Korinther
2. Korinther
Galater
Epheser
Philipper
Kolosser
1. Thessalonicher
2. Thessalonicher
1. Timotheus
2. Timotheus
Titus
Philemon
Hebräer
Jakobus
1. Petrus
2. Petrus
1. Johannes
2. Johannes
3. Johannes
Judas
Offenbarung
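Sorting references in this order rather than alphabetically comes down to ranking each book by its position in the list. A minimal Python sketch, assuming references are stored as `(book, chapter, verse)` tuples; the names `BOOK_ORDER`, `RANK` and `sort_key` are illustrative, and the list is cut short here for brevity:

```python
# Only the first few books are listed; the remaining books would be
# appended in the same order as in the list above.
BOOK_ORDER = ["1. Mose", "2. Mose", "3. Mose", "4. Mose", "5. Mose",
              "Josua", "Richter", "Ruth"]

# The sample input writes "1 Mose" without the period, so periods are
# dropped before looking up a book's rank.
RANK = {name.replace(".", ""): i for i, name in enumerate(BOOK_ORDER)}

def sort_key(ref):
    """Order by book position, then numerically by chapter and verse."""
    book, chapter, verse = ref
    return (RANK[book.replace(".", "")], int(chapter), int(verse))

refs = [("Ruth", "1", "2"), ("1 Mose", "1", "13"),
        ("1 Mose", "1", "5"), ("2. Mose", "3", "1")]
ordered = sorted(refs, key=sort_key)
```

Converting chapter and verse to integers matters: as strings, "13" would sort before "5".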

Books which do not have chapters are the following:

Obadja
Philemon
2. Johannes
3. Johannes
Judas

If it's easier, I could add the number 1 in the input file for books which don't have chapters (e.g. Obadja > Obadja 1)... If it's a lot of work to sort the concordance in the order given in the code section, it would also be okay if the sorting were alphabetical...
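Instead of editing the input file, a parser could treat a heading without a chapter number as chapter 1. A small Python sketch of that idea; the regular expression and the name `parse_heading` are assumptions for illustration, not taken from any existing tool:

```python
import re

# A heading is a book name optionally followed by a chapter number,
# e.g. "1 Mose 2" or just "Obadja".
HEADING = re.compile(r"^(?P<book>.+?)(?:\s+(?P<chapter>\d+))?$")

def parse_heading(line):
    """Return (book, chapter); chapterless books default to chapter 1."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"not a heading: {line!r}")
    return match.group("book"), int(match.group("chapter") or 1)
```

With this, "Obadja" and "Obadja 1" would both be read as chapter 1 of Obadja, so the file could stay unchanged.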

Before getting the first answer, I forgot to mention that I only want to use software that is freely available (no shareware).

The output file should look like this: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf

You could use regular expressions. In Emacs, you could try M-x replace-regexp RET \([a-zA-Z]\w*\>\) RET \\Index{\1}. This would replace every 'word' (but not number) so that each word is wrapped by \Index{}. Then define \Index appropriately. But then you'd also want to 'fix' the chapter and verse numbers appropriately so that the \Index command does the right thing. — jon, Oct 30 '17 at 01:31
This seems to be a possible way. I guess there would be approximately 740'000 index commands inside the document. Will the software crash when generating the index? I'd like to try. How can I automatically combine the verse number with the correct chapter reference? Please give me an example. — laminin, Oct 30 '17 at 07:57
Excluding the commonest words in your language is an obviously good idea - for example you don't want to include "the." But excluding all short words might not be a good idea - for example in an English bible you probably want to include short words like "sin", "die", "eye", "ear", etc. — alephzero, Oct 30 '17 at 10:11
Look at the documentation to see how \index handles things like formatting a page number in bold or italic text, and entries like "see", "see also", etc. One way to deal with "chapter and verse" is to invent a fictitious page number = (1000 x the chapter + the verse) (assuming there are no chapters with more than 999 verses) and then unpack it again when you print the index. Don't forget the special case of some short books which have no chapters, only verse numbers! — alephzero, Oct 30 '17 at 10:16
The question is how to make the reference. First I need a command to replace the verse number with the reference text. Then I need a command which connects the reference text with the index words. Does anyone know how to do this? — laminin, Oct 30 '17 at 18:50
I don't think I have the time to take a completely unmarked-up text of the Bible and create a word concordance. There are several tricky things involved in this that you have not mentioned or not thought carefully about. These include the short books with no chapters, and the order of the books themselves in terms of how they are to be sorted in the concordance (in the order of the Bible itself, or alphabetically?). I suspect there are others, though I can't think of what they might be right now. — jon, Oct 31 '17 at 02:43
... However, building off my first comment, I'd suggest writing a command to each new line of your input files that provides information that \Index can hook into; and for each first newline in your input files (where the book name is), write a different command that sets other information the \Index command can use. Also: put each chapter of each book in a separate file. — jon, Oct 31 '17 at 02:46
@jon Concerning your first comment: I describe in the question how I would like the concordance to be ordered. Regardless of which type of ordering is used, I will give the bounty to the first complete code that works. — laminin, Oct 31 '17 at 19:23
@jon: that's exactly what I don't know: how to modify the input file so that \Index can hook into it... — laminin, Oct 31 '17 at 19:24
Do you want the concordance entries to be hyperlinks to the actual text? Should they contain page numbers, or just simply list the book and verse? If there are no page numbers and no hyperlinks, you do not need to alter the source text at all; you can simply generate the concordance entirely with an external program. If you do want hyperlinks or page numbers, you can still do it with an automatic external program. I can write one if you wish. — Michael Palmer, Oct 31 '17 at 19:39
No page numbers, no hyperlinks are needed. Yes please write me the complete code. The concordance should look similar to following document: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf Yes please write me the code. — laminin, Oct 31 '17 at 20:02
@laminin - OK. I guess we will have to work on this together a bit, for example on pruning the word list. It would probably be best to discuss by email. You can email me at mpalmer at uwaterloo dot ca. — Michael Palmer, Nov 01 '17 at 02:00
Here is a link to "stop words" in German and other languages: https://github.com/Alir3z4/stop-words — Michael Kane, Nov 01 '17 at 06:06
It is not clear why you are trying to do this in (La)TeX. As far as I can tell, you just have a simple text file, with a “title” (book/chapter) followed by words, and you want to create a mapping associating words to the titles they fall under. Is that right? What part of this has anything to do with TeX? (But see also this related earlier question.) — ShreevatsaR, Nov 02 '17 at 02:48
@ShreevatsaR the input is regular enough to do this by making space and ^^M active, and change the catcodes of punctuation to null. It's certainly possible with TeX, just not necessary. Although I get a kick out of this challenge — A Gold Man, Nov 02 '17 at 18:06
@AGoldMan I know that with TeX, by changing the input file, I would probably be able to work out how to generate a PDF, with some effort. Which other software could be used to get a PDF similar to http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf ? — laminin, Nov 02 '17 at 18:43
@laminin my plan actually doesn't involve changing the original file. Hopefully I'll get a chance to post it soon. — A Gold Man, Nov 02 '17 at 18:50
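As noted in the comments, if no page numbers or hyperlinks are needed, the concordance can be built entirely by an external program. The following Python sketch shows the core idea under stated assumptions: the input follows the sample layout (a blank line, then a "Book Chapter" heading, then "VerseNumber text" lines), and every heading ends in a chapter number. The names `WORD` and `build_concordance` are illustrative only.

```python
import re
from collections import defaultdict

# A word is a run of letters (digits, punctuation and "¶" are skipped).
WORD = re.compile(r"[^\W\d_]+")

def build_concordance(lines):
    """Map each upper-cased word to its list of (book, chapter, verse)."""
    index = defaultdict(list)
    book, chapter = None, None
    expect_heading = True  # the first line and every line after a blank one
    for raw in lines:
        line = raw.strip()
        if not line:
            expect_heading = True
            continue
        if expect_heading:
            # Assumes the heading ends in a chapter number, e.g. "1 Mose 2".
            book, _, chapter = line.rpartition(" ")
            expect_heading = False
            continue
        verse_no, _, text = line.partition(" ")
        ref = (book, int(chapter), int(verse_no))
        for word in WORD.findall(text):
            # Note: str.upper() turns "ß" into "SS".
            refs = index[word.upper()]
            if not refs or refs[-1] != ref:  # list each verse once per word
                refs.append(ref)
    return index
```

Reading the real file would be something like `build_concordance(open("bibschl.txt", encoding="latin-1"))`, with the file name and encoding taken from the accepted answer below.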

ShreevatsaR · Accepted Answer · 2017-11-07T23:42:19.190

The output file should look like this: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf

Sure, it is possible. How about this?

The complete concordance, of which the above is page 2, was generated by the following file (compile with lualatex rather than pdflatex):

\documentclass[a4paper]{article}
\usepackage{luatex85} % a4paper doesn't seem to take effect otherwise
\usepackage[margin=1cm, top=0.5cm, footskip=0.5cm]{geometry}
\usepackage{fontspec}
\setmainfont{Arial}

\begin{document}
\parindent=0pt \twocolumn \scriptsize
\pretolerance=-1 \sloppy  % Skip the first pass, and avoid overfull boxes
\spaceskip=\fontdimen2\font plus 2\fontdimen3\font minus \fontdimen4\font % Fewer underfull warnings, by allowing more stretch
\directlua{dofile('concordance.lua')}
\directlua{words, locations = concordance('bibschl.txt', 'latin1')}
\directlua{printConcordance(words, locations, {minLength=2, otherExclusions={'Aaron','RECHT','zorn'}})}
\end{document}

where concordance.lua is what generates the concordance (for each word, find all the places where it occurs) and typesets individual entries (bold keys, semicolons separating the locations, etc.):

function concordance(filename, encoding)
   -- Given a file that has the following structure:
   --         <blank line>
   --         BookName<space>ChapterNumber
   --         VerseNumber<space>Verse
   --         VerseNumber<space>Verse
   --         ...
   --         <blank line>
   --         BookName<space>ChapterNumber
   --         ...
   -- (Each verse itself is a sequence of space-separated words, ignoring case and trailing punctuation.)
   -- Returns two tables: (1) the words, in sorted order, and (2) mapping words to locations (book, chapter, verse)
   local readBookNext = true  -- Whether the *next* line contains Book & Chapter
   local currentBook = ''
   local currentChapter = 0
   local concordanceTable = {}
   for line in io.lines(filename) do
      line = makeUTF8(line, encoding)  -- Just in case encoding='latin1'
      if line == '' then readBookNext = true
      elseif readBookNext then
         currentBook, currentChapter = string.match(line, '^(.*) ([0-9]*)$')
         readBookNext = false
      else
         local verseNumber, verse = string.match(line, '^([0-9]*) (.*)$')
         for word in string.gmatch(verse, '%S+') do
            addWordToConcordance(word, currentBook, currentChapter, verseNumber, concordanceTable)
         end
      end
   end
   local keys = {}
   for word, _ in pairs(concordanceTable) do table.insert(keys, word) end
   table.sort(keys)
   return keys, concordanceTable
end

local badFirsts = {['¶'] = true, ['«'] = true, ['-'] = true, ['<']=true, ['(']=true, [',']=true}
local badLasts = {['.']=true, [',']=true, [':']=true, ['!']=true, [';']=true, ['?']=true, ['»']=true, [')']=true, ["'"]=true, ['>']=true, ['`']=true}
function addWordToConcordance(origWord, book, chapter, verse, concordanceTable)
   -- In `concordanceTable`, adds (book, chapter, verse) to the entry for word
   local word = unicode.utf8.upper(origWord)
   while badFirsts[unicode.utf8.sub(word, 1, 1)] do word = unicode.utf8.sub(word, 2) end  -- Strip leading punctuation
   while badLasts[unicode.utf8.sub(word, -1)] do word = unicode.utf8.sub(word, 1, -2) end -- Strip trailing punctuation
   if string.match(word, '^[0-9-]*B?$') then return end  -- Ignore empty words and words like "42-3" or "29-39B"
   local list = concordanceTable[word] or {}
   table.insert(list, {book=book, chapter=chapter, verse=verse})
   concordanceTable[word] = list
end

function makeUTF8(line, encoding)
   -- Converts text `line` from latin1 (ISO-8859-1) to UTF-8, if necessary.
   if encoding == 'utf8' or encoding == nil then
      return line
   elseif encoding == 'latin1' then
      local utf8Line = ''
      for c in string.gmatch(line, '.') do utf8Line = utf8Line .. unicode.utf8.char(string.byte(c)) end
      return utf8Line
   else
      error(string.format('Unknown encoding "%s"', encoding))
   end
end

--------------------------------------------------------------------------------
-- Above are functions that generate the concordance; below are functions for injecting that into TeX
--------------------------------------------------------------------------------
function printConcordance(words, locations, options)
   options = options or {}
   local includeThreshold = options['includeThreshold'] or 300 -- Words that occur too often are dropped
   local breakThreshold = options['breakThreshold'] or 1000    -- A "paragraph" break is added after enough entries
   local minLength = options['minLength'] or 1                 -- Words shorter than this length are dropped
   local otherExclusions = {}                                  -- Words in this table are dropped
   for _, ex in ipairs(options['otherExclusions'] or {}) do
      otherExclusions[unicode.utf8.upper(ex)] = true
   end
   local numPrinted = breakThreshold + 100  -- more than breakThreshold: we want a “break” before the first word
   for _, word in ipairs(words) do
      tex.print([[\hskip 1.5\fontdimen2\font plus 5\fontdimen3\font minus \fontdimen4\font]])
      local n = #locations[word]
      if n > includeThreshold then
         print(string.format('Dropping word %s (occurs %d times)', word, n))
      elseif unicode.utf8.len(word) < minLength then
         print(string.format('Dropping word %s (its length %d is less than %d)', word, unicode.utf8.len(word), minLength))
      elseif otherExclusions[word] ~= nil then
         print(string.format('Dropping word %s (it was specified as an exclusion)', word))
      else
         if numPrinted > breakThreshold then
            tex.print(string.format([[\par\underline{\textbf{%s}}\par]], word))
            numPrinted = 0
         else
            tex.print(string.format([[\textbf{%s}]], word))
         end
         numPrinted = numPrinted + n
         for i, v in ipairs(locations[word]) do
            if i > 1 then tex.sprint('; ') end
            tex.sprint(string.format('%s%s:%s', abbrev(v.book), v.chapter, v.verse))
         end
      end
   end
end

-- Abbreviations that aren't just first 3 letters
local knownBooks = {Richter='Ri', Ruth='Rt', Hiob='Hi', Psalmen='Ps', Hohelied='Hld', Klagelieder='Klg',
                    Amos='Am', Zephanja='Zef', ['Matthäus']='Mt', Lukas='Lk', Apostelgeschichte='Apg', Philemon='Phm'}
function abbrev(book)
   if knownBooks[book] ~= nil then return knownBooks[book] end
   -- First 3 letters, but if 2nd letter is a space then ignore it.
   if string.sub(book, 2, 2) == ' ' then
      knownBooks[book] = unicode.utf8.sub(book, 1, 1) .. unicode.utf8.sub(book, 3, 4)
   else
      knownBooks[book] = unicode.utf8.sub(book, 1, 3)
   end
   return knownBooks[book]
end

To change what TeX sees, you can change the printConcordance function. It has an options parameter through which you can control various things:

includeThreshold for dropping words that occur too frequently. (For example, “UND” occurs over 42000 times and you surely don't want to index it; what about “DAVID” which occurs 862 times?) The default threshold is set at 300 occurrences.
minLength for “restrictions from minimal word length” as mentioned in the question
otherExclusions for “restrictions to specific words to exclude” as mentioned in the question.
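For readers who would rather prototype the filtering outside TeX, the three options can be sketched in Python as follows. This mirrors the order of checks in `printConcordance` but is an independent illustration; the function name `select_words` and the shape of `index` (a dict from upper-cased word to a list of locations) are assumptions.

```python
def select_words(index, include_threshold=300, min_length=1, exclusions=()):
    """Split the words of `index` into kept words and dropped words with a reason."""
    excluded = {word.upper() for word in exclusions}
    kept, dropped = [], {}
    for word in sorted(index):
        occurrences = len(index[word])
        if occurrences > include_threshold:
            dropped[word] = "too frequent"
        elif len(word) < min_length:
            dropped[word] = "too short"
        elif word in excluded:
            dropped[word] = "excluded"
        else:
            kept.append(word)
    return kept, dropped
```

Returning the dropped words together with a reason serves the same purpose as the terminal messages in the Lua version: it makes the thresholds easy to tune.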

And of course you can edit the function yourself to change the behaviour further. When you compile the above file, the terminal output will tell you which words are being dropped, and why. On my laptop it takes about 5 seconds to generate the concordance, and about 25 seconds to typeset it, so the whole run takes about 30 seconds total.
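The abbreviation rule used by `abbrev` in the Lua code (a small table of exceptions, otherwise the first three letters, skipping a space in second position) is easy to try out on its own. A Python rendering of the same rule, given only as a sketch; the table `KNOWN` copies just part of the Lua table:

```python
# Hand-curated exceptions (a subset of the table in the Lua code).
KNOWN = {"Richter": "Ri", "Ruth": "Rt", "Hiob": "Hi", "Psalmen": "Ps",
         "Hohelied": "Hld", "Klagelieder": "Klg", "Lukas": "Lk",
         "Apostelgeschichte": "Apg", "Philemon": "Phm"}

def abbrev(book):
    """Abbreviate a book name, e.g. "1 Mose" becomes "1Mo"."""
    if book in KNOWN:
        return KNOWN[book]
    if len(book) > 1 and book[1] == " ":
        # Numbered book: keep the digit, then two letters of the name.
        return book[0] + book[2:4]
    return book[:3]
```

Because Python strings are sequences of characters rather than bytes, names with umlauts need no special handling here.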

+1. Very cool! Is there a way to use abbreviations for the various book names? Page 5 of the pdf-file concordance that the OP referenced has a list of abbreviations for the "books". E.g., "1. Mose" -> "1Mo", "Richter" -> "Ri", etc. Could this be implemented as a LuaTeX function? — Mico, Nov 06 '17 at 07:30
@Mico Sure, although in the example PDF I see “Ruth” abbreviated “Rt” for example (and “Hohelied” as “Hld”)… so as the abbreviation seems to be hand-curated rather than mechanically generated, it may be better to put those (hard-coded) abbreviations in a table called abbrev say, and in the place where it says v.book (inside printConcordance, towards the bottom of the file), replace it with abbrev[v.book]. — ShreevatsaR, Nov 06 '17 at 07:35
@Mico Done :-) See abbrev in the updated Lua code. Added a small table for the few exceptions. — ShreevatsaR, Nov 07 '17 at 18:09
Dear Poster, very very nice. Could I ask you for another favour? In the meantime I have realized that I want to sort the MAIN ENTRY with subentries like this: GOTT Erde 1Mo1:1; GOTT Himmel 1Mo1:1; the second word is the first noun after the main index entry (GOTT, all in capital letters), and if no noun follows, it is the first noun before the main entry (disregarding the first word of each sentence, which starts with a capital letter); and if there is no other noun, just take the next word, like JESUS weinte Joh 11:35. Is it possible for you to post that as well? — laminin, Nov 08 '17 at 22:19
@laminin Yes something like that should be possible, though it gets farther and farther away from a clean question and at some point should probably be a new question. :-) I also don't understand what you want: where do you get “GOTT Erde 1Mo1:1” from your sample text? According to the rule you described in the comment, the first verse “… Gott den Himmel…” leads only to one entry “GOTT Himmel 1Mo1:1”. (I'll take a look later; I'm also curious why you added a bounty and didn't come back to notice the answers, thereby “wasting” half your bounty….) — ShreevatsaR, Nov 08 '17 at 23:22
Dear ShreevatsaR, I think it's even more useful for me to generate an index with the word which follows the index word. (E.g.: "Im Anfang schuf Gott Himmel und Erde (1Mo1:1)" should be indexed like this: "GOTT Himmel schuf 1Mo1:1.") I am now working hard to generate an index with EXCEL formulas and Notepad++, but I'd like to come to you back in the next three or four weeks. — laminin, Nov 12 '17 at 14:52
@laminin The example does not match the text you wrote. From the example “GOTT Himmel schuf 1Mo1:1”, one can infer that you want, after every index word, both the next word and the previous word. But when you described what you want, you mentioned only the word that follows. So it's confusing what you want. — ShreevatsaR, Nov 13 '17 at 15:29
Yes you're right. Yes, I want the next word plus the previous word to the index word. I will come back to you soon! — laminin, Nov 13 '17 at 20:30
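The variant agreed on in these last comments (each index word followed by the next word and then the previous word) could be prototyped roughly as below. This is a sketch under assumptions: words are taken to be plain runs of letters, no attempt is made to detect nouns, and the name `context_entries` is made up for illustration.

```python
import re

WORD = re.compile(r"[^\W\d_]+")  # runs of letters only

def context_entries(verse_text, ref):
    """Return (KEYWORD, next_word, previous_word, ref) for every word in a verse."""
    words = WORD.findall(verse_text)
    entries = []
    for i, word in enumerate(words):
        previous_word = words[i - 1] if i > 0 else ""
        next_word = words[i + 1] if i + 1 < len(words) else ""
        entries.append((word.upper(), next_word, previous_word, ref))
    return entries

entries = context_entries("Im Anfang schuf Gott Himmel und Erde", "1Mo1:1")
```

For the example sentence from the comments, the entry for GOTT carries "Himmel" as the next word and "schuf" as the previous one, matching "GOTT Himmel schuf 1Mo1:1".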

Michael Kane · Answer 2 · 2017-11-01T06:07:28.157


Because you asked for links to other programs, there is a solution, albeit a commercial one. First, create your PDF file using LaTeX. Second, run that file through PDF Index Generator: https://www.pdfindexgenerator.com/. You will be able to modify the list of words so as to remove common words. You will then get a file which you can append to your LaTeX-generated PDF. One caveat is that the generated list will not be in the same format, that is, in terms of font and styles, as your LaTeX-generated text. To achieve this you would have to extract the concordance text from the PDF and then append it to your LaTeX-generated text. I have no connection with PDF Index Generator other than being a user.

A freeware list of common or "stop words" for German and other languages can be found at: https://github.com/Alir3z4/stop-words
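Applying such a stop-word list to a concordance takes only a few lines. The Python sketch below assumes the list is available as plain text with one word per line and that the concordance is a dict keyed by upper-cased words. Both are assumptions, as are the function names and the file name in the usage comment.

```python
def load_stop_words(lines):
    """Build a set of upper-cased stop words from an iterable of text lines."""
    return {line.strip().upper() for line in lines if line.strip()}

def remove_stop_words(index, stop_words):
    """Return a copy of `index` without the entries whose key is a stop word."""
    return {word: refs for word, refs in index.items() if word not in stop_words}

# Hypothetical usage with a local copy of a German stop-word file:
#   with open("german.txt", encoding="utf-8") as handle:
#       stop_words = load_stop_words(handle)
stop_words = load_stop_words(["und\n", "der\n", "\n", "die\n"])
filtered = remove_stop_words({"UND": [1, 2], "GOTT": [1], "DIE": [3]}, stop_words)
```

Unlike a minimum word length, this keeps short but meaningful words while still dropping the most common ones.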

edited Nov 01 '17 at 06:07

answered Oct 31 '17 at 19:50

Michael Kane


The question was modified so as to preclude commercial solutions after this post. Nevertheless, this is a solution for those who find a commercial alternative acceptable. – Michael Kane Oct 31 '17 at 20:09
Thank you for your suggestion. Will the output look somehow like: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf ? – laminin Oct 31 '17 at 20:38
While I have used the program on multiple occasions, I never had to use it for a text as long as the Bible. As I recall the concordance generated was dual-column with a reference to the pages where the word appeared. However, the developer is very responsive and you might ask him if a format similar to the 1951 text could be generated. If you take the results generated and add the raw text to your original LaTeX file you can put them in any format you wish. – Michael Kane Oct 31 '17 at 21:01
