As others have said, this cannot produce a good index: not every occurrence of a term is important, the approach cannot find concepts, there will be no cross-references, and it cannot deal with synonyms, homonyms, etc.
But if you really want to do it, here is a simple Lua script, autoindex.lua:
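#!/usr/bin/env texlua
-- usage: texlua autoindex.lua filewithterms < outputfile.txt > indexfile.idx
local indexterms = arg[1]
local file = indexterms and io.open(indexterms, "r")
if not file then
    io.stderr:write("Cannot load index terms: " .. tostring(indexterms) .. "\n")
    os.exit(1)
end
-- read the terms, one per line, lowercased to match the normalized words
local terms = {}
for line in file:lines() do
    local term = line:match("^%s*(.-)%s*$"):lower()
    if term ~= "" then terms[#terms + 1] = term end
end
file:close()
local text = io.read("*all")
-- pdftotext ends each page with a form feed; terminate the last page if needed
if text:sub(-1) ~= "\f" then text = text .. "\f" end
local page = 1
local words = {}
-- process pages; requiring the trailing \f avoids spurious empty matches,
-- which would otherwise inflate the page counter in some Lua versions
for t in text:gmatch("([^\f]*)\f") do
    -- tokenize words; add more characters which can't be part of words
    for x in t:gmatch("([^%s%.,;:!%?%(%)%-@%$]+)") do
        -- normalize case; note that this doesn't handle Unicode
        local word = string.lower(x)
        local w = words[word] or {}
        w[page] = true
        words[word] = w
    end
    page = page + 1
end
for _, term in ipairs(terms) do
    local match = words[term] or {}
    for p in pairs(match) do
        -- the backslash must be escaped inside a Lua string
        print('\\indexentry{' .. term .. '}{' .. p .. '}')
    end
end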
You must first convert your PDF file to text with the pdftotext utility:
pdftotext filename.pdf outputfile.txt
This preserves page breaks (as form feed characters), which the script uses to count pages. Then call the script with a file that lists one term per line:
texlua autoindex.lua filewithterms < outputfile.txt > indexfile.idx
This will write entries in standard makeindex format to indexfile.idx:
\indexentry{hello}{13}
\indexentry{hello}{9}
\indexentry{hello}{7}
\indexentry{world}{7}
\indexentry{world}{3}
\indexentry{world}{13}
\indexentry{world}{9}
\indexentry{world}{5}
You can then build the index with makeindex or xindy.
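On the page breaks mentioned above: pdftotext ends each page with a form feed character (`\f`), and the page numbers in the index come from counting those. The following is only a minimal sketch of that splitting step in isolation; `split_pages` is a hypothetical helper name that is not part of autoindex.lua, and the sample string is made up.

```lua
-- Sketch: split pdftotext-style output into pages on form feed characters.
-- split_pages is a hypothetical helper name, not part of autoindex.lua.
local function split_pages(text)
  local pages = {}
  -- make sure the last page is terminated, so that every page ends in \f
  if text:sub(-1) ~= "\f" then text = text .. "\f" end
  -- each match consumes one \f, so empty pages are kept and counted
  for page in text:gmatch("([^\f]*)\f") do
    pages[#pages + 1] = page
  end
  return pages
end

-- made-up sample: three pages, the second one is empty
local sample = "hello world\f\fworld again\f"
for number, content in ipairs(split_pages(sample)) do
  print(number, #content) -- page number and length of the page text
end
```

Keeping empty pages in the list matters here, because dropping them would shift every later page number.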
.txt file also contain the whereabouts of the keywords? By including the pages via pdfpages, LaTeX is technically not aware of the "source code" of these particular pages, so there seems to be no easy way to add index markers to certain words. Seeing that your PDFs are OCR'd, you might consider using the text file output by the OCR programme and either try to use this text to produce a new TeX file or run some kind of script to mark the position of the keyword in the file. A genuine LaTeX-only solution might be quite hard to come by. – moewe Oct 02 '13 at 14:48