Recently, I have implemented an extension to the classical Damerau-Levenshtein and Gotoh (aka "affine gaps") string matching algorithms. To evaluate this modification, however, I need a data set that is large (in terms of the number of examples) and broad (in terms of word meanings) and that contains word pairs of the form (correct spelling, modified variant). The modifications might result from abbreviation, spelling errors, or the like, but not from artificial contamination.
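To make the expected data format concrete, here is a minimal sketch of the kind of evaluation I have in mind: score each (correct spelling, variant) pair with a baseline distance. The tuple-list format and the restricted Damerau-Levenshtein (optimal string alignment) baseline are just illustrative assumptions, not tied to any particular published data set.

    def osa_distance(a: str, b: str) -> int:
        """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    # Hypothetical word pairs in the format I am looking for:
    # (correct spelling, real-world variant from abbreviation or misspelling).
    pairs = [("management", "mgmt"), ("accommodation", "accomodation")]
    for correct, variant in pairs:
        print(correct, variant, osa_distance(correct, variant))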
In other words: I'd like to compare my string matching algorithms with the current state of the art on correctly labelled real-world data. Do you know of any sources where I can obtain not only the misspelled or otherwise polluted data but, along with it, the correctly spelled variants of the words?
I have already looked into the data sets mentioned on pp. 65 ff. of 'An Introduction to Duplicate Detection' by Felix Naumann and Melanie Herschel (i.e. Cora Citation Matching, an enriched freeDB data set from the Hasso Plattner Institute, and data from scientific publications that has been manually clustered at the University of Trier), but I'd like to apply my tests to as many different data sets as possible.