Recently, I have implemented an extension to the classical Damerau-Levenshtein and Gotoh (aka "affine gaps") string matching algorithms. To evaluate this modification, however, I need a data set that is large (in terms of the number of examples) and broad (in terms of word meanings) and that contains word pairs of the form (correct spelling, modified variant). The modifications might result from abbreviation, spelling errors, or the like, but not from artificial contamination.
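To make the expected data format concrete, here is a minimal sketch of the kind of evaluation I have in mind: score each (correct spelling, variant) pair with a baseline distance. The tuple-list format and the restricted Damerau-Levenshtein (optimal string alignment) baseline are just illustrative assumptions, not tied to any particular published data set.

    def osa_distance(a: str, b: str) -> int:
        """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    # Hypothetical word pairs in the format I am looking for:
    # (correct spelling, real-world variant from abbreviation or misspelling).
    pairs = [("management", "mgmt"), ("accommodation", "accomodation")]
    for correct, variant in pairs:
        print(correct, variant, osa_distance(correct, variant))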
In other words: I'd like to compare my string matching algorithms with the current state of the art on correctly labelled real-world data. Do you know of any sources where I can obtain not only the misspelled or otherwise polluted data but, along with it, the correctly spelled variants of the words?
I have already looked into the data sets mentioned on pp. 65 ff. of 'An Introduction to Duplicate Detection' by Felix Naumann and Melanie Herschel (i.e. Cora Citation Matching, an enriched freeDB data set from the Hasso Plattner Institute, and data from scientific publications that has been manually clustered at the University of Trier), but I'd like to apply my tests to as many different data sets as possible.