I'm trying to put together a regular expression search that finds any two (or more) words that are within n (e.g., more than 1, less than 5) words of each other. The goal is to search over a prose text, and find unneeded repetitions of words close to each other.
Example: in the following text, the search should identify "package":
The postman delivered a package, and the package was heavy.
The challenge is that the two words can be any two words, but must be the same two words. I've been trying to figure out a way to work with * or + (I'm fairly new to regular expressions), but of course, wildcards would match every word, so they don't work. Is there any search structure like $1 within n of $1 that would translate to regex?
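For what it's worth, the "$1 within n of $1" idea corresponds to a backreference: capture a word in a group, allow a bounded number of other words, then require `\1` to match the same text again. Here is a minimal sketch in Python, assuming a "word" is a run of `\w` characters and that matching should ignore case. The function name `find_close_repeats` and the `max_gap` parameter are made up for illustration.

```python
import re

def find_close_repeats(text, max_gap=4):
    """Return (word, matched_text) pairs for words that recur with
    at most max_gap other words between the two occurrences."""
    pattern = re.compile(
        r"\b(\w+)\b"                        # a whole word, captured as group 1
        rf"(?:\W+\w+){{0,{max_gap}}}?"      # up to max_gap intervening words (lazy)
        r"\W+\1\b",                         # separator, then the same word again
        re.IGNORECASE,
    )
    return [(m.group(1), m.group(0)) for m in pattern.finditer(text)]

text = "The postman delivered a package, and the package was heavy."
print(find_close_repeats(text))
# [('package', 'package, and the package')]
```

Note that `finditer` returns non-overlapping matches, so a second repetition that begins inside an earlier match would not be reported separately. Setting `max_gap=0` finds only immediately doubled words.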
tr -s '[[:punct:][:space:]]' '\n' < file that splits a file into words (from http://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line), and pipe that through sort -u. Then you could use a script to iterate each word/item from the output into the regular expression above, and print every result that returns True. I'll try and test this tomorrow; should be fairly easy to write a script or plugin that takes care of it. – zoned post meridiem May 01 '14 at 00:35
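The per-word loop described in this comment could look roughly like the sketch below. It is only an approximation: `re.findall(r"\w+", ...)` stands in for the `tr ... | sort -u` step (the two split words slightly differently, e.g. around underscores), and the pattern built for each word is an assumption, since the regular expression the comment refers to is not shown here. The name `repeated_words` and the `max_gap` parameter are hypothetical.

```python
import re

def repeated_words(text, max_gap=4):
    """List each distinct word that occurs twice with at most
    max_gap other words in between."""
    # Rough stand-in for: tr -s '[[:punct:][:space:]]' '\n' < file | sort -u
    words = sorted(set(re.findall(r"\w+", text.lower())))
    hits = []
    for word in words:
        w = re.escape(word)
        # Assumed pattern: the word, up to max_gap other words, the word again.
        pattern = rf"\b{w}\b(?:\W+\w+){{0,{max_gap}}}?\W+{w}\b"
        if re.search(pattern, text, re.IGNORECASE):
            hits.append(word)
    return hits

text = "The postman delivered a package, and the package was heavy."
print(repeated_words(text))
# ['package']
```

This runs one search per distinct word, so it is slower than a single backreference pattern on long texts, but it makes it easy to skip words you don't care about (articles, prepositions) before searching.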