Regular expression that finds same words within 'n' words of each other

Question

I'm trying to put together a regular expression search that finds any two (or more) words that are within n (e.g., more than 1, less than 5) words of each other. The goal is to search over a prose text, and find unneeded repetitions of words close to each other.

Example: in the following text, the search should identify "package":

The postman delivered a package, and the package was heavy.

The challenge is that the two words can be any two words, but must be the same two words. I've been trying to figure out a way to work with * or + (I'm fairly new to regular expressions), but of course, wildcards would match every word, so they don't work. Is there any search structure like $1 within n of $1 that would translate to regex?

score 2 · Accepted Answer · answered Apr 30 '14 at 21:53

I don't think a regex is what you need here: you cannot express that unless you know the words beforehand.

So, I guess you could go ahead and parse every word from the text (e.g. sorting, then removing duplicates). Then, you run the following regular expression, for every word found (here, the word is foo):

\bfoo\W+(?:\w+\W+){1,5}?foo\b

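Here, \b is a word boundary, followed by the actual word. After that, \W+ matches one or more non-word characters (spaces, punctuation). Then comes a group (surrounded by ()) that matches one intervening word plus the non-word characters after it (\w+\W+). The group can occur 1 to 5 times ({1,5}), and the trailing ? makes the repetition lazy, so the nearest repetition is tried first. The ?: means the group is not captured. Finally, the word has to appear again, followed by another word boundary.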

See an example in action here.
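
The per-word approach described above could be sketched in Python roughly as follows. This is only an illustration, not a tested tool: the function name repeated_words and its min_gap/max_gap parameters are made up for this example, and lowercasing the text is an assumption about how case should be handled.

```python
import re

def repeated_words(text, min_gap=1, max_gap=5):
    """Return the words that occur again with min_gap..max_gap words in between.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    lowered = text.lower()  # assumption: repetitions should be found case-insensitively
    found = set()
    # Collect every distinct word, then test the pattern once per word.
    for word in sorted(set(re.findall(r"\w+", lowered))):
        w = re.escape(word)
        pattern = r"\b" + w + r"\W+(?:\w+\W+){%d,%d}?" % (min_gap, max_gap) + w + r"\b"
        if re.search(pattern, lowered):
            found.add(word)
    return found

text = "The postman delivered a package, and the package was heavy."
print(sorted(repeated_words(text)))  # ['package', 'the']
```

Note that with a range of 1 to 5 intervening words, "the" is reported as well, because "The" and "the" are exactly five words apart once the text is lowercased. Filtering out common function words would be a separate step.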

slhck

Fantastic. Let me make sure I understand this correctly. The idea would be to use something like tr -s '[[:punct:][:space:]]' '\n' < file that splits a file into words (from http://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line), and pipe that through sort -u. Then you could use a script to iterate each word/item from the output into the regular expression above, and print every result that returns True. I'll try and test this tomorrow; should be fairly easy to write a script or plugin that takes care of it. – zoned post meridiem May 01 '14 at 00:35
Something like that, yeah! Although of course it would also be necessary to check case-insensitively. But you could do that by lowercasing the input text. – slhck May 01 '14 at 07:03

score 0 · Answer 2 · answered Nov 21 '20 at 16:33

From https://en.wikipedia.org/wiki/Regular_expression

Many features found in virtually all modern regular expression libraries provide an expressive power that exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is (.+)\1.

For this reason, the pattern given in slhck's answer can be made much more flexible by capturing the first word and referring back to it with \1:

\b(\w+)\W+(?:\w+\W+){1,5}?\1\b

For the test string given in the question, this results in:

full match: package, and the package
group 1: package

Tested on https://regex101.com/ with the PCRE flavour selected.
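
For anyone who wants to try this outside regex101, here is a minimal Python sketch. It assumes Python's re module treats this pattern the same way PCRE does, which holds for the test string. The second pattern is my own variation, not part of the original answer: it moves everything after the first word into a lookahead, so overlapping repetitions are all reported, and it uses re.IGNORECASE so that "The" and "the" count as the same word. The names NEARBY_REPEAT and NEARBY_REPEAT_ALL are illustrative.

```python
import re

text = "The postman delivered a package, and the package was heavy."

# The backreference pattern from this answer, unchanged (case-sensitive).
NEARBY_REPEAT = re.compile(r"\b(\w+)\W+(?:\w+\W+){1,5}?\1\b")

m = NEARBY_REPEAT.search(text)
print(m.group(0))  # package, and the package
print(m.group(1))  # package

# Variation (an assumption, not from the answer): only the first word is
# consumed; the rest is checked in a lookahead so matches may overlap.
NEARBY_REPEAT_ALL = re.compile(
    r"\b(\w+)(?=\W+(?:\w+\W+){1,5}?\1\b)",
    re.IGNORECASE,
)

repeated = [hit.group(1).lower() for hit in NEARBY_REPEAT_ALL.finditer(text)]
print(repeated)  # ['the', 'package']
```

The lookahead matters because regex matches do not overlap: a case-insensitive version of the plain pattern would first match "The postman delivered a package, and the" and then continue searching after that span, which skips the repeated "package".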
