What is a regex which matches exactly m to n characters but not more?

Question

StringCases["abcdefg hi jkl mn opq rstuv w xy z ", RegularExpression["[a-z]{1,3}"]]

returns

{"abc", "def", "g", "hi", "jkl", "mn", "opq", "rst", "uv", "w", "xy", "z"}

so it actually chops the first match of RegularExpression["[a-z]*"] into pieces of up to 3 characters (the max of the quantifier {1,3}).

I want a regular expression which does not do that but rather considers the findings of RegularExpression["[a-z]*"] which mismatch the quantifier {1,3} as mismatches. It should be greedy before applying specialized quantifiers.

There seems to be a misunderstanding. "[a-z]{1,3}" does not first do "[a-z]*" but it tries to match 1 to 3 characters and this is exactly what you observe. Further your question is not clear to me, what do you mean by: "which mismatch the quantifier {1,3} as mismatches". Do you mean you want to match a character string that is longer than 3 characters? — Daniel Huber, May 24 '21 at 09:47

score 9 · Accepted Answer · answered May 24 '21 at 10:17

You must specify what should come before and after the 1–3 characters, for example a word boundary \\b:

StringCases["abcdefg hi jkl mn opq rstuv w xy z ", 
            RegularExpression["\\b[a-z]{1,3}\\b"]]
(*    {"hi", "jkl", "mn", "opq", "w", "xy", "z"}    *)

Alternatively, split the string and then select the substrings you want:

Select[StringSplit["abcdefg hi jkl mn opq rstuv w xy z "], 
       1 <= StringLength[#] <= 3 &]
(*    {"hi", "jkl", "mn", "opq", "w", "xy", "z"}    *)
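
One caveat worth checking, as a minimal sketch: PCRE's \\b treats digits and the underscore as word characters, so the word-boundary pattern only isolates runs that are delimited by non-word characters such as spaces. The underscore-separated test string below is an illustrative assumption, not part of the original question.

```wolfram
(* Same word-boundary pattern, applied to an underscore-delimited string. *)
(* "_" counts as a word character, so there is no boundary between a letter and "_". *)
StringCases["abcdefg_hi_jkl_mn_opq_rstuv_w_xy_z_",
            RegularExpression["\\b[a-z]{1,3}\\b"]]
(*    {}    *)
```

The whole string is a single "word" here, so nothing matches. If the delimiters may be word characters, the boundaries have to be stated in terms of the target character class instead.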

score 6 · Answer 2 · answered May 24 '21 at 15:54

We could test for the negation of the pattern ("mismatches") by using negative look-behind and look-ahead assertions. They can be used to discard any matches of [a-z]{1,3} that are preceded or followed by a character that would also match [a-z]:

matches = StringCases @ RegularExpression @
  "(?<![a-z])[a-z]{1,3}(?![a-z])";

So then:

"abcdefg hi jkl mn opq rstuv w xy z " // matches
(* {hi,jkl,mn,opq,w,xy,z} *)
"abcdefg_hi_jkl_mn_opq_rstuv_w_xy_z_" // matches
(* {hi,jkl,mn,opq,w,xy,z} *)

If we would rather not repeat the target pattern three times, we could use a named group and subroutine calls to it (although for a short pattern like [a-z] this is probably overkill):

matches2 = StringCases @ RegularExpression @
  "(?<!(?<p>[a-z]))(?&p){1,3}(?!(?&p))";
"abcdefg hi jkl mn opq rstuv w xy z " // matches2
(* {hi,jkl,mn,opq,w,xy,z} *)
"abcdefg_hi_jkl_mn_opq_rstuv_w_xy_z_" // matches2
(* {hi,jkl,mn,opq,w,xy,z} *)
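
To cover the general "m to n" case from the title, the bounds can be spliced into the pattern string. This is only a minimal sketch: boundedRuns is a hypothetical helper name, not a built-in, and it assumes the target class is still [a-z].

```wolfram
(* Hypothetical helper: boundedRuns[m, n] returns an operator that finds *)
(* maximal runs of [a-z] whose length is between m and n characters.     *)
boundedRuns[m_Integer, n_Integer] /; 1 <= m <= n :=
  StringCases @ RegularExpression[
    "(?<![a-z])[a-z]{" <> ToString[m] <> "," <> ToString[n] <> "}(?![a-z])"];

"abcdefg hi jkl mn opq rstuv w xy z " // boundedRuns[2, 3]
(* {hi,jkl,mn,opq,xy} *)
```

The look-behind and look-ahead stay fixed; only the quantifier changes, so runs shorter than m or longer than n are rejected as a whole rather than chopped into pieces.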

All of this syntax is described by the PCRE pattern documentation. The WL documentation page Working With String Patterns states that PCRE is the library used to implement regular expressions.

Yes, your look-behind & look-ahead are much better (and versatile) than what I wrote. — Roman, May 24 '21 at 17:26
@Roman Thanks, and I quite like your solutions as well. Indeed, should the OP's circumstances permit I think your word boundary solution is superior on the grounds of simplicity and fewer hieroglyphics :) I was originally going to add my answer as a comment on yours but (as site regulars will have observed) I was never going to fit what I wanted to say into a comment :D — WReach, May 24 '21 at 18:28
