Why is StringExpression faster than RegularExpression?

Question

Edit: as noted by Albert Retey, the performance difference is only seen when subexpression extraction is performed. If the test below is used instead, the timings are similar:

First@Timing[r1 = StringCases[textBig, se];]
First@Timing[r2 = StringCases[textBig, re];]

According to the documentation:

Any symbolic string pattern is first translated to a regular expression. You can see this translation by using the internal StringPattern`PatternConvert function.
StringPattern`PatternConvert["a" | "" ~~ DigitCharacter ..] // InputForm
{"(?ms)a?\\d+", {}, {}, Hold[None]}
The first element returned is the regular expression, while the rest of the elements have to do with conditions, replacement rules, and named patterns.

The regular expression is then compiled by PCRE, and the compiled version is cached for future use when the same pattern appears again. The translation from symbolic string pattern to regular expression only happens once.

Based on this I would expect a StringExpression and the regular expression produced by PatternConvert to perform similarly, but they do not. Taking an example from this recent question, please observe:

se = Shortest["(ICD-9-CM " ~~ code__ ~~ ")"];
re = First @ StringPattern`PatternConvert[se] // RegularExpression

RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"]

text1 = "  A  Vitamin D Deficiency (ICD-9-CM 268.9) (ICD-9-CM 268.9) 09/11/2015  01 ";
textBig = StringJoin @ ConstantArray[text1, 1*^6];

First@Timing[r1 = StringCases[textBig, se :> code];]
First@Timing[r2 = StringCases[textBig, re :> "$1"];]

r1 === r2

0.718
1.903
True

Why is using the StringExpression more than twice as fast as the RegularExpression?
Is there a way to make the RegularExpression matching run just as quickly?

I think you have v7. Just mentioning that it's the same in 9 too. — Szabolcs, May 22 '13 at 15:59
The answer might be in StringPattern`PatternConvert[re]... which is not the same as re. — Szabolcs, May 22 '13 at 16:00
@Szabolcs It seems that has an effect, but not of the same magnitude. For example, nesting it ten times: re2 = Nest[ RegularExpression @ First @ StringPattern`PatternConvert[#] &, se, 10 ] yields a timing of 2.137 -- a minor increase, compared to the se/re difference. — Mr.Wizard, May 22 '13 at 16:06
Do you know what the ?: means? (I don't.) Maybe this is something worth mentioning to support then? — Szabolcs, May 22 '13 at 16:11
I tried time pcregrep --buffer-size=100000000 '(?ms)(?:(?ms)\(ICD-9-CM (.+?)\))' test.txt >/dev/null with pcregrep 8.32. This doesn't replace, it only matches, so it may not be correct. It takes 0.09 s here. — Szabolcs, May 22 '13 at 16:17
@Szabolcs ?: means clustering but not capturing, so you can group regexes within (?:) but doesn't make backreferences as () does. — Stefan, May 22 '13 at 16:26
This helps a lot! First@Timing[ r2 == StringCases[textBig, "" ~~ x : RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"] -> x];] — Stefan, May 22 '13 at 17:05
It seems that StringExpression does something magical with RegularExpression. Presumably it does compile/cache the pattern. — Stefan, May 22 '13 at 17:06
@Stefan that is not quite the same operation as it is matching the entire (ICD-9-CM 268.9) section rather than just the number. Nevertheless there does seem to be something going on. — Mr.Wizard, May 22 '13 at 17:32
@Mr.Wizard during tracing i found that out as well :(...i'm working on it how to get captured expressions... — Stefan, May 22 '13 at 17:40
Could StringPattern be a subset of regular expressions and therefore allow better optimization? Just a guess and I can't think of a way to test it... — SEngstrom, May 22 '13 at 18:02
@SEngstrom Actually I think it's the other way around; StringExpression can have programmatic conditions, etc., which is why the conversion with PatternConvert has additional fields (besides the regular expression). Of course in this simple case the additional fields are not used, but you can see it in Shortest["(ICD-9-CM " ~~ code__ ~~ ")" /; StringLength[code] < 7] // StringPattern`PatternConvert or Shortest["(ICD-9-CM " ~~ code__?LetterQ ~~ ")"] // StringPattern`PatternConvert — Mr.Wizard, May 22 '13 at 18:07
I have no time to investigate that more closely, but I think the difference could well come from how the resubstituting of the matches is implemented. If you try the same thing without such substitutions, then the runtime differences are marginal, e.g.: StringCases[textBig, RegularExpression["(?ms)\\(ICD-9-CM .+?\\)"]] vs. StringCases[textBig, Shortest["(ICD-9-CM " ~~ __ ~~ ")"]] seem to be equally fast (or probably slow when compared to pcregrep :-) — Albert Retey, May 22 '13 at 19:56
@Mr.Wizard after my first approach, which showed nearly similar performance behaviour and after Albert's further observation, I came up with the following:
First@Timing[r3=StringCases[textBig,"(ICD-9-CM "~~x:RegularExpression[".+?"]~~")"->x];]

Which shows similar performance behaviour.

So we may conclude that you may choose to use RegularExpression in Mma, but you should avoid capturing/substitution if you want to be fast, plus capturing inside a StringExpression context does not work at all. — Stefan, May 23 '13 at 07:53

score 13 · Answer 1 · edited Mar 16 '21 at 21:31

Since we cannot see Mathematica's source code, we don't know the exact algorithm it uses for string pattern searching.

In most other languages, though, literal string matching is done with an algorithm such as KMP. KMP is in effect a very compact form of the DFA pattern-matching algorithm. You can find a comparison here. The construction costs of KMP and the DFA are O(m) and Θ(m |Σ|) respectively, where m is the length of the search pattern and |Σ| is the size of the alphabet, but the matching phase is O(n) for both, where n is the length of the text being searched. Detailed code for DFA pattern search is in Sedgewick's Algorithms, 4th Edition: Chapter 5, Strings, gives a DFA simulation algorithm. The clever idea in KMP is to match the pattern against itself to build a one-dimensional prefix array, instead of building the two-dimensional dfa[][] array used in Sedgewick's book. The essence of the two algorithms is the same; you can think of KMP as an optimization of DFA pattern matching.
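To make the prefix-array idea concrete, here is a minimal KMP sketch in Wolfram Language. The names kmpTable and kmpSearch are made up for this example, and this is only an illustration of the textbook algorithm, not a claim about how Mathematica matches strings internally.

```mathematica
(* Hypothetical helper: build the KMP prefix (failure) table by matching
   the pattern against itself. fail[[i]] is the length of the longest
   proper prefix of the first i characters that is also a suffix of them. *)
kmpTable[pat_String] :=
 Module[{p = Characters[pat], m, fail, k = 0},
  m = Length[p];
  fail = ConstantArray[0, m];
  Do[
   While[k > 0 && p[[k + 1]] =!= p[[i]], k = fail[[k]]];
   If[p[[k + 1]] === p[[i]], k++];
   fail[[i]] = k,
   {i, 2, m}];
  fail]

(* An empty pattern is treated as having no matches in this sketch. *)
kmpSearch["", _String] := {}

(* Hypothetical helper: return the start positions of all (possibly
   overlapping) occurrences of pat in text, scanning text once. *)
kmpSearch[pat_String, text_String] :=
 Module[{p = Characters[pat], t = Characters[text], m, fail, k = 0, hits = {}},
  m = Length[p];
  fail = kmpTable[pat];
  Do[
   While[k > 0 && p[[k + 1]] =!= t[[j]], k = fail[[k]]];
   If[p[[k + 1]] === t[[j]], k++];
   (* A full match ends at position j; fall back to allow overlaps. *)
   If[k == m, AppendTo[hits, j - m + 1]; k = fail[[k]]],
   {j, Length[t]}];
  hits]

kmpTable["abab"]              (* {0, 0, 1, 2} *)
kmpSearch["abc", "zabcabc"]   (* {2, 5} *)
```

The table costs O(m) to build and the scan never moves backwards in the text, which is where the O(n) matching bound comes from. AppendTo is used only to keep the sketch short; for many hits, Sow and Reap would be a better fit.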

A regular expression, on the other hand, corresponds to a nondeterministic finite automaton (NFA). Every DFA is already an NFA, and every NFA can be converted into an equivalent DFA, but that DFA may need exponentially more states. That generality can make matching more expensive. According to the Running Time section of the Wikipedia article on regular expressions, there are algorithms that, once the NFA is constructed, do the pattern matching in O(n). In real life, though, the running time is usually multiplied by a constant factor. In this case, the regular expression code is more than twice as slow as the string expression one. This may be because more memory is used to construct the NFA, which eventually causes more cache misses, or because the more complex code compiles to more machine instructions.
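As a sketch of how an NFA can be matched in time linear in the text, the following simulates a small hand-built NFA by tracking the set of states it could be in. The automaton, the state numbering, and the names step and nfaAccepts are all assumptions made for this example; it recognizes the same strings as the regular expression (a|b)*a(a|b).

```mathematica
(* Hypothetical NFA for (a|b)*a(a|b): strings over {a, b} whose
   second-to-last character is "a". State 0 is the start, 2 is accepting. *)
step[0, "a"] = {0, 1};      (* nondeterministic: stay, or guess this "a" is second-to-last *)
step[0, "b"] = {0};
step[1, "a" | "b"] = {2};
step[_, _] = {};            (* no transition defined: this path dies *)

(* Simulate the NFA by carrying the set of all currently possible states.
   Each character costs at most (number of states) transitions. *)
nfaAccepts[s_String] :=
 Module[{states = {0}},
  Do[
   states = Union @@ (step[#, c] & /@ states),
   {c, Characters[s]}];
  MemberQ[states, 2]]

nfaAccepts /@ {"ab", "ba", "aab"}   (* {True, False, True} *)
```

Each input character is processed once, but the work per character grows with the number of active states, which is one way a constant factor over plain substring search can arise.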

The blog post here likewise reports that regular expressions are slower than plain string matching in C# by a constant factor: both algorithms run in linear time, just with different slopes. There are also regular expression implementations (such as Perl's) that can take exponential time. You can check it here.
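If you want to probe the exponential case yourself, a timing experiment along the following lines may help. The pattern (a+)+$ is a textbook example of nested quantifiers that a backtracking engine can explore in exponentially many ways; the names pathological, subject and timeFor are made up here. Whether a given Mathematica version actually shows the blow-up depends on the PCRE build and its optimizations, so treat the timings as something to observe rather than a guaranteed result.

```mathematica
(* Hypothetical demo pattern with nested quantifiers. *)
pathological = RegularExpression["(a+)+$"];

(* n copies of "a" followed by "b"; the trailing "b" makes the match fail. *)
subject[n_Integer] := StringJoin[ConstantArray["a", n]] <> "b";

(* Wall-clock seconds for one failed whole-string match. *)
timeFor[n_Integer] :=
  First @ AbsoluteTiming[StringMatchQ[subject[n], pathological]];

(* Compare how the time grows as n increases. *)
Table[{n, timeFor[n]}, {n, 8, 20, 4}]
```

On an engine that backtracks naively, the time would be expected to roughly double with each extra character, in contrast to the linear behaviour discussed above.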
