Edit: as noted by Albert Retey, the performance difference is only seen when subexpression extraction is performed. If the following test is used instead of the one below, the timings are similar:
First@Timing[r1 = StringCases[textBig, se];]
First@Timing[r2 = StringCases[textBig, re];]
According to the documentation:
Any symbolic string pattern is first translated to a regular expression. You can see this translation by using the internal StringPattern`PatternConvert function.
StringPattern`PatternConvert["a" | "" ~~ DigitCharacter ..] // InputForm
{"(?ms)a?\\d+", {}, {}, Hold[None]}
The first element returned is the regular expression, while the rest of the elements have to do with conditions, replacement rules, and named patterns.
The regular expression is then compiled by PCRE, and the compiled version is cached for future use when the same pattern appears again. The translation from symbolic string pattern to regular expression only happens once.
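A minimal sketch of the round trip described in the documentation: convert a symbolic pattern with PatternConvert, wrap the first element in RegularExpression, and check that both forms return the same matches. The test string and the variable names (pat, regex, sample) are made up for illustration, and the pattern is parenthesised explicitly to make the grouping obvious.

```wolfram
(* Symbolic pattern from the documentation excerpt, with explicit grouping. *)
pat = ("a" | "") ~~ (DigitCharacter ..);

(* First element of the PatternConvert result is the regex source string. *)
regex = First @ StringPattern`PatternConvert[pat];

(* Hypothetical sample input, used only to compare the two forms. *)
sample = "a12 b345 a6";

(* Both calls should find the same substrings if the translation is faithful. *)
StringCases[sample, pat] === StringCases[sample, RegularExpression[regex]]
```

This only compares results, not speed; the timing comparison is the subject of the question itself.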
Based on this, I would expect a StringExpression and the regular expression produced by PatternConvert to perform similarly, but they do not. Taking an example from this recent question, please observe:
se = Shortest["(ICD-9-CM " ~~ code__ ~~ ")"];
re = First @ StringPattern`PatternConvert[se] // RegularExpression
RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"]
text1 = " A Vitamin D Deficiency (ICD-9-CM 268.9) (ICD-9-CM 268.9) 09/11/2015 01 ";
textBig = StringJoin @ ConstantArray[text1, 1*^6];
First@Timing[r1 = StringCases[textBig, se :> code];]
First@Timing[r2 = StringCases[textBig, re :> "$1"];]
r1 === r2
0.718
1.903
True
- Why is using the StringExpression more than twice as fast as the RegularExpression?
- Is there a way to make the RegularExpression matching run just as quickly?
StringPattern`PatternConvert[re]... which is not the same as re. – Szabolcs May 22 '13 at 16:00
re2 = Nest[RegularExpression @ First @ StringPattern`PatternConvert[#] &, se, 10] yields a timing of 2.137, a minor increase compared to the se/re difference. – Mr.Wizard May 22 '13 at 16:06
?: means? (I don't.) Maybe this is something worth mentioning to support then? – Szabolcs May 22 '13 at 16:11
time pcregrep --buffer-size=100000000 '(?ms)(?:(?ms)\(ICD-9-CM (.+?)\))' test.txt >/dev/null with pcregrep 8.32. This doesn't replace, it only matches, so it may not be correct. It takes 0.09 s here. – Szabolcs May 22 '13 at 16:17
(ICD-9-CM 268.9) section rather than just the number. Nevertheless there does seem to be something going on. – Mr.Wizard May 22 '13 at 17:32
StringExpression can have programmatic conditions, etc., which is why the conversion with PatternConvert has additional fields (besides the regular expression). Of course in this simple case the additional fields are not used, but you can see it in Shortest["(ICD-9-CM " ~~ code__ ~~ ")" /; StringLength[code] < 7] // StringPattern`PatternConvert or Shortest["(ICD-9-CM " ~~ code__?LetterQ ~~ ")"] // StringPattern`PatternConvert – Mr.Wizard May 22 '13 at 18:07
StringCases[textBig, RegularExpression["(?ms)\\(ICD-9-CM .+?\\)"]] vs. StringCases[textBig, Shortest["(ICD-9-CM " ~~ __ ~~ ")"]] seem to be equally fast (or probably slow when compared to pcregrep :-) – Albert Retey May 22 '13 at 19:56
First@Timing[r3 = StringCases[textBig, "(ICD-9-CM " ~~ x : RegularExpression[".+?"] ~~ ")" -> x];]
Which shows similar performance behaviour.
So we may conclude that you may choose to use RegularExpression in Mma, but you should avoid capturing/substitution if you want to be fast, plus capturing inside a StringExpression context does not work at all.
– Stefan May 23 '13 at 07:53
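One way to act on that conclusion, as an untested sketch rather than a benchmarked answer: match the whole parenthesised section with a regular expression that has no capture group, then trim the fixed prefix and the closing parenthesis with StringTake. It assumes every match starts with exactly "(ICD-9-CM " and ends with ")". The names text, prefixLength, whole and codes are illustrative, and whether this actually beats the capturing version on textBig would have to be measured.

```wolfram
(* Short sample in the shape of text1 from the question. *)
text = "A Vitamin D Deficiency (ICD-9-CM 268.9) (ICD-9-CM 268.9) 09/11/2015 01";

(* Length of the fixed prefix that precedes every code. *)
prefixLength = StringLength["(ICD-9-CM "];

(* Match the whole section; no subexpression is extracted at this stage. *)
whole = StringCases[text, RegularExpression["(?ms)\\(ICD-9-CM .+?\\)"]];

(* Drop the prefix and the trailing ")" from each match to leave the code. *)
codes = StringTake[whole, {prefixLength + 1, -2}]
```

The trade-off is that the extraction step relies on the prefix having a fixed length, so it does not generalise to patterns where the text around the captured part varies.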