I want to extract the URLs of all CSS files from an HTML source. The first output is as expected, but why are the second and third outputs different? Note that only "href" was removed from the beginning of the first search string, and then "href=", while Shortest[x__] stays the same, so I would expect all three outputs to be identical. What am I doing wrong?

str="\"/><link rel=\"stylesheet\" type=\"text/css\" href=\"/some/path/to/css/name.min.css";
StringCases[str,"href=\""~~Shortest[x__]~~y:".css"->x~~y]
StringCases[str,"=\""~~Shortest[x__]~~y:".css"->x~~y]
StringCases[str,"\""~~Shortest[x__]~~y:".css"->x~~y]
Clear[str]

(* {"/some/path/to/css/name.min.css"} *)
(* {"stylesheet\" type=\"text/css\" href=\"/some/path/to/css/name.min.css"} *)
(* {"/><link rel=\"stylesheet\" type=\"text/css\" href=\"/some/path/to/css/name.min.css"} *)

Update:

At first I thought it had something to do with the escape characters inside the strings, but here is a simpler example that shows the same behavior:

StringCases["A---A--A____B-A_B-A---A______B---AAAAB","A"~~Shortest[x__]~~"B"->x]

(* {"---A--A____","_","---A______","AAA"} *)

But I believe the correct result should be:

(* {"____","_","______"} *)
azerbajdzan

2 Answers

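This is simply a consequence of how the lazy quantifier in regular expressions works: the engine takes the leftmost starting position at which a match is possible and only then extends the match as little as possible. An online test gives the same result.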

You should understand that string expressions are first converted to regular expressions by Mathematica. You can see the result with StringPattern`PatternConvert:

StringPattern`PatternConvert[#][[1]] & /@ {"href=\"" ~~ Shortest[x__] ~~ y : ".css", 
  "=\"" ~~ Shortest[x__] ~~ y : ".css", "\"" ~~ Shortest[x__] ~~ y : ".css"}
{"(?ms)href=\"(.+?)(\\.css)", "(?ms)=\"(.+?)(\\.css)", "(?ms)\"(.+?)(\\.css)"}

Hence you shouldn't be fooled by the name Shortest: in string patterns it has no relation to Shortest in Mathematica's own pattern matcher, which behaves differently.
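To see that the string pattern and a hand-written lazy regex behave identically, the simplified example from the question can be run both ways. This is only a minimal sketch: the names viaStringPattern and viaRegex are illustrative, and the regex omits the (?ms) flags because the test string contains no newlines.

```mathematica
(* Test string from the question's simplified example *)
str = "A---A--A____B-A_B-A---A______B---AAAAB";

(* String-pattern form using Shortest *)
viaStringPattern = StringCases[str, "A" ~~ Shortest[x__] ~~ "B" -> x];

(* Hand-written lazy regex; "$1" refers to the first capture group *)
viaRegex = StringCases[str, RegularExpression["A(.+?)B"] -> "$1"];

viaStringPattern === viaRegex
(* True: both give {"---A--A____", "_", "---A______", "AAA"} *)
```

Each match starts at the leftmost available "A" and stops at the first "B" after it; the engine never moves the start forward to look for a shorter match.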

UPDATE

On this page, several techniques to overcome this feature of the lazy quantifier are provided, including the most general (but not the most efficient) Tempered Greedy Token Solution. It can be applied as follows for making an equivalent of what could be called a true shortest BlankNullSequence string expression*:

Clear[shortest]
shortest[start_, end_, "IncludeBoundaries" -> True] := 
  RegularExpression[
   StringTemplate["`START`(?:(?!`START`)(?!`END`).)*`END`"][<|"START" -> start, 
     "END" -> end|>]];
shortest[start_, end_] := shortest[start, end, "IncludeBoundaries" -> True]
shortest[start_, end_, "IncludeBoundaries" -> False] := 
  RegularExpression[
   StringTemplate["(?<=`START`)(?:(?!`START`)(?!`END`).)*(?=`END`)"][<|"START" -> start, 
     "END" -> end|>]];

Testing:

front = "A";
back = "B";
str = "A---A--A____B-A_B-A---A______B---AAAAB";
StringCases[str, shortest[front, back, "IncludeBoundaries" -> False]]
{"____", "_", "______", ""}
front = "href=\"";
back = "\\.css";
str = "\"/><link rel=\"stylesheet\" type=\"text/css\" \
href=\"/some/path/to/css/name.min.css";
StringCases[str, shortest[front, back, "IncludeBoundaries" -> False]]
front = "=\"";
StringCases[str, shortest[front, back, "IncludeBoundaries" -> False]]
front = "\"";
StringCases[str, shortest[front, back, "IncludeBoundaries" -> False]]
{"/some/path/to/css/name.min"}
{"/some/path/to/css/name.min"}
{"/some/path/to/css/name.min"}

*As the OP shows in the comments, this method fails miserably in more complicated cases:

front = "tomato";
back = "iconic";
str = "gffghtomatomato12345iconiconictomatomatoiconiconic";
StringCases[str, shortest[front, back, "IncludeBoundaries" -> False]]
{"mato12345", "mato", ""}

This result is wrong. The expected result is {"12345",""}.

Here is another version which gives the desired result:

Clear[shortest2]
shortest2[str_, start_, end_] := 
  StringCases[str, 
   RegularExpression[
     StringTemplate["(?!.{1,`len`}`START`)`START`((?:(?!`START`)(?!`END`).)*)`END`"][<|
       "len" -> StringLength[start], "START" -> start, "END" -> end|>]] -> "$1"];

front = "tomato";
back = "iconic";
str = "gffghtomatomato12345iconiconictomatomatoiconiconic";
shortest2[str, front, back]

{"12345", ""}

However, in some special cases this method also fails:

front = "NotEnd";
back = "End";
str = "NotEndNotEnd1234NotEnd";
shortest2[str, front, back]
{}

Hence the approach suggested by the OP should be preferred.


UPDATE 2

It seems that I managed to find a really universal solution through regular expressions:

Clear[ShortestStringBetween]
Options[ShortestStringBetween] = {"IncludeBoundaries" -> False, 
   "BoundaryOverlaps" -> False};
ShortestStringBetween[str_String, start_String, end_String, OptionsPattern[]] :=
  Module[{bInclude = OptionValue["IncludeBoundaries"],
    bOverlap = OptionValue["BoundaryOverlaps"]},
   Which[
    (* boundaries are part of the result and are consumed by each match *)
    bInclude && Not[bOverlap],
    StringCases[str, RegularExpression[
      StringTemplate["`START`(?:(?!`END`).(?<!`START`))*`END`"][
       <|"START" -> start, "END" -> end|>]]],
    (* boundaries are consumed but only the captured interior is returned *)
    Not[bInclude] && Not[bOverlap],
    StringCases[str, RegularExpression[
       StringTemplate["`START`((?:(?!`END`).(?<!`START`))*)`END`"][
        <|"START" -> start, "END" -> end|>]] -> "$1"],
    (* boundaries are only asserted via lookarounds, so they may be shared *)
    Not[bInclude] && bOverlap,
    StringCases[str, RegularExpression[
      StringTemplate["(?<=`START`)(?:(?!`END`).(?<!`START`))*(?=`END`)"][
       <|"START" -> start, "END" -> end|>]]],
    (* as above, with the boundary strings re-attached to each result *)
    bInclude && bOverlap,
    StringCases[str, match : RegularExpression[
        StringTemplate["(?<=`START`)(?:(?!`END`).(?<!`START`))*(?=`END`)"][
         <|"START" -> start, "END" -> end|>]] :> StringJoin[start, match, end]]
    ]];

Note that the start and end parameters are directly inserted into RegularExpression and therefore must be regular expressions in the Mathematica format. And since PCRE (on which RegularExpression is based) doesn't support infinite repetition within a lookbehind, the start parameter must be a fixed-length regexp or contain alternations of different but pre-determined lengths (for example, "cat|raccoon"). The end parameter has no such restriction. But I haven't tested how this implementation behaves with non-fixed length parameters.
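Because start and end are spliced into the regular expression verbatim, a literal boundary such as ".css" has to be escaped first (the earlier test writes back = "\\.css" by hand). A minimal sketch of a helper for this follows; escapeForRegex is a hypothetical name, not a built-in function, and its list of metacharacters is an assumption covering the usual regex specials outside character classes.

```mathematica
(* Hypothetical helper: prefix each regex metacharacter with a backslash *)
escapeForRegex[s_String] := StringReplace[s,
   c : (Alternatives @@ Characters["\\^$.|?*+()[]{}"]) :> "\\" <> c];

escapeForRegex[".css"]
(* "\\.css" *)

(* Unescaped, "." matches any character, so "Acss" is matched too *)
StringCases["x.css xAcss", RegularExpression[".css"]]
(* {".css", "Acss"} *)

(* Escaped, only the literal ".css" is matched *)
StringCases["x.css xAcss", RegularExpression[escapeForRegex[".css"]]]
(* {".css"} *)
```

An escaped literal always has a fixed length, so it also satisfies the lookbehind restriction on the start parameter described above.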

It works correctly in all the test cases:

front = "tomato";
back = "iconic";
str = "gffghtomatomato12345iconiconictomatomatoiconiconic";
ShortestStringBetween[str, front, back]
{"12345", ""}
front = "NotEnd";
back = "End";
str = "NotEndNotEnd1234NotEnd";
ShortestStringBetween[str, front, back]
ShortestStringBetween[str, front, back, "BoundaryOverlaps" -> True]
{"Not"}
{"Not", "1234Not"}
Alexey Popkov
  • So it is simply a bug anyway. I do not think it worked the same way in older versions. – azerbajdzan Aug 26 '22 at 14:02
  • @azerbajdzan It isn't a bug, and it works the same way in version 8.0.4 (I just tested). – Alexey Popkov Aug 26 '22 at 14:08
  • How is it not a bug? What is the purpose of Shortest if it does not represent the shortest possible string? – azerbajdzan Aug 26 '22 at 14:11
  • @azerbajdzan It represents the lazy quantifier from regular expressions in the Wolfram Language. I agree that the decision to call it Shortest is dubious. Please read this answer for a discussion. – Alexey Popkov Aug 26 '22 at 14:14
  • I tried to read it, but I cannot follow. – azerbajdzan Aug 26 '22 at 14:40
  • @azerbajdzan I updated the answer with a RegularExpression-based solution. – Alexey Popkov Aug 28 '22 at 04:05
  • I like my version more: it is less complicated and gives the shortest string without overlaps in all circumstances. My code and your code are not equivalent. My version returns shortest["gffghtomatomato12345iconiconictomatomatoiconiconic","tomato","iconic"]=={"12345",""}, while yours gives StringCases["gffghtomatomato12345iconiconictomatomatoiconiconic", shortest["tomato", "iconic", "IncludeBoundaries" -> False]]=={"mato12345","mato",""}. Your code did not even find "12345", and on the other hand it returns "mato", which is longer than "". "mato" would be valid only if overlaps were allowed. – azerbajdzan Aug 28 '22 at 08:02
  • ...but if overlaps were allowed, then besides "mato" your output would also be missing "icon". Furthermore, if overlaps were allowed, "shortest" would lose its meaning, because then all occurrences of start~~___~~end would be valid. – azerbajdzan Aug 28 '22 at 08:18
  • @azerbajdzan I see. Thank you for a good example. – Alexey Popkov Aug 28 '22 at 08:58
  • @azerbajdzan I updated the answer with another version which gives the correct result in such cases. – Alexey Popkov Aug 28 '22 at 10:07
  • @azerbajdzan It seems that I still managed to find a really universal solution through regular expressions. Please see the "UPDATE 2" section. – Alexey Popkov Aug 29 '22 at 04:47
  • Great. Now I know why they chose the simplest possible definition of Shortest: it is hard to test the "real shortest" in all circumstances of differently nested strings. You can never be sure whether you overlooked some complexly nested string for which it might not work. – azerbajdzan Aug 29 '22 at 09:23

This finds the shortest strings inside str that lie between the strings front and back (including the empty string).

shortest[str_,front_,back_]:=Module[{p1,p2,p},
p1={#,1}&/@StringPosition[str,front][[All,2]];
p2={#,2}&/@StringPosition[str,back][[All,1]];
p={1,-1}+#&/@SequenceCases[Sort[Join[p1,p2]],{{_,1},{_,2}}][[All,All,1]];
StringTake[str,p]
]

front="A";
back="B";
str="A---A--A____B-A_B-A---A______B---AAAAB";
shortest[str,front,back]
Clear[front,back,str]

(* {"____", "_", "______", ""} *)


front="href=\"";
back=".css";
str="\"/><link rel=\"stylesheet\" type=\"text/css\" href=\"/some/path/to/css/name.min.css";
shortest[str,front,back]
front="=\"";
shortest[str,front,back]
front="\"";
shortest[str,front,back]
Clear[front,back,str]

(* {"/some/path/to/css/name.min"} *)
(* {"/some/path/to/css/name.min"} *)
(* {"/some/path/to/css/name.min"} *)
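
To make the idea behind shortest easier to follow, here is a step-by-step walk-through of the same computation on a tiny input. It is only a sketch: the variable names starts, ends, merged, pairs and spans are illustrative and do not appear in the function above.

```mathematica
str = "A-AxB";

(* End position of every front match, tagged with 1 *)
starts = {#, 1} & /@ StringPosition[str, "A"][[All, 2]];
(* {{1, 1}, {3, 1}} *)

(* Start position of every back match, tagged with 2 *)
ends = {#, 2} & /@ StringPosition[str, "B"][[All, 1]];
(* {{5, 2}} *)

(* Merge in positional order, then keep each front that is
   immediately followed by a back in the merged list *)
merged = Sort[Join[starts, ends]];
pairs = SequenceCases[merged, {{_, 1}, {_, 2}}];
(* {{{3, 1}, {5, 2}}} *)

(* Move one character inward from each boundary to get the interior *)
spans = {1, -1} + # & /@ pairs[[All, All, 1]];
(* {{4, 4}} *)

StringTake[str, spans]
(* {"x"} *)
```

The first "A" is skipped because another front occurs before any back, which is what makes the returned string the shortest one.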

azerbajdzan