CSS Selectors for Symbolic XML

Question

Symbolic XML is a convenient way of managing XML in Mathematica, at least for casual use or small tasks. Here and elsewhere you can find examples of XML handling in Mathematica, including XSLT/XPath emulations. In my opinion, however, if you must perform frequent searches, content extraction, or deep transformations (e.g., web scraping), Symbolic XML quickly becomes clumsy and cumbersome to work with. If you have done any kind of web programming, you're probably aware of CSS selectors, the more common choice (the other being XPath) for tackling this kind of problem.

I'm curious: are there implementations of CSS selector (and/or XPath) engines in Mathematica? If not, would it be feasible/sensible to implement similar capabilities in Mathematica? Or is it better to stick with ad-hoc queries and follow the approach of the existing examples?

As elegant code is better than a thousand opinions, we could illustrate the answers with this greatly simplified toy CSS selector engine:

       Pattern | Matches                                                              | Selector
---------------| -------------------------------------------------------------------- | ----------------
             * | any element.                                                         | Universal
             E | any E element (i.e., an element of type E).                          | Type
     E.warning | any E element whose class is "warning".                              | Class
        E#myid | any E element with ID equal to "myid".                               | Id
           E F | any F element that is a descendant of an E element.                  | Descendant
         E > F | any F element that is a child of an element E.                       | Child
         E + F | any F element immediately preceded by a sibling element E.           | Adjacent Sibling
         E ~ F | any F element preceded by a sibling element E.                       | General Sibling
        E[foo] | any E element with the "foo" attribute set (whatever the value).     | Attribute
  E[foo="bar"] | any E element whose "foo" attribute value is exactly equal to "bar". | Attribute

A selector is a chain of one or more simple selectors separated by combinators.
Combinators are: white space, ">",  "+", and "~". White space may appear between a combinator and the simple selectors around it.
A simple selector is either a type selector or universal selector followed immediately by zero or more attribute selectors, ID selectors, or class selectors, in any order.
The universal selector may be omitted if followed by other simple selector components.
The simple selector matches if all of its components match.

Examples:
"*div" or "div", all elements with tag "div";
"div *", all descendants at any level of all div elements;
".aClass+[attr=val]", elements with attr equal to val that immediately follow a sibling element with class aClass.

If you need a formal grammar, take a look at the W3C spec, but for this illustration I think we can dispense with error handling and strict processing.

Then, given context, any expression containing XMLElements, and selector, a string compliant with the toy CSS grammar, implement positionSelector[context, selector] to return the positions (in the same format as the built-in Position) of the XMLElements matching selector:

xml = ImportString["<!doctype html><html><h1>Animals</h1>", "XMLObject"];
positionSelector[xml, "body"]
positionSelector[xml, "*"]
Extract[xml, positionSelector[xml, "h1"]]

(* ==> {{2, 3, 1}} *)
(* ==> {{2, 3, 1, 3, 1}, {2, 3, 1}, {2}} *)
(* ==> {XMLElement["h1", {}, {"Animals"}]} *)
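For readers less familiar with Position, here is a minimal sketch of the position format that positionSelector is expected to return. The doc fragment is hand-written for illustration only and is not one of the test documents in this question.

```mathematica
(* A small hand-written Symbolic XML fragment; no import needed. *)
doc = XMLElement["body", {}, {
    XMLElement["h1", {}, {"Animals"}],
    XMLElement["p", {"class" -> "note"}, {"Dogs"}]}];

(* Position returns one index list per matching subexpression.
   Part 3 of an XMLElement is its list of children. *)
pos = Position[doc, XMLElement["p", _, _]]
(* ==> {{3, 2}} *)

(* Extract accepts the same list of positions. *)
Extract[doc, pos]
(* ==> {XMLElement["p", {"class" -> "note"}, {"Dogs"}]} *)
```

Because the result is an ordinary list of positions, it can be passed on to Extract, ReplacePart, or MapAt without conversion.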

xml = ImportString["<body><div id='abc' class='aClass'>Eh... <b id='b1'><div><span>What's </span>up, <span id='def'>doc?</span>up, <span id='def'>doc?</span></div></b></div><div class='bClass' attr1='val1' attr2='val2'>Legen<div class='cClass'>--<div class='dClass'>Wait for it</div>--</div>Dary!</div><div id='abc'>No. <div attr1='val3'>I <b id='b2'><span>am</span></b> your</div><div> father.</div></div></body>", "XML", "NormalizeWhitespace" -> False];
result = positionSelector[xml, "div>span"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 2, 3, 1, 3, 1}, {2, 3, 1, 3, 2, 3, 1, 3, 3}, {2, 3, 1, 3, 2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["span", {}, {"What's "}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}]}] *)

result = positionSelector[xml, "[class]>div"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 2, 3, 2, 3, 2}, {2, 3, 2, 3, 2}} *)
(* ==> ColumnForm[{
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], 
XMLElement["div", {"class" -> "cClass"}, {"--", 
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], "--"}]}] *)

xml = ImportString["<!doctype html><html>
   <h1>Animals</h1>
   <h2 class='animal dog border-collie' id='lillith'><span class='christian Name'>Lillith&gt;</span></h2>
   <h2 class='animal dog mutt' id='maggie'><span class='christian Name'>Maggie</span><span class='nick Name'>Fatty</span></h2>
   </html>", 
"XMLObject"];

result = positionSelector[xml, "h2+.dog"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

result = positionSelector[xml, "h2+#maggie"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

positionSelector[xml, "h1~.dog"]

(* ==> {{2, 3, 1, 3, 3}, {2, 3, 1, 3, 5}} *)

positionSelector[xml, "h1~b"]

(* ==> {} *)

To be clear, I'm not looking for a full-blown implementation, but for strategies and idioms showing how Mathematica can resolve XML queries, and in which ways Mathematica is better equipped for that than mainstream functional/imperative languages.

Fallible · Answer 1 · 2014-06-05T10:16:33.247

I think Mathematica, with its symbolic and pattern-matching capabilities, is well suited to parsing and tree searching. It's better to check, though. Here are two quick proofs of concept of the toy engine: one ad hoc, the other using functional parsers.

It's readily apparent that every simple selector can be mapped to some form of XMLElement pattern:

"div.aClass[attr='val']" => XMLElement["div", {___,"class"->"aClass",___,"attr"->"val", ___},_]

Let's build on this.
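One caveat with a literal attribute-list pattern like the one above is that it is sensitive to attribute order. The sketch below, with hypothetical names (el1, el2, ordered, unordered), shows the difference and one order-independent variant that tests membership inside a condition.

```mathematica
(* Two hypothetical elements with the same attributes in different order. *)
el1 = XMLElement["div", {"class" -> "aClass", "attr" -> "val"}, {}];
el2 = XMLElement["div", {"attr" -> "val", "class" -> "aClass"}, {}];

(* Literal pattern: the attributes must appear in exactly this order. *)
ordered = XMLElement["div",
   {___, "class" -> "aClass", ___, "attr" -> "val", ___}, _];

(* Order-independent variant: test membership in a condition instead. *)
unordered = XMLElement["div",
   a_List /; (MemberQ[a, "class" -> "aClass"] && MemberQ[a, "attr" -> "val"]), _];

MatchQ[#, ordered] & /@ {el1, el2}    (* ==> {True, False} *)
MatchQ[#, unordered] & /@ {el1, el2}  (* ==> {True, True} *)
```

A single attribute needs no condition, since {___, rule, ___} already matches it at any position.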

Custom solution:

Some patterns and auxiliary functions used throughout the program:

NMSTART = "_" | LetterCharacter;
NMCHAR = "_" | WordCharacter | "-";
IDENT = "-" | NMSTART ~~ NMCHAR ...;
selectorPats = {
   "." ~~ val : IDENT :> (Sow["class" -> c_String /; inAttrQ[c, val]];""),
   "#" ~~ val : NMCHAR .. :> (Sow["id" -> val]; ""),
   "[" ~~ key : IDENT ~~ "=" ~~ "\"" ~~ val : Except["\""] .. ~~ "\"" ~~ "]" :> (Sow[key -> val]; ""),
   "[" ~~ key : IDENT ~~ "=" ~~ "'" ~~ val : Except["'"] .. ~~ "'" ~~ "]" :> (Sow[key -> val]; ""),
   "[" ~~ key : IDENT ~~ "=" ~~ val : IDENT ~~ "]" :> (Sow[key -> val]; ""),
   "[" ~~ key : IDENT ~~ "]" :> (Sow[key -> _]; "")
   };

inAttrQ = #1 === #2 || StringMatchQ[#1,
     {StartOfString, Whitespace, __ ~~ Whitespace} ~~ #2 ~~ {Whitespace, Whitespace ~~ __, EndOfString}
     ] &;

hasAttrsQ[attrs_List, r_List] := MemberQ[attrs, #] & /@ r /. {b__} :> And[b];    

cleanAttr[str_String] :=
 StringReplace[StringTrim@str,
  {StartOfString ~~ Shortest[key__] ~~ "=" ~~ val__ ~~ EndOfString :> StringTrim@key <> "=" <> StringTrim@val}]
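The class test treats the value of the class attribute as a whitespace-separated list of tokens. Roughly the same idea can be expressed by splitting the value and testing membership. This is only a sketch, and classTokenQ is a hypothetical name that is not used in the rest of the answer.

```mathematica
(* True when token is one of the whitespace-separated words in value. *)
classTokenQ[value_String, token_String] := MemberQ[StringSplit[value], token]

classTokenQ["animal dog mutt", "dog"]  (* ==> True *)
classTokenQ["animal dogs", "dog"]      (* ==> False: "dogs" is a different token *)
classTokenQ["dog", "dog"]              (* ==> True *)
```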

toXMLElement translates simpleSelector strings into XMLElement[...] patterns:

toXMLElement[str_String] := 
 Module[{attrs = _, tag = _, sel = StringTrim@str},
  StringLength@sel === 0 && Return[$Failed];

  sel = StringReplace[sel, "*" ~~ val : Except["*"] .. :> val];
  sel = StringReplace[sel, StartOfString ~~ type : IDENT :> (tag = type; "")];
  attrs = Rest@Reap[sel = StringReplace[sel, selectorPats]] // Flatten;
  (sel =!= "" && sel =!= "*" && attrs === {}) && Return[$Failed];
  XMLElement[
   tag,
   Switch[attrs,
    {}, _,
    {_}, {___, attrs[[1]], ___},
    _, Condition @@ {a : {__}, hasAttrsQ[a, attrs]}],
   _]
  ]

parseSelector is an ad-hoc lexer/parser to translate the selector into a list of patterns: {pat1, {combinator2, pat2}, {combinator3, pat3}, ...}

parseSelector[str_String] :=
 Module[{f, n = 0},
  Replace[
   GatherBy[
    StringSplit[str, {
      "[" ~~ attr :Except["]"] ... ~~ "]" :> "[" ~~ cleanAttr[attr] ~~ "]",
      WhitespaceCharacter ... ~~ comb : ">" | "~" | "+" ~~ WhitespaceCharacter ... :> f[comb],
      Whitespace -> f[" "]}],
    (Head@# === f && ++n; n) &],
   {{f[a_], b__} :> {a, toXMLElement@StringJoin@b}, {b__} :> toXMLElement@StringJoin@b}, 1]]

Actual interpreter:

positionSelector[ctx_, sel_String] :=
 Fold[positionSelector[ctx, #1, #2] &,
    Position[ctx, First@#], 
    Rest@#
    ] &@parseSelector[sel]
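To make the Fold above concrete, here is a self-contained sketch restricted to the child combinator. The names doc, parsed, and childStep are hypothetical, and parsed is written by hand in the {pat1, {combinator2, pat2}, ...} shape rather than produced by parseSelector.

```mathematica
(* Hand-written document: a div with a span child and a span grandchild. *)
doc = XMLElement["div", {}, {
    XMLElement["span", {}, {"a"}],
    XMLElement["b", {}, {XMLElement["span", {}, {"b"}]}]}];

(* Hand-written stand-in for the parse of "div>span". *)
parsed = {XMLElement["div", _, _], {">", XMLElement["span", _, _]}};

(* One step: for each current position p, search only level 1 of the
   child list (part 3) and prefix the hits with the path to that list. *)
childStep[ctx_, curr_List, {">", pat_}] :=
  Flatten[
   Function[p,
     Join[Append[p, 3], #] & /@
      Position[Extract[ctx, Append[p, 3]], pat, {1}]] /@ curr,
   1];

(* Seed with the positions of the first pattern, then fold over the rest. *)
Fold[childStep[doc, #1, #2] &, Position[doc, First[parsed]], Rest[parsed]]
(* ==> {{3, 1}} *)
```

The grandchild span at {3, 2, 3, 1} is not returned, because the search is limited to level 1 of the div's child list.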

Auxiliary function to delete descendants of nodes in a list. The list is the result of Position, and thus is ordered depth-first, post-order:

deleteDescendants[col_List] := (Reverse@col //. {lft___, aa : {a__}, b : {__}, rght___} /; (MatchQ[b, {a, __}]) :> {lft, aa, rght}) // Reverse
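The reason for pruning is that a descendant search below an ancestor already covers everything below its descendants, so keeping both would produce duplicate hits. The same idea can be sketched with a prefix test on positions. The names descendantQ and dropDescendants are hypothetical, and this version is not meant as a replacement for the rule-based definition above.

```mathematica
(* q lies below p when p is a proper prefix of q. *)
descendantQ[p_List, q_List] :=
  Length[q] > Length[p] && Take[q, Length[p]] === p;

(* Keep only the positions that are not below another position in the list. *)
dropDescendants[ps_List] :=
  Select[ps, Function[q, ! (Or @@ (descendantQ[#, q] & /@ ps))]];

dropDescendants[{{3, 1, 3, 2}, {3, 1}, {3, 2}}]
(* ==> {{3, 1}, {3, 2}} *)
```

This sketch compares every pair of positions, so it is quadratic in the length of the list, which is fine for small documents.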

positionSelector[ctx_, curr_List, {" ", sel_XMLElement}] :=
 (With[{ref = Append[#, 3]},
      Join[ref, #] & /@ Position[Extract[ctx, ref], sel]
      ] & /@ deleteDescendants@curr)~Flatten~1

positionSelector[ctx_, curr_List, {">", sel_XMLElement}] :=
 (With[{ref = Append[#, 3]},
      Join[ref, #] & /@ Position[Extract[ctx, ref], sel, {1}]
      ] & /@ curr)~Flatten~1

positionSelector[ctx_, curr_List, {"+", sel_XMLElement}] :=
 (With[{pos = #[[-1]], container = Drop[#, -1]},
      Extract[ctx, container] //
       Cases[
         Position[#, sel, {1}] /. {n_Integer} /; n <= pos :> Sequence[],
         {a_} /; MatchQ[#[[pos + 1 ;; a - 1]], {___String}] :> Append[container, a] 
         ] &
      ] & /@ curr)~Flatten~1

positionSelector[ctx_, curr_List, {"~", sel_XMLElement}] :=
 (With[{pos = #[[-1]], container = Drop[#, -1]},
      Join[container, #] & /@ 
        Position[ctx[[Sequence @@ container]], sel, {1}] /. {n_} /; n <= pos :> Sequence[]
      ] & /@ curr)~Flatten~1
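The adjacent-sibling rule has to skip text nodes, because whitespace strings usually sit between sibling elements. A minimal sketch of that check on a hand-written child list follows; children, pos, and candidates are illustrative names only.

```mathematica
(* Children of some parent: an h1, then two h2 siblings, separated by text. *)
children = {
   XMLElement["h1", {}, {"A"}], "\n",
   XMLElement["h2", {}, {"B"}], "\n",
   XMLElement["h2", {}, {"C"}]};

pos = 1;  (* index of the reference element (the h1) in the child list *)

(* Matching siblings that come after the reference element. *)
candidates = Select[
   Flatten@Position[children, XMLElement["h2", _, _], {1}], # > pos &]
(* ==> {3, 5} *)

(* Adjacent: only strings may lie between the reference and the candidate. *)
Select[candidates, MatchQ[children[[pos + 1 ;; # - 1]], {___String}] &]
(* ==> {3} *)
```

For the general sibling combinator the second Select is simply omitted, so candidates is already the answer.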

Test:

xml = ImportString["<!doctype html><html><h1>Animals</h1>", "XMLObject"];
positionSelector[xml, "body"]
positionSelector[xml, "*"]
Extract[xml, positionSelector[xml, "h1"]]

(* ==> {{2, 3, 1}} *)
(* ==> {{2, 3, 1, 3, 1}, {2, 3, 1}, {2}} *)
(* ==> {XMLElement["h1", {}, {"Animals"}]} *)

xml = ImportString["<body><div id='abc' class='aClass'>Eh... <b id='b1'><div><span>What's </span>up, <span id='def'>doc?</span>up, <span id='def'>doc?</span></div></b></div><div class='bClass' attr1='val1' attr2='val2'>Legen<div class='cClass'>--<div class='dClass'>Wait for it</div>--</div>Dary!</div><div id='abc'>No. <div attr1='val3'>I <b id='b2'><span>am</span></b> your</div><div> father.</div></div></body>", "XML", "NormalizeWhitespace" -> False];

result = positionSelector[xml, "div>span"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 2, 3, 1, 3, 1}, {2, 3, 1, 3, 2, 3, 1, 3, 3}, {2, 3, 1, 3, 2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["span", {}, {"What's "}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}]}] *)

result = positionSelector[xml, "[class]>div"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 2, 3, 2, 3, 2}, {2, 3, 2, 3, 2}} *)
(* ==> ColumnForm[{
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], 
XMLElement["div", {"class" -> "cClass"}, {"--", 
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], "--"}]}] *)

xml = ImportString["<!doctype html><html>
          <h1>Animals</h1>
          <h2 class='animal dog border-collie' id='lillith'><span class='christian Name'>Lillith&gt;</span></h2>
          <h2 class='animal dog mutt' id='maggie'><span class='christian Name'>Maggie</span><span class='nick Name'>Fatty</span></h2></html>", "XMLObject"];

result = positionSelector[xml, "h2+.dog"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {
XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], 
XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

result = positionSelector[xml, "h2+#maggie"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {
XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], 
XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

positionSelector[xml, "h1~.dog"]
positionSelector[xml, "h1~b"]

(* ==> {{2, 3, 1, 3, 3}, {2, 3, 1, 3, 5}} *)
(* ==> {} *)

Functional parsers

A more standard approach would be to use functional parsers. Here is the lexer/parser using the FunctionalParsers package by Anton Antonov, so you don't have to reinvent the wheel.

First, we generate the parsers from an EBNF grammar:

Needs["FunctionalParsers`"]

Options[tokenize] = {QuotedStrings -> True, Quotes -> {"'", "\""}, CollapseWhitespace -> True};

tokenize[str_String, terminals : _List : {}, opts : OptionsPattern[]] :=
 Module[{
   terminalRules = If[TrueQ@OptionValue[CollapseWhitespace], Append[terminals, Whitespace -> " "], terminals],
   quoteRules = # ~~ s : Except[#] ... ~~ # :> QuotedString[s, #] & /@ OptionValue[Quotes]},
  Replace[
     If[TrueQ@OptionValue[QuotedStrings], StringSplit[str, quoteRules], {str}],
     s_String :> (StringSplit[s, terminalRules] /. "" -> Sequence[]),
     1
     ] /. {QuotedString[qs_String, q_: "\""] :> Sequence@{q, qs, q}} // Flatten
  ]

terminalRules = {
   "*" -> "*",
   "." -> ".",
   "#" -> "#",
   WhitespaceCharacter ... ~~ ">" ~~ WhitespaceCharacter ... -> ">",
   WhitespaceCharacter ... ~~ "~" ~~ WhitespaceCharacter ... -> "~",
   WhitespaceCharacter ... ~~ "+" ~~ WhitespaceCharacter ... -> "+",
   WhitespaceCharacter ... ~~ "=" ~~ WhitespaceCharacter ... -> "=",
   "[" ~~ WhitespaceCharacter ... -> "[",
   WhitespaceCharacter ... ~~ "]" -> "]"
   };

ClearAll[IAttr, ISimple];
IAttr[{key_String, val_String: ""}] := key -> If[val === "", _, val];
ISimple[{tag_String: "", attrs___Rule}] := XMLElement[
   If[tag === "*" || tag === "", _, tag],
   Switch[{attrs}, {}, _, {_}, {___, attrs, ___}, _, Condition @@ {a : {__}, hasAttrsQ[a, {attrs}]}],
   _];

selectorGrammar = "
         <selector> = <simple_selector> , { <comb_selector> } ;
    <comb_selector> = <combinator> , <simple_selector> <@ {#[[1]],#[[2]]}& ;
  <simple_selector> = [ <type_selector> | '*' ] , { <hash> | <class> | <attrib> } <@ ISimple[Flatten@#]& ;
       <combinator> = '+' | '>' | '~' | '?' ;
    <type_selector> = '_IdentifierString' ;
            <class> = '.' \[RightTriangle] '_WordString' <@ \"class\"\[Rule](a_String/;inAttrQ[a,#])& ;
             <hash> = '#' \[RightTriangle] '_IdentifierString' <@ \"id\"\[Rule]#& ;
           <attrib> = '[' \[RightTriangle] [ <s> ] \[RightTriangle] '_IdentifierString' , [ '=' \[RightTriangle] ( '_WordString' | <string> ) ] \[LeftTriangle] ']' <@ IAttr[Flatten[#]]& ;
          <string1> = '\"' \[RightTriangle] '_String' \[LeftTriangle] '\"' ;
          <string2> = \"'\" \[RightTriangle] '_String' \[LeftTriangle] \"'\" ;
           <string> = <string1> | <string2> ;
                <s> = { '?' } ;
  ";

code = ToTokens@selectorGrammar /. "'?'" -> "' '";
GenerateParsersFromEBNF[code];
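The tokenizer relies on the rule form of StringSplit, which keeps each delimiter as a token of its own instead of discarding it. A minimal sketch with only two of the terminal rules (the names rules and tokens are illustrative):

```mathematica
(* Each rule maps a delimiter pattern to the token emitted in its place. *)
rules = {
   WhitespaceCharacter ... ~~ ">" ~~ WhitespaceCharacter ... -> ">",
   "." -> "."};

(* Whitespace around ">" is absorbed into the combinator token. *)
tokens = StringSplit["div > span.x", rules]
(* ==> {"div", ">", "span", ".", "x"} *)
```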

With that in place, we simply need to change the parser in the positionSelector definition:

positionSelectorFP[ctx_, sel_String] :=
 Fold[positionSelector[ctx, #1, #2] &,
    Position[ctx, First@#],
    Rest@#
    ] &@(pSELECTOR[tokenize[sel, terminalRules]]~Flatten~3)

Test:

xml = ImportString["<!doctype html><html><h1>Animals</h1>", "XMLObject"];
positionSelectorFP[xml, "body"]
positionSelectorFP[xml, "*"]
Extract[xml, positionSelectorFP[xml, "h1"]]

(* ==> {{2, 3, 1}} *)
(* ==> {{2, 3, 1, 3, 1}, {2, 3, 1}, {2}} *)
(* ==> {XMLElement["h1", {}, {"Animals"}]} *)

xml = ImportString["<body><div id='abc' class='aClass'>Eh... <b id='b1'><div><span>What's </span>up, <span id='def'>doc?</span>up, <span id='def'>doc?</span></div></b></div><div class='bClass' attr1='val1' attr2='val2'>Legen<div class='cClass'>--<div class='dClass'>Wait for it</div>--</div>Dary!</div><div id='abc'>No. <div attr1='val3'>I <b id='b2'><span>am</span></b> your</div><div> father.</div></div></body>", "XML", "NormalizeWhitespace" -> False];
result = positionSelectorFP[xml, "div>span"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 2, 3, 1, 3, 1}, {2, 3, 1, 3, 2, 3, 1, 3, 3}, {2, 3, 1, 3, 2, 3, 1, 3, 5}} *)

(* ==> ColumnForm[{
XMLElement["span", {}, {"What's "}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}], 
XMLElement["span", {"id" -> "def"}, {"doc?"}]}] *)

result = positionSelectorFP[xml, "[class]>div"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 2, 3, 2, 3, 2}, {2, 3, 2, 3, 2}} *)
(* ==> ColumnForm[{
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], 
XMLElement["div", {"class" -> "cClass"}, {"--", 
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], "--"}]}] *)

xml = ImportString["<!doctype html><html>
      <h1>Animals</h1>
      <h2 class='animal dog border-collie' id='lillith'><span class='christian Name'>Lillith&gt;</span></h2>
      <h2 class='animal dog mutt' id='maggie'><span class='christian Name'>Maggie</span><span class='nick Name'>Fatty</span></h2>
      </html>", "XMLObject"];

result = positionSelectorFP[xml, "h2+.dog"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {
XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], 
XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

result = positionSelectorFP[xml, "h2+#maggie"]
Extract[xml, result] // ColumnForm

(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> ColumnForm[{
XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {
XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], 
XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]}] *)

positionSelectorFP[xml, "h1~.dog"]
positionSelectorFP[xml, "h1~b"]

(* ==> {{2, 3, 1, 3, 3}, {2, 3, 1, 3, 5}} *)
(* ==> {} *)

Awesome! Thanks a lot for sharing! – C. E., Jun 04 '14 at 11:30
