Symbolic XML is a convenient way of managing XML in Mathematica, at least for casual use or small tasks. Here and elsewhere you can find examples of XML handling in Mathematica, including XSLT/XPath emulations. In my opinion, however, if you must perform frequent searches, content extraction, or deep transformations (e.g., web scraping), Symbolic XML quickly becomes clumsy and cumbersome to work with. If you have done any kind of web programming, you're probably aware of the convenient CSS selectors, the more common choice (the other being XPath) for tackling this kind of problem.
I'm curious: are there implementations of CSS selector (and/or XPath) engines in Mathematica? If not, would it be feasible/sensible to implement similar capabilities in Mathematica? Or is it better to stick with ad hoc queries and follow the approach of the examples?
As elegant code is better than a thousand opinions, we could illustrate the answers with this toy, greatly simplified CSS selector engine:
Pattern | Matches | Selector
---------------| -------------------------------------------------------------------- | ----------------
* | any element. | Universal
E | any E element (i.e., an element of type E). | Type
E.warning | any E element whose class is "warning". | Class
E#myid | any E element with ID equal to "myid". | Id
E F | any F element that is a descendant of an E element. | Descendant
E > F | any F element that is a child of an element E. | Child
E + F | any F element immediately preceded by a sibling element E. | Adjacent Sibling
E ~ F | any F element preceded by a sibling element E. | General Sibling
E[foo] | any E element with the "foo" attribute set (whatever the value). | Attribute
E[foo="bar"] | any E element whose "foo" attribute value is exactly equal to "bar". | Attribute
A selector is a chain of one or more simple selectors separated by combinators. The combinators are white space, ">", "+", and "~". White space may appear between a combinator and the simple selectors around it. A simple selector is either a type selector or the universal selector, followed immediately by zero or more attribute selectors, ID selectors, or class selectors, in any order. The universal selector may be omitted if it is followed by other simple selector components. A simple selector matches if all of its components match. Examples: "div", all elements with tag "div"; "div *", all descendants at any level of all div elements; ".aClass+[attr=val]", elements with attr equal to val that immediately follow a sibling element with class aClass.
If you need a formal grammar, take a look at the W3C spec, but for this illustration I think we can dispense with error handling and strict processing.
Then, given context, any expression containing XMLElements, and selector, a string that complies with the toy CSS grammar, implement positionSelector[context, selector] so that it returns the positions (in the same format as the built-in Position) of the XMLElements matching selector:
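To make the grammar above concrete, here is a minimal sketch of how a single simple selector could be split into its components with string patterns. The names ident and parseSimpleSelector are hypothetical (they are not built-ins), combinators are not handled, and, in the spirit of the toy grammar, there is no error handling.

```mathematica
(* Assumption: identifiers are runs of letters, digits, "-" and "_" *)
ident = (WordCharacter | "-" | "_") ..;

(* Hypothetical helper: split ONE simple selector (no combinators)
   into a list of tagged components, in order of appearance *)
parseSimpleSelector[s_String] := StringCases[s, {
    "*" -> {"Universal"},
    "#" ~~ (id : ident) :> {"Id", id},
    "." ~~ (cls : ident) :> {"Class", cls},
    (* [foo="bar"]: quotes around the value, if any, are trimmed *)
    "[" ~~ (a : ident) ~~ "=" ~~ (v : Except["]"] ...) ~~ "]" :>
      {"Attribute", a, StringTrim[v, "\"" | "'"]},
    (* [foo]: attribute present, whatever its value *)
    "[" ~~ (a : ident) ~~ "]" :> {"Attribute", a},
    (* anything else that looks like an identifier is a type selector *)
    (tag : ident) :> {"Type", tag}
   }]

parseSimpleSelector["div.warning#x[foo=\"bar\"]"]
(* ==> {{"Type", "div"}, {"Class", "warning"}, {"Id", "x"}, {"Attribute", "foo", "bar"}} *)
```

Each component starts with a distinct character, so the alternatives never compete for the same position in the string.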
xml = ImportString["<!doctype html><html><h1>Animals</h1>", "XMLObject"];
positionSelector[xml, "body"]
positionSelector[xml, "*"]
Extract[xml, positionSelector[xml, "h1"]]
(* ==> {{2, 3, 1}} *)
(* ==> {{2, 3, 1, 3, 1}, {2, 3, 1}, {2}} *)
(* ==> {XMLElement["h1", {}, {"Animals"}]} *)
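For the universal selector and bare type selectors, the positions above can be reproduced with Position and a pattern on XMLElement. The following is only a sketch of that idea: simplePositionSelector and positionById are hypothetical names, and the document is built by hand (rather than imported) so that the example does not depend on how ImportString parses HTML.

```mathematica
(* Hypothetical, reduced variant: "*" and bare type selectors only *)
simplePositionSelector[context_, "*"] := 
  Position[context, XMLElement[_, _, _]]
simplePositionSelector[context_, tag_String] := 
  Position[context, XMLElement[tag, _, _]]

(* Hypothetical helper for "#id": match on the attribute list directly *)
positionById[context_, id_String] := 
  Position[context, XMLElement[_, {___, "id" -> id, ___}, _]]

(* Hand-built document with the shape html > body > h1 *)
doc = XMLObject["Document"][{}, 
   XMLElement["html", {}, {
     XMLElement["body", {}, {
       XMLElement["h1", {"id" -> "top"}, {"Animals"}]}]}], {}];

simplePositionSelector[doc, "h1"]
(* ==> {{2, 3, 1, 3, 1}} *)
simplePositionSelector[doc, "*"]
(* ==> {{2, 3, 1, 3, 1}, {2, 3, 1}, {2}} *)
positionById[doc, "top"]
(* ==> {{2, 3, 1, 3, 1}} *)
```

Position lists children before their parent, which is why the result for "*" runs from the innermost element outwards.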
xml = ImportString["<body><div id='abc' class='aClass'>Eh... <b id='b1'><div><span>What's </span>up, <span id='def'>doc?</span>up, <span id='def'>doc?</span></div></b></div><div class='bClass' attr1='val1' attr2='val2'>Legen<div class='cClass'>--<div class='dClass'>Wait for it</div>--</div>Dary!</div><div id='abc'>No. <div attr1='val3'>I <b id='b2'><span>am</span></b> your</div><div> father.</div></div></body>", "XML", "NormalizeWhitespace" -> False];
result = positionSelector[xml, "div>span"]
Extract[xml, result] // ColumnForm
(* ==> {{2, 3, 1, 3, 2, 3, 1, 3, 1}, {2, 3, 1, 3, 2, 3, 1, 3, 3}, {2, 3, 1, 3, 2, 3, 1, 3, 5}} *)
(* ==>
{XMLElement["span", {}, {"What's "}],
XMLElement["span", {"id" -> "def"}, {"doc?"}],
XMLElement["span", {"id" -> "def"}, {"doc?"}]}
*)
result = positionSelector[xml, "[class]>div"]
Extract[xml, result] // ColumnForm
(* ==> {{2, 3, 2, 3, 2, 3, 2}, {2, 3, 2, 3, 2}} *)
(* ==>
{XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}],
XMLElement["div", {"class" -> "cClass"}, {"--",
XMLElement["div", {"class" -> "dClass"}, {"Wait for it"}], "--"}]}
*)
xml = ImportString["<!doctype html><html>
<h1>Animals</h1>
<h2 class='animal dog border-collie' id='lillith'><span class='christian Name'>Lillith></span></h2>
<h2 class='animal dog mutt' id='maggie'><span class='christian Name'>Maggie</span><span class='nick Name'>Fatty</span></h2>
</html>",
"XMLObject"];
result = positionSelector[xml, "h2+.dog"]
Extract[xml, result] // ColumnForm
(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> {XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]} *)
result = positionSelector[xml, "h2+#maggie"]
Extract[xml, result] // ColumnForm
(* ==> {{2, 3, 1, 3, 5}} *)
(* ==> {XMLElement["h2", {"class" -> "animal dog mutt", "id" -> "maggie"}, {XMLElement["span", {"class" -> "christian Name"}, {"Maggie"}], XMLElement["span", {"class" -> "nick Name"}, {"Fatty"}]}]} *)
positionSelector[xml, "h1~.dog"]
(* ==> {{2, 3, 1, 3, 3}, {2, 3, 1, 3, 5}} *)
positionSelector[xml, "h1~b"]
(* ==> {} *)
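The combinator examples can be read as arithmetic on position lists: an element at position p sits in its parent's content list, so p is the parent's position followed by {3, i}. Below is a sketch of the child and sibling combinators built on that observation. All function names are hypothetical, the inputs are lists of positions already found for E and F, and the hand-built document only mimics the shape of the one above, with text nodes between the headings.

```mathematica
(* Position of the enclosing element; {} stands for "no parent element" *)
parentPosition[p_List] := If[Length[p] > 2, Drop[p, -2], {}]

(* Positions of the element siblings preceding p, in document order;
   text nodes between elements are filtered out *)
precedingSiblings[expr_, p_List] := 
  Select[Table[Append[Most[p], i], {i, Last[p] - 1}], 
   MatchQ[Extract[expr, #], _XMLElement] &]

(* E > F: keep the F positions whose parent is one of the E positions *)
childSelect[ePos_List, fPos_List] := 
  Select[fPos, MemberQ[ePos, parentPosition[#]] &]

(* E ~ F: some preceding element sibling is one of the E positions *)
generalSiblingSelect[expr_, ePos_List, fPos_List] := 
  Select[fPos, Intersection[ePos, precedingSiblings[expr, #]] =!= {} &]

(* E + F: only the nearest preceding element sibling counts *)
adjacentSiblingSelect[expr_, ePos_List, fPos_List] := 
  Select[fPos, 
   With[{sibs = precedingSiblings[expr, #]}, 
     sibs =!= {} && MemberQ[ePos, Last[sibs]]] &]

doc = XMLObject["Document"][{}, 
   XMLElement["html", {}, {XMLElement["body", {}, {
       XMLElement["h1", {}, {"Animals"}], "\n",
       XMLElement["h2", {"id" -> "lillith"}, {}], "\n",
       XMLElement["h2", {"id" -> "maggie"}, {}]}]}], {}];
h1 = Position[doc, XMLElement["h1", _, _]];
h2 = Position[doc, XMLElement["h2", _, _]];

adjacentSiblingSelect[doc, h2, h2]   (* like "h2+h2" *)
(* ==> {{2, 3, 1, 3, 5}} *)
generalSiblingSelect[doc, h1, h2]    (* like "h1~h2" *)
(* ==> {{2, 3, 1, 3, 3}, {2, 3, 1, 3, 5}} *)
```

Working on positions rather than on extracted elements keeps every step compatible with Extract and with the output format of Position.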
To be clear, I'm not looking for a full-blown implementation, but for strategies and idioms that show how Mathematica can resolve XML queries, and in which ways Mathematica is better equipped for this than mainstream functional/imperative languages.