I'm building a tool that runs XPath 1.0 queries against XHTML documents. The requirement to use a namespace prefix in the query is killing me. Without prefixes, the query looks like this:

html/body/div[@class='contents']/div[@class='body']/
    div[@class='pgdbbyauthor']/h2[a[@name][starts-with(.,'Quick')]]/
    following-sibling::ul[1]/li/a

(all on one line)

...which is bad enough. But because it's XPath 1.0, I need an explicit namespace prefix on each element name, so it actually looks like this:

ns1:html/ns1:body/ns1:div[@class='contents']/ns1:div[@class='body']/
    ns1:div[@class='pgdbbyauthor']/ns1:h2[ns1:a[@name][starts-with(.,'Quick')]]/
    following-sibling::ns1:ul[1]/ns1:li/ns1:a

To set up the query, I do something like this:

// theText, xmlNamespaces (prefix -> namespace URI) and xpathExpression are defined elsewhere
var xpathDoc = new XPathDocument(new StringReader(theText));
var nav = xpathDoc.CreateNavigator();
var xmlns = new XmlNamespaceManager(nav.NameTable);
foreach (string prefix in xmlNamespaces.Keys)
    xmlns.AddNamespace(prefix, xmlNamespaces[prefix]);
XPathNodeIterator selection = nav.Select(xpathExpression, xmlns);

But what I want is for the xpathExpression to use the implicit default namespace.

Is there a way for me to transform the unadorned xpath expression, after it's been written, to inject a namespace prefix for each element name in the query?

I'm thinking I could inject a prefix in front of anything that sits between two slashes, except of course axis names like "parent::" and "preceding-sibling::", and wildcards. That's what I mean by "finagle a default namespace".

Is this hack gonna work?


Addendum
Here's what I mean. Suppose I have an XPath expression and, before passing it to nav.Select(), I transform it. Something like this:

string FixupWithDefaultNamespace(string expr)
{
    string s = expr;
    s = Regex.Replace(s, "^(?!::)([^/:]+)(?=/)", "ns1:$1");                        // beginning
    s = Regex.Replace(s, "/([^/:]+)(?=/)", "/ns1:$1");                             // stanza
    s = Regex.Replace(s, "::([A-Za-z][^/:*]*)(?=/)", "::ns1:$1");                  // axis specifier
    s = Regex.Replace(s, "\\[([A-Za-z][^/:*\\(]*)(?=[\\[\\]])", "[ns1:$1");        // predicate
    s = Regex.Replace(s, "/([A-Za-z][^/:]*)(?!<::)$", "/ns1:$1");                  // end
    s = Regex.Replace(s, "^([A-Za-z][^/:]*)$", "ns1:$1");                          // edge case
    s = Regex.Replace(s, "([-A-Za-z]+)\\(([^/:\\.,\\)]+)(?=[,\\)])", "$1(ns1:$2"); // xpath functions

    return s;
}

This actually works for the simple cases I tried. To use the example from above: if the input is the first XPath expression, the output is the second one, with all the ns1 prefixes. The real question is: is it hopeless to expect this Regex.Replace approach to keep working as the XPath expressions get more complicated?

+1  A: 

No, the XPath W3C spec is explicit about this:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context."

Any attempt to dynamically "massage" an XPath expression that is not known in advance, in order to make it succeed in this situation, is generally going to fail. To do it reliably, one would have to parse the XPath expression completely and isolate every name that is an element name rather than an axis, operator, or function name, and that is not something I would ask anyone to do. It may even be extremely difficult to find the start of each location step (the "/" operator) correctly, because the character "/" may be part of a string literal.

One example of a (sub)expression that needs complete parsing, because the first and third "div" are element names while the middle one is the division operator, is:

div div div

If some constraints are imposed on the XPath expressions, then such an approach might have limited success, but I still wouldn't advise going for it, because nobody can prove that the regexes really work correctly in all cases.
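
To make the "div div div" case concrete, here is a minimal sketch using the standard System.Xml.XPath API. The sample document, the urn:x namespace, and the names DivDemo and Eval are made up for illustration; they are not from the question. It evaluates the one correct rewrite (prefix on the first and third name only) and then the rewrite a naive "prefix every name" pass would produce.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class DivDemo
{
    // Hypothetical sample: <r> in namespace urn:x has one child element named div.
    const string Xml = "<r xmlns='urn:x'><div>6</div></r>";

    // Evaluates expr with <r> as the context node and ns1 bound to urn:x.
    public static object Eval(string expr)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(Xml)).CreateNavigator();
        var xmlns = new XmlNamespaceManager(nav.NameTable);
        xmlns.AddNamespace("ns1", "urn:x");
        XPathNavigator r = nav.SelectSingleNode("/ns1:r", xmlns);
        return r.Evaluate(expr, xmlns);
    }

    static void Main()
    {
        // Correct rewrite of "div div div": element, operator, element (6 div 6).
        Console.WriteLine(Eval("ns1:div div ns1:div"));

        // Prefixing every name turns the operator into an element name,
        // so the expression no longer parses.
        try { Eval("ns1:div ns1:div ns1:div"); }
        catch (XPathException e) { Console.WriteLine("rejected: " + e.Message); }
    }
}
```

The first call should evaluate to 1 and the second should be rejected with an XPathException. Telling the two roles of "div" apart requires knowing the preceding token, which is exactly the kind of context a sequence of independent regex substitutions does not have.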

Dimitre Novatchev
+1  A: 

If you know there is only one namespace (i.e. the XHTML namespace) and it's declared as the default namespace, then you can cheat by loading the document with an XmlTextReader that is not namespace-aware, as follows:

// Requires: using System; using System.IO; using System.Xml; using System.Xml.XPath;
XmlTextReader tr = new XmlTextReader(new StringReader(@"<html xmlns=""http://www.w3.org/1999/xhtml"">
  <head>
    <title>Test</title>
  </head>
  <body>
    <h1>Example</h1>
  </body>
</html>"));
tr.Namespaces = false; // not namespace-aware: xmlns is read as an ordinary attribute
XPathDocument doc = new XPathDocument(tr);
tr.Close();
Console.WriteLine(doc.CreateNavigator().SelectSingleNode("html/body/h1").Value);

That works for me and outputs "Example", so the path "html/body/h1" finds that "h1" element. Another option is to run the input through a stylesheet that strips the namespaces first, and then query the transformation result.
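
A minimal sketch of that stylesheet route, assuming XSLT 1.0 via XslCompiledTransform. The class name NamespaceStripper, the method name Strip, and the stylesheet itself are illustrative and not taken from any library. Every element and attribute is rebuilt under its local name only, so an attribute such as xml:lang would come out as plain lang.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

static class NamespaceStripper
{
    // XSLT 1.0: recreate each element and attribute using only its local name.
    const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='*'>
    <xsl:element name='{local-name()}'>
      <xsl:apply-templates select='@* | node()'/>
    </xsl:element>
  </xsl:template>
  <xsl:template match='@*'>
    <xsl:attribute name='{local-name()}'><xsl:value-of select='.'/></xsl:attribute>
  </xsl:template>
  <xsl:template match='text() | comment() | processing-instruction()'>
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>";

    // Returns a namespace-free copy of the input, ready for unprefixed XPath 1.0 queries.
    public static XPathDocument Strip(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(sheet);

        var buffer = new StringBuilder();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter output = XmlWriter.Create(buffer, xslt.OutputSettings))
            xslt.Transform(input, output);

        return new XPathDocument(new StringReader(buffer.ToString()));
    }

    static void Main()
    {
        string xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body><h1>Example</h1></body></html>";
        XPathNavigator nav = Strip(xhtml).CreateNavigator();
        Console.WriteLine(nav.SelectSingleNode("html/body/h1").Value);
    }
}
```

Unlike the XmlTextReader trick, this also copes with documents that mix prefixed and default namespaces, at the cost of losing the distinction between same-named elements from different namespaces.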

And of course you could consider not relying on the Microsoft XPath 1.0 implementation and moving to a third-party XPath 2.0 or XQuery 1.0 implementation such as Saxon or XQSharp. Then you can define a default element namespace for your XPath or XQuery expressions and use paths without prefixes to select elements in the XHTML namespace.

Martin Honnen
Ahh, interesting. I didn't know about the third-party XPath options! Thank you! And, it's not only xhtml. I'd want to solve the general case.
Cheeso