I want to write an XPath that can return some link elements on an HTML DOM.

The syntax is wrong, but here is the gist of what I want:

//web:link[@text='Login' THEN_TRY @href='Login.php' THEN_TRY @index=0]

THEN_TRY is a made-up operator, because I can't find which operator(s) to use. If several links on the page match the given set of [attribute=value] pairs, the link that matches the leftmost attribute(s) should be returned in preference to any others.

For example, consider a case where the above example XPath finds 3 links that match any of the given attributes:

link A: text='Sign In', href='Login.php', index=0
link B: text='Login', href='Signin.php', index=15
link C: text='Login', href='Login.php', index=22

Link C ranks as the best match because it matches the First and Second attributes.

Link B ranks second because it only matches the First attribute.

Link A ranks last because it does not match the First attribute; it only matches the Second and Third attributes.

The XPath should return the best match, Link C.

If more than one link were tied for "best match", the XPath should return the first best link that it found on the page.

A: 

Try the or operator, as in:

//web:link[@text='Login' or @href='Login.php' or @index=0]

However, that will probably give you all those nodes rather than only one in the priority specified.

Update
So, I tried this out and it works. It's long, but it should do what you need (with appropriate changes for your schema).

//link[@text='Login'] | //link[not(//link[@text='Login']) and @href='Login.php'] | //link[not(//link[@text='Login']) and not(//link[@href='Login.php']) and @index='0']

I ran it on the following test XML, commenting out each line to test the different parts and it works as expected.

<?xml version="1.0" encoding="utf-8"?>
<Test>
  <link text='Sign In' href='Login2.php' index="0"></link>
  <link  text='Login' href='Signin.php' index="15"></link>
  <link  text='LoginBlah' href='Login.php' index="22"></link>
</Test>

Update 2
I notice that I haven't quite solved the problem yet as you want the best match rather than a match in order of precedence. This can be done but would require a rather long XPath that does the equivalent of each combination in order. I don't know of any other way to simplify it.

Jeff Yates
+2  A: 

There's a brute-force solution. I'll demonstrate for two attributes instead of three.

(
  //web:link[@text != 'Login' and @href != 'Login.php'
             and not(//web:link[@text = 'Login' or @href = 'Login.php'])]
| //web:link[@text != 'Login' and @href = 'Login.php'
             and not(//web:link[@text = 'Login'])]
| //web:link[@text = 'Login' and @href != 'Login.php'
             and not(//web:link[@text = 'Login' and @href = 'Login.php'])]
| //web:link[@text = 'Login' and @href = 'Login.php']
)[1]

That is, select all the links where neither attribute matches, but only if there's no link that has a better match. Then select all the links that have the lesser attribute match, but only when there are no links with the superior attribute matching. Then select links where only the first attribute matches, but only if there are no links where both attributes match. Then select links where both attributes match. Only one of those four operands will be non-empty, so the "|" operator never actually combines anything. Finally, select the first link in document order, in case any of those node-sets had more than one element.

The reason I only did two attributes instead of three is because I didn't want to type out all eight cases. You can omit the first case if you're not interested in any links unless at least one of the attributes matches.

This is a situation where you might be better off just selecting all the candidates in the much simpler query Jeff showed, and then using other code to rank the results afterward, where you can more readily use iteration and variables to choose the best candidate.
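As a rough sketch of that "rank the results afterward" idea, here is what it could look like in Python with the standard library's ElementTree. The sample document, the un-namespaced link element, and the names CRITERIA, match_key and best_link are all assumptions made up for illustration, not anything from the question's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modeled on the three links in the question.
XML = """<links>
  <link name="A" text="Sign In" href="Login.php" index="0"/>
  <link name="B" text="Login" href="Signin.php" index="15"/>
  <link name="C" text="Login" href="Login.php" index="22"/>
</links>"""

# Criteria in priority order, most important first (assumed values).
CRITERIA = [("text", "Login"), ("href", "Login.php"), ("index", "0")]

def match_key(link):
    # Tuple of booleans. Tuples compare left to right, so a match on an
    # earlier criterion outranks any combination of later ones.
    return tuple(link.get(attr) == value for attr, value in CRITERIA)

def best_link(root):
    # Keep only links that match at least one criterion, in document order.
    candidates = [link for link in root.iter("link") if any(match_key(link))]
    # max() returns the first maximal item, so ties go to the earliest link.
    return max(candidates, key=match_key, default=None)

root = ET.fromstring(XML)
best = best_link(root)
print(best.get("name") if best is not None else None)
```

Adding a fourth attribute only means adding one entry to the criteria list, rather than doubling the number of cases in the XPath.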

If you can use XPath 2, then you can use the comma operator to join node sequences (which supersede node-sets). Unlike "|", the comma operator keeps its operands in the order written instead of re-sorting them into document order. Try this, for example:

(
  //web:link[@text  = 'Login' and @href  = 'Login.php' and @index  = 0]
, //web:link[@text  = 'Login' and @href  = 'Login.php' and @index != 0]
, //web:link[@text  = 'Login' and @href != 'Login.php' and @index  = 0]
, //web:link[@text  = 'Login' and @href != 'Login.php' and @index != 0]
, //web:link[@text != 'Login' and @href  = 'Login.php' and @index  = 0]
, //web:link[@text != 'Login' and @href  = 'Login.php' and @index != 0]
, //web:link[@text != 'Login' and @href != 'Login.php' and @index  = 0]
, //web:link[@text != 'Login' and @href != 'Login.php' and @index != 0]
)[1]


As an aside, here's an easy way to assign a rank to each link, which makes comparing them pretty straightforward. Imagine a bit field, one bit for each attribute you want to check. If the first attribute matches, set the left-most bit, else leave it unset. If the second attribute matches, set the next most significant bit, etc. So for your example, you get the following bit values:

011   link A: text='Sign In', href='Login.php',  index=0
100   link B: text='Login',   href='Signin.php', index=15
110   link C: text='Login',   href='Login.php',  index=22

To select the best match, treat the bit fields as binary numbers. Link A has a score of 3, link B a score of 4, and link C a score of 6. (This is a little reminiscent of how the specificity of CSS selectors is determined.) This is a way of modeling the ordering criteria, but now that I've typed it all out, I don't quite see that it leads to any concise solution in XPath.
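For what it's worth, the bit-field arithmetic above is easy to check outside XPath. This is only a small Python sketch; the function name bit_score and the hard-coded match flags are my own illustration of the three example links.

```python
def bit_score(matches):
    # Fold the match flags into an integer, first flag = most significant bit.
    score = 0
    for matched in matches:
        score = (score << 1) | int(matched)
    return score

# (text matches, href matches, index matches) for each example link.
links = {
    "A": (False, True, True),    # 011
    "B": (True, False, False),   # 100
    "C": (True, True, False),    # 110
}

scores = {name: bit_score(flags) for name, flags in links.items()}
print(scores)
print(max(scores, key=scores.get))
```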

Rob Kennedy
+2  A: 

The previous two answers do not seem to be exact.

Here is one possible solution:

You want to find the first node with the maximum value for the following function:

100*number(@text='Login') 
+10*number(@href='Login.php') 
+ 1*number(@index=0)

In XPath 2.0 this can be expressed as a single XPath expression in the following way:

  /*/link[
           100*number(@text='Login') 
           +10*number(@href='Login.php') 
           + 1*number(@index=0)

          eq
             max(/*/link
                     /(100*number(@text='Login') 
                       +10*number(@href='Login.php') 
                       + 1*number(@index=0)
                       )
                )

          ]

In XPath 1.0, constructing such a one-liner expression would be extremely difficult, if possible at all, and even if possible, the resulting XPath expression would be impossible to understand, prove correct, or maintain.

However, selecting the best-matching link element is possible within any language that is a host of XPath 1.0.

One example below is with XSLT 1.0 as the hosting language:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <xsl:template match="/">
      <xsl:for-each select="*/link">
        <xsl:sort data-type="number" order="descending" select=
        "100*(@text='Login') 
         +10*(@href='Login.php') 
         + 1*(@index=0)
        "/>
        <xsl:if test="position() = 1">
          <xsl:copy-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

When the above transformation is applied to this XML document:

<links>
  <link name="A" text="Sign in" href="Login.php" 
        index="0"/>
  <link name="B" text="Login" href="SignIn.php" 
        index="15"/>
  <link name="C" text="Login" href="Login.php" 
        index="22"/>
</links>

the correct result is produced:

<link name="C" text="Login" href="Login.php" index="22" />

This reminds me of another "Single XPath expression finding the best matches" problem I solved some seven years ago :)

Dimitre Novatchev
To be fair, I didn't specify XPath 1 or XPath 2. The simplicity of the XPath is also important. As the number of attributes grows, your XPath 2 solution is far less complex than Rob's and Jeff's solutions.
Thanks for accepting the solution. To be fair, the other two answers are *not* solutions. Jeff states this himself for his answer. Bob provides a correct idea, but at the end he says: "but now that I've typed it all out, I don't quite see that it leads to any concise solution in XPath."
Dimitre Novatchev
My XPath 1 answer wouldn't be extremely difficult, just extremely cumbersome. I demonstrated that it's possible, and I think I proved it correct. I wouldn't want to maintain it, though. I think both my suggested solutions would give correct answers, although neither is very good. Thank you for demonstrating how my last idea could indeed be used in an XPath solution. And please don't call me Bob.
Rob Kennedy
@Rob-kennedy Sorry for calling you Bob. As for your solutions, they seemed too short, and it seems not unlikely that an XML document could be constructed as a counter-example. Of course, I do not have the time either to construct such an example or to prove the correctness of your solution. We both know that the mere fact that you *think* your solution is correct does not necessarily mean it really is. This is one reason why simpler solutions that are easier to prove correct are preferred.
Dimitre Novatchev
@Rob-kennedy As for "demonstrating how my last idea could indeed be used in an XPath solution", I did this seven years ago :) See the link at the end of my answer.
Dimitre Novatchev
A: 

I had a similar problem today and arrived at a solution that will work in an XSLT context. For a pure XPath solution you'll need one of the other approaches.

<xsl:variable name="first" select="//web:link[@text='Login']"/>
<xsl:variable name="second" select="//web:link[@href='Login.php']"/>
<xsl:variable name="third" select="//web:link[@index=0]"/>
<xsl:variable name="theAnswer" 
 select="$first | $second[not($first)] | $third[not($first or $second)]"/>

Of course, the trick here is that an empty node set evaluates to false.
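The same fall-through trick can be sketched outside XSLT. Here is a rough Python equivalent using ElementTree, where an empty list plays the role of the empty node set. The sample document and the un-namespaced link element are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modeled on the three links in the question.
XML = """<links>
  <link name="A" text="Sign In" href="Login.php" index="0"/>
  <link name="B" text="Login" href="Signin.php" index="15"/>
  <link name="C" text="Login" href="Login.php" index="22"/>
</links>"""

root = ET.fromstring(XML)

first = root.findall(".//link[@text='Login']")
second = root.findall(".//link[@href='Login.php']")
third = root.findall(".//link[@index='0']")

# An empty list is falsy, so `or` falls through to the next non-empty group,
# just as an empty node set evaluates to false in the XSLT version.
the_answer = first or second or third
print([link.get("name") for link in the_answer])
```

Note that, like the XSLT, this picks by order of precedence only: it keeps every link that matches the highest-priority attribute found, without ranking those links against each other on the remaining attributes.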

Dominic Cronin