I'm trying to find all the tables below my current node without also including the nested tables. In other words, given the markup below, I want to find the tables marked "yes" and not the ones marked "no":

<table> <!-- outer table - no -->
  <tr><td>
    <div> <!-- *** context node *** -->
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
    </div>
  </td></tr>
</table>

Is there an easy way to do this in XPath 1.0? (In 2.0, it'd be .//table except .//table//table, but 2.0 isn't an option for me.)

EDIT: Please note that the answers so far do not respect the idea of the current context node. I don't know how far down the first layer of tables might be (and it might differ from case to case), and I also don't know whether I might already be inside another table (or two or three).

Literally, I want what .//table except .//table//table would give me in XPath 2.0, but I have only XPath 1.0.

A: 

I think you want child::table, which abbreviates to table. This first script prints the address of every div and table in the tree:

#!/usr/bin/perl --
use strict;
use warnings;

use HTML::TreeBuilder;
{
  my $tree = HTML::TreeBuilder->new();

  $tree->parse(<<'__HTML__');
<table> <!-- outer table - no -->
  <tr><td>
    <div> <!-- *** context node *** -->
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
    </div>
  </td></tr>
</table>
__HTML__

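  # Build a path-like address (tag[index] per level) for an element.
  sub HTML::Element::addressx {
    return join(
      '/',
      '/',        # joined with the separator, this yields the leading '//'
      reverse(    # lineage runs child-to-root, so reverse to start at the top
        map {
          my $n = $_->pindex() || '0';    # root's pindex is undef -> '0'
          my $t = $_->tag;
          $t . '[' . $n . ']'
        }
        $_[0],            # self and...
        $_[0]->lineage    # ...all of its ancestors
      )
    );
  } ## end sub HTML::Element::addressx

  for my $node ( $tree->look_down( _tag => qr/div|table/i ) ) {
    print $node->addressx, "\n";
  }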
  $tree->delete;
  undef $tree;
}
__END__
//html[0]/body[1]/table[0]
//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]
//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]/table[0]
//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]/table[0]/tr[0]/td[0]/table[0]
//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]/table[1]
//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]/table[1]/tr[0]/td[0]/table[0]

The second part selects the context div and then takes only its child tables:

#!/usr/bin/perl --

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(<<'__HTML__');
<table> <!-- outer table - no -->
  <tr><td>
    <div> <!-- *** context node *** -->
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
    </div>
  </td></tr>
</table>
__HTML__



#~ for my $result ($tree->findnodes(q{//html[0]/body[1]/table[0]/tr[0]/td[0]/div[0]})) {
for my $result ($tree->findnodes(q{/html/body/table/tr/td/div})) {
    print $result->as_HTML,"\n\n";
    for my $table( $result->findnodes(q{table}) ){ ## child::table
        print "$table\n";
        print $table->as_HTML,"\n\n\n";
    }

}

__END__
<div><table><tr><td><table><tr><td> ... </td></tr></table></td></tr></table><table><tr><td><table><tr><td> ... </td></tr></table></td></tr></table></div>


HTML::Element=HASH(0xc6c964)
<table><tr><td><table><tr><td> ... </td></tr></table></td></tr></table>



HTML::Element=HASH(0xc6cbf4)
<table><tr><td><table><tr><td> ... </td></tr></table></td></tr></table>
ricky
Nope, not respecting the context node.
Randal Schwartz
A: 

Well, if I understand the question, content_list can solve it:

my $table_one = $tree->findnodes('/html//table')->[1];

for ( $table_one->content_list ) {
    last if $_->exists('table');
    print $_->as_text;
}   

:)

Mantovani
Nope. That doesn't respect the context node.
Randal Schwartz
A: 

What about .//table[not(.//table)]? Sorry for brevity, I'm on my phone.

Dominic Mitchell
Nope, that finds all tables that don't have tables in them. I want all tables that are not within tables.
Randal Schwartz
OK, how about .//table[not(ancestor::table)] ? That's quite likely to be inefficient though, unless you're doing it in something like eXist, which has the indexes to support it.
Dominic Mitchell
Nope. That finds all tables as long as they're not within *any* table. But consider what happens if our context node is already within a table. It'd find *nothing*. Nope, not the answer.
Randal Schwartz
A: 
Mads Hansen
Yeah, that still won't work, since it'll rule out any tables that are within any table that are within any div. :) Not respecting the context again.
Randal Schwartz
Updated my answer. It's not pure XPath, but it is an XPath 1.0 (and XSLT 1.0) solution.
Mads Hansen
Yeah, no XSLT here. So that doesn't do it either. {Sigh}.
Randal Schwartz
A: 

After investigating it here and elsewhere, the answer seems to be "you can't, and that's why we have XPath 2.0". Oh well.
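
For anyone who can step outside a single XPath expression, here is a minimal sketch of one possible workaround done in the host language (Perl with HTML::TreeBuilder::XPath, as used in the other answers). It is not pure XPath 1.0: it assumes you can run two queries, first counting the tables that enclose the context node, then interpolating that number into a second expression. The variable names ($context, $depth, @first_layer) and the absolute path used to locate the context node are purely illustrative.

```perl
#!/usr/bin/perl --
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(<<'__HTML__');
<table> <!-- outer table - no -->
  <tr><td>
    <div> <!-- *** context node *** -->
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
      <table> <!-- yes -->
        <tr><td>
          <table> ... </table> <!-- no -->
        </td></tr>
      </table>
    </div>
  </td></tr>
</table>
__HTML__

# Illustrative only: in real code the context node comes from wherever
# you already are in the tree, not from a fixed absolute path.
my ($context) = $tree->findnodes(q{/html/body/table/tr/td/div});

# Query 1: how many tables enclose the context node (counting the
# context node itself, in case it is a table)?
my $depth = $context->findvalue(q{count(ancestor-or-self::table)});

# Query 2: keep only the descendant tables that have exactly that many
# table ancestors, i.e. no extra table between them and the context node.
my @first_layer =
  $context->findnodes(qq{.//table[count(ancestor::table) = $depth]});

print "enclosing tables: $depth\n";
print "first-layer tables found: ", scalar(@first_layer), "\n";
print $_->as_HTML, "\n\n" for @first_layer;

$tree->delete;
```

For the sample markup, the context div sits inside one table, so $depth should come out as 1 and the second query should select the two tables marked "yes". The idea is that every descendant table shares the context node's enclosing tables, so any surplus in count(ancestor::table) means another table lies in between.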

Randal Schwartz