I've got an XHTML document, and I want to select the only table in it with class="index".

If I understand correctly, the descendant axis will select all nodes directly and indirectly descending from the current node, so here's what I've got.

//descendant::table[@class="index"]

It doesn't appear to be working when tested with xmlstarlet. Is my tool broken, or is the XPath expression wrong?

+2  A: 

I think //table[@class="index"] is what you want.

Brian Agnew
A: 

Yes, the descendant axis selects all nodes descending from the context node. But the key here is the context node.

For instance, descendant::span will retrieve all span descendants of the current node. In the same vein, descendant::* will retrieve all descendant elements of the current node.
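To make the role of the context node concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document and the ids "a" and "b" are made up for illustration. ElementTree only implements a subset of XPath and has no descendant:: axis syntax, so ".//span" stands in for descendant::span here.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, only for illustration.
DOC = """<body>
  <div id="a"><p><span>one</span></p><span>two</span></div>
  <div id="b"><span>three</span></div>
</body>"""

root = ET.fromstring(DOC)
div_a = root.find("./div[@id='a']")

# Same relative expression, two different context nodes.
from_root = [s.text for s in root.findall(".//span")]
from_div_a = [s.text for s in div_a.findall(".//span")]

print(from_root)   # spans anywhere under <body>
print(from_div_a)  # only spans under <div id="a">
```

The expression is identical in both calls; only the node it is evaluated against changes, and with it the result.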

If you need to match the table itself, the XPath you provided works fine in my test:

//descendant::table[@class="index"]

... selects the table element itself. Its child nodes come along only as part of that element, not as separate matches.

If you only need to match the elements inside the table, first match the node you want and then match its descendants:

//table[@class="index"]/descendant::*

... selects every element nested inside the table (children and deeper descendants), but not the table itself.
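As a rough check of the difference between the two expressions, here is a sketch with Python's standard xml.etree.ElementTree. The sample markup is invented. ElementTree supports only a subset of XPath (no descendant:: axis, no absolute paths on an element), so ".//table[...]" and ".//table[...]//*" are used as the closest equivalents of the two expressions above.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: two tables, only one has class="index".
DOC = """<html><body>
<table class="other"><tr><td>skip</td></tr></table>
<table class="index"><tr><td>a</td><td>b</td></tr></table>
</body></html>"""

root = ET.fromstring(DOC)

# Counterpart of //descendant::table[@class="index"]: the table element itself.
tables = root.findall(".//table[@class='index']")

# Counterpart of //table[@class="index"]/descendant::*: everything inside it.
inner = root.findall(".//table[@class='index']//*")

table_tags = [e.tag for e in tables]
inner_tags = [e.tag for e in inner]
print(table_tags)
print(inner_tags)
```

The first query returns a single node (the table), while the second returns the tr and td elements nested in it and leaves the table out.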

Cerebrus
+1  A: 

Based on your example page (metacritic.com/film/highscores.shtml), I would say you need to use:

//TABLE[@CLASS="index"] 
(or /descendant::TABLE[@CLASS="index"])

This is because the TABLE with CLASS index is written in upper case on your example page (XML and XPath are case sensitive).
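A small sketch of the case-sensitivity problem, using Python's standard xml.etree.ElementTree on an invented, well-formed snippet with upper-case names (the real page would still need tidying before an XML parser accepts it):

```python
import xml.etree.ElementTree as ET

# Invented snippet that mimics upper-case markup.
DOC = '<HTML><BODY><TABLE CLASS="index"><TR><TD>1</TD></TR></TABLE></BODY></HTML>'
root = ET.fromstring(DOC)

# Lower-case names do not match upper-case elements or attributes.
lower_hits = root.findall(".//table[@class='index']")

# Matching the case used in the document works.
upper_hits = root.findall(".//TABLE[@CLASS='index']")

print(len(lower_hits), len(upper_hits))
```

The lower-case query comes back empty even though the table is there, which looks exactly like "the XPath doesn't work".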

This will work if you are targeting a specific page, but will probably become a problem if different pages use different case for the same HTML tags.

Then you'll need an abomination like:

//TABLE[@CLASS="index" or @class="index" or @Class="index" or ...]
|//table[@CLASS="index" or @class="index" or ...]
|...

So you'll probably need to keep using Tidy before extracting information, or switch to a tool that's specialized for HTML scraping (instead of XPath).
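If the markup is already well-formed after tidying, one possible alternative to enumerating every case variant is to lowercase the names once and then query with a single lower-case expression. This is only a sketch with Python's standard xml.etree.ElementTree. The helper name lowercase_names is hypothetical, and it assumes the document uses no XML namespaces and has no attributes that differ only by case.

```python
import xml.etree.ElementTree as ET

def lowercase_names(root):
    """Lowercase every element and attribute name in place (hypothetical helper).

    Assumes no namespaces: a namespaced tag such as "{uri}TABLE" would have
    its URI lowercased as well, which is usually not what you want.
    """
    for elem in root.iter():
        if isinstance(elem.tag, str):  # skip comments / processing instructions
            elem.tag = elem.tag.lower()
        lowered = {name.lower(): value for name, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(lowered)
    return root

# Invented snippet with mixed-case names.
DOC = '<HTML><BODY><Table Class="index"><TR><TD>1</TD></TR></Table></BODY></HTML>'
root = lowercase_names(ET.fromstring(DOC))

hits = root.findall(".//table[@class='index']")
print(len(hits))
```

Attribute values are left untouched, so class="Index" would still not match "index".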

ckarras
Indeed, Tidy is part of the process, but it fails on some poorly formed HTML that puts a td inside a form. I've already got a nearly working version based on BeautifulSoup and uTidy; figuring out how to fix the ugly form markup via Tidy or sed is the next step, I think.
jldugger