I'm using Hpricot's CSS search to identify a table within a web page. Here's a sample HTML snippet I'm parsing:

<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
 ...
</tbody></table>

There are lots of tables in the page. I want to find the table which contains the A Name=a1 reference. Right now, the way I'm doing it is

(page/"a[@name=a1]")[0].parent.parent.parent.parent.parent

I don't like this because

  • It is ugly
  • It is error prone (what if the folks who maintain the web page remove the tbody?)

Is there a way to tell hpricot to get me the table ancestor of the specified element?

Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm

The bits I'm interested in are the two tables, one with quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding the named anchors (such as a1 above) and moving up from there.

A: 

Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.

You might be able to use XPath to find the table and then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPath to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the XPath display, and paste that into your lookup.

"It is ugly": well, maybe, but not all code is beautiful or elegant, because not all problems lend themselves to beautiful or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why, you're ahead of many other coders.

"... what if the folks who maintain the web page remove the tbody?": almost all parsing of HTML or XML suffers from the same concern, because we're not in control of the source. You write your code as best you can, comment the spots that are likely to fail if the content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.

The only thing I'd suggest doing differently is to use % (AKA "at") instead of / (AKA "search"). % returns only the first occurrence, so you can drop the [0] index.

(page%"a[@name=a1]").parent.parent.parent.parent.parent

or

page%'//a[@name="a1"]/../../../../..'

which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
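If the fixed number of .parent hops is the real worry, one option is a small helper that keeps climbing until it reaches an element with the wanted tag name. The following is only a minimal sketch: nearest_ancestor is a hypothetical name, not an Hpricot method, and it assumes nothing more than that nodes respond to parent and elements respond to name. The Node struct is a stand-in so the example runs without Hpricot installed.

```ruby
# Hypothetical helper (not part of Hpricot): climb from +node+ to the
# nearest enclosing element named +tag+, or return nil if there is none.
# Assumes nodes respond to #parent and elements respond to #name.
def nearest_ancestor(node, tag)
  wanted = tag.to_s.downcase
  current = node.respond_to?(:parent) ? node.parent : nil
  while current
    return current if current.respond_to?(:name) && current.name.to_s.downcase == wanted
    current = current.respond_to?(:parent) ? current.parent : nil
  end
  nil
end

# Stand-in nodes mirroring the snippet in the question (a > b > font > td > tr > tbody > table).
Node = Struct.new(:name, :parent)
table  = Node.new('table', nil)
tbody  = Node.new('tbody', table)
tr     = Node.new('tr', tbody)
td     = Node.new('td', tr)
font   = Node.new('font', td)
bold   = Node.new('b', font)
anchor = Node.new('a', bold)

puts nearest_ancestor(anchor, 'table').name
```

With Hpricot the call would presumably look like nearest_ancestor(page % "a[@name=a1]", 'table'), though I have not checked that against a particular Hpricot version. The point of the design is that the result no longer depends on whether a tbody sits between the row and the table.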

If you know that the target table is the only one with that width and height, you can use a more specific xpath:

page%'//table[@height=61 and @width=700]'

I recommend Nokogiri over Hpricot.


You can also use XPath from the top of the document down:

irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil

Basically the XPath pattern means (XPath positions count from 1):

Find the body tag, then its second table, then the second cell in that table's row. In that cell, locate the second table.

Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.

The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox, so you have to strip the tbody. You also don't need to anchor at /html.
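That cleanup of a path copied from Firefox can be done mechanically. A rough sketch, where strip_browser_xpath is a hypothetical helper name and the only assumption is that every tbody step in the path was inserted by the browser rather than present in the served HTML:

```ruby
# Hypothetical helper: adapt an XPath copied from Firefox/Firebug so it can
# be tried against the HTML as served.
# Assumes every tbody step was added by the browser, not by the page author.
def strip_browser_xpath(path)
  path.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '') # drop tbody steps
      .sub(%r{\A/html(?=/)}, '/')             # "/html/body/..." becomes "//body/..."
end

puts strip_browser_xpath('/html/body/table[2]/tbody/tr/td[2]/table[3]')
```

If the page really does contain its own tbody tags, stripping them would break the path, so it is worth checking the raw source first.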

Greg
Thank you for the % suggestion. I agree that all parsing of web pages is error prone if we don't have control of the source, and I'm not trying for hundred percent resilience in the face of change. But it would be nice to be able to say: give me the "table" ancestor of this element. It would also express my intent better.
Rohith
Well, with a bit better info we could probably give you alternate solutions. How about adding the URL to the page you are attempting to parse to your question? Also, http://stackoverflow.com/questions/734178/hpricot-with-firebugs-xpath might be useful.
Greg
I've edited the question to add a link to the original page
Rohith
I added a bit different way of going after the tables.
Greg