Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPATH to find the table then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPATH to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the xpath display, and past that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently, is to use the %
(AKA "at") instead of /
(AKA search). %
returns only the first occurrence so you can drop the [0]
index.
(page%"a[@name=a1]").parent.parent.parent.parent.parent
or
page%'//a[@name="a1"]/../../../../../..'
which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
page%'//table[@height=61 and @width=700]'
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil
Basically the XPath pattern means:
Find the body tag, then the third table, then its row's third cell. In the cell locate the third table.
Note: Firefox automatically adds the <tbody>
tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3]
according to Firefox so you have to strip the tbody
. Also you don't need to anchor at /html
.