If I have HTML that looks like:

<td class="blah">&nbs;<a href="http://.....">????</a>&nbsp;</td>

Could I get the ???? value using XPath? What would the expression look like?

A: 

Why would you use an XML parser to parse HTML? I would suggest a dedicated Java HTML parser instead. There are many, though I haven't tried any myself.
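
To illustrate the "dedicated HTML parser" route, here is a minimal sketch. It uses Python's standard-library `html.parser` rather than a Java library, purely as an example. The `LinkTextCollector` class, the `link_texts` helper, and the `http://example.invalid/` URL are made up for this sketch, and the void-element list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are not tracked as parents.
# (Assumption: a short, incomplete list that is enough for this sketch.)
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LinkTextCollector(HTMLParser):
    """Collects the text of every <a> directly inside <td class="blah">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of (tag, attrs) for currently open elements
        self.texts = []       # collected link texts
        self.buffer = None    # text pieces of the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        if tag not in VOID:
            self.open_tags.append((tag, dict(attrs)))
        if tag == "a" and parent is not None:
            parent_tag, parent_attrs = parent
            if parent_tag == "td" and parent_attrs.get("class") == "blah":
                self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.buffer is not None:
            self.texts.append("".join(self.buffer))
            self.buffer = None
        # Pop back to the nearest matching open tag; tolerate tag soup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


def link_texts(html):
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


SNIPPET = '<td class="blah">&nbsp;<a href="http://example.invalid/">link text</a>&nbsp;</td>'
print(link_texts(SNIPPET))  # ['link text']
```

Unlike an XML parser, this does not choke on `&nbsp;` or on unclosed tags, which is the main reason to prefer an HTML parser for real-world pages.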

As for whether it would work: I suspect not. You will get an error when trying to parse it as XML, right at &nbs; if not earlier, because an XML parser rejects any entity it does not know.
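
A quick way to check this is to feed the snippet to a strict XML parser. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for whichever XML parser you have. The `parses_as_xml` helper and the `http://example.invalid/` URL are invented for the example.

```python
import xml.etree.ElementTree as ET

# The question's markup, with a placeholder URL and link text.
HTML_SNIPPET = '<td class="blah">&nbsp;<a href="http://example.invalid/">link text</a>&nbsp;</td>'

# Same markup with the named entity replaced by a numeric character reference,
# which XML understands without any DTD.
XML_SNIPPET = HTML_SNIPPET.replace("&nbsp;", "&#160;")


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml(HTML_SNIPPET))  # False: &nbsp; is not a predefined XML entity
print(parses_as_xml(XML_SNIPPET))   # True
```

XML predefines only `&lt;`, `&gt;`, `&amp;`, `&quot;` and `&apos;`, so HTML named entities fail unless a DTD declares them.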

hhafez
A: 

To use XPath you usually need XML, not HTML, but some parsers (e.g. the one built into PHP) have a relaxed mode that will parse most HTML too.
If you want to find all <a> elements that are direct children of <td class="blah">, the XPath you need is

//td[@class = 'blah']/a
or
//td[@class = 'blah']/a[@href = 'http://...']

(depending on whether you want only the one URL or all URLs)
This will give you a node set. Iterate through it and, for each node, check the number of child nodes (it should be 1) and the nodeType of the firstChild (it should be a text node). The firstChild will then contain the ????
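
The steps above can be sketched roughly as follows, assuming the input is already well-formed XML. This uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and is not a DOM. The path therefore starts with `.//` instead of `//`, and the "single text child" check becomes "no child elements and non-empty text". The sample document, its URLs, and the `simple_link_texts` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; &#160; replaces &nbsp; so the document is valid XML.
DOC = """<table><tr>
<td class="blah">&#160;<a href="http://example.invalid/">link text</a>&#160;</td>
<td class="other"><a href="http://example.invalid/x">wrong cell</a></td>
<td class="blah"><a href="http://example.invalid/y"><b>nested</b></a></td>
</tr></table>"""


def simple_link_texts(xml_text):
    """Text of each <a> that is a direct child of <td class="blah">
    and contains nothing but text."""
    root = ET.fromstring(xml_text)
    texts = []
    for a in root.findall(".//td[@class='blah']/a"):
        # len(a) counts child elements; a.text is the text before the first child.
        if len(a) == 0 and a.text:
            texts.append(a.text)
    return texts


print(simple_link_texts(DOC))  # ['link text']
```

The link in the "other" cell is filtered out by the path itself, and the one wrapping a `<b>` is dropped by the child check.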

Mene