I'm writing a Java program that scrapes a web page for links and then stores them in a database, but I've run into a problem. Using HtmlUnit, I wrote the following:

page.getByXPath("//a[starts-with(@href, \"showdetails.aspx\")]");

It returns the correct anchor elements, but I only want the path contained in the href attribute, not the entire element. How can I do this? And further, how can I get the text contained between the tags:

<a href="">I need this data, too.</a>

Thanks in advance!

A:

The first (getting the href):

page.getByXPath("//a[starts-with(@href, \"showdetails.aspx\")]/@href");

The second (getting the text):

page.getByXPath("//a[starts-with(@href, \"showdetails.aspx\")]/text()");
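
Both expressions select nodes (attribute nodes and text nodes), so the string still has to be read off each result. Here is a minimal sketch of that idea using only the JDK's javax.xml.xpath rather than HtmlUnit. The class name LinkXPathSketch, the select helper, and the SAMPLE markup are made up for illustration, and HtmlUnit returns its own node types, which this sketch does not cover.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkXPathSketch {
    // Hypothetical, well-formed sample markup standing in for the scraped page.
    static final String SAMPLE =
        "<html><body>"
        + "<a href=\"showdetails.aspx?id=1\">First item</a>"
        + "<a href=\"other.aspx\">Skip me</a>"
        + "<a href=\"showdetails.aspx?id=2\">Second item</a>"
        + "</body></html>";

    // Evaluates the expression as a node set and collects each node's value.
    // For an attribute node that is the attribute value; for a text node, its text.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String links = "//a[starts-with(@href, \"showdetails.aspx\")]";
        System.out.println(select(SAMPLE, links + "/@href"));  // the two matching hrefs
        System.out.println(select(SAMPLE, links + "/text()")); // the two link texts
    }
}
```

The resulting list of href strings is what you would loop over to fetch the follow-up pages.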
dkackman
Allen Gingrich
See the edit and lemme know if that's what you're looking for. The XPath function text() will return the node contents (whether it be an attribute or an element).
dkackman
The pre-edit was closer to what I need than the edit. The edit returns an empty result, [], while the pre-edit returned what I described in my comment above. Basically, I believe you were right at first, but I'm unsure how to access that data. My goal is to use this to pull the links off the page, then loop through them and fetch the subsequent pages via the link paths, calling page.getByXPath() many more times for each link. I'm sorry if this is confusing.
Allen Gingrich
Ack. You're right. I forgot attributes don't get a text child.
dkackman
A: 

I assume that getByXPath is a utility function you wrote that uses XPath.evaluate? To get the string value, you could use either xpath.evaluate(expression, object) or xpath.evaluate(expression, object, XPathConstants.STRING).

Alternatively you could call getNodeValue() on the attribute node returned by evaluating "//a[starts-with(@href, \"showdetails.aspx\")]/@href".
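
As a rough sketch of the three variants, assuming the plain JDK javax.xml.xpath API and a well-formed document (the class XPathStringSketch, its method names, and the sample markup are hypothetical): note that evaluating a node set as a string yields only the first matching node's value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathStringSketch {
    static final String EXPRESSION =
        "//a[starts-with(@href, \"showdetails.aspx\")]/@href";

    // Hypothetical sample markup; a real page would have to be well-formed XML here.
    static final String SAMPLE =
        "<html><body>"
        + "<a href=\"showdetails.aspx?id=7\">Details</a>"
        + "<a href=\"showdetails.aspx?id=8\">More details</a>"
        + "</body></html>";

    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(markup)));
    }

    // Variant 1: the two-argument evaluate returns the result as a String.
    static String viaEvaluate(String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(EXPRESSION, parse(markup));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Variant 2: request a string result explicitly.
    static String viaStringConstant(String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(EXPRESSION, parse(markup), XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Variant 3: fetch the attribute node itself and read its value.
    static String viaNodeValue(String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node attr = (Node) xpath.evaluate(EXPRESSION, parse(markup), XPathConstants.NODE);
            return attr == null ? null : attr.getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaEvaluate(SAMPLE));
        System.out.println(viaStringConstant(SAMPLE));
        System.out.println(viaNodeValue(SAMPLE));
    }
}
```

All three return only the first match, so for every link on the page you would still evaluate with XPathConstants.NODESET and loop over the nodes.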

Jörn Horstmann