ansaurus

Question

XPath Expression

Answer 1

+3 A:

You don't need to write these yourself, or even figure them out yourself. If you use the Firebug plugin, go to the page, right-click on the element you want, and click 'Inspect element'; Firebug will pop up the HTML in a viewer at the bottom of your browser. Right-click on the desired element in the HTML viewer and click 'Copy XPath'.

That said, the XPath expression you're looking for (for #3) is:

/html/body/div[4]/form/button

...obtained via the method described above.
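For anyone who wants to try an expression like this from Java, here is a minimal sketch using the JDK's built-in javax.xml.xpath API. The PAGE string is a made-up, well-formed stand-in for the real page (which is not reproduced in this thread), and the class and method names are purely illustrative. It only works on input that parses as XML.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class AbsolutePathDemo {
    // Hypothetical, well-formed stand-in for the real page: the fourth
    // <div> under <body> holds the form with the button.
    static final String PAGE =
        "<html><body>"
      + "<div/><div/><div/>"
      + "<div><form><button>Reply to this post</button></form></div>"
      + "</body></html>";

    /** Returns the string value of the first node matched by expr, or "" if nothing matches. */
    static String evaluate(String xml, String expr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression or for input that is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Prints the text of the button inside the fourth div.
        System.out.println(evaluate(PAGE, "/html/body/div[4]/form/button"));
    }
}
```

Absolute paths like this one break as soon as the page layout shifts (an extra div is enough), so they are best treated as a starting point.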

Alex Marshall 2009-09-25 15:05:41

Answer 2

+1 A:

As for your first page, it can't be done, because that is not how XPath works: for an XPath expression to select something, that "something" must be a node (for example, an element).
The second page is fairly easy, but you need an "id" attribute (or anything else that identifies your button uniquely). For example, if you are sure the text "Reply to this post" correctly identifies the button, compare against its text:
//button[text()="Reply to this post"]
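One detail worth checking when matching on text: in XPath 1.0 a predicate that contains only a string literal is converted to a boolean, and any non-empty string is true, so it filters nothing. The comparison has to be written against text() (or normalize-space()). A small sketch with the JDK's XPath API; the two-button PAGE string and the names below are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ButtonTextDemo {
    // Hypothetical form with two buttons, so a predicate that filters
    // nothing is easy to spot.
    static final String PAGE =
        "<html><body><form>"
      + "<button>Preview</button>"
      + "<button>Reply to this post</button>"
      + "</form></body></html>";

    /** Returns how many nodes the expression selects in the given XML. */
    static int count(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Bare string literal: always true, so both buttons match (prints 2).
        System.out.println(count(PAGE, "//button[\"Reply to this post\"]"));
        // Explicit comparison against the text node: one button matches (prints 1).
        System.out.println(count(PAGE, "//button[text()=\"Reply to this post\"]"));
    }
}
```

If the real button text carries surrounding whitespace, normalize-space()="Reply to this post" is the more forgiving comparison.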

phunehehe 2009-09-25 15:07:36

Answer 3

+3 A:

I noticed that the DTD for the first link is HTML 4.01 Transitional and not XHTML, so there's no guarantee that this is a valid XML document, and it may not be loaded correctly by an XML parser. In fact, I see several tags that aren't properly closed (e.g. <hr>).
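To see the problem concretely, here is a rough sketch that feeds markup to the JDK's standard XML parser and reports whether it parses. The sample strings and the isWellFormed helper are made up for this illustration; a real page would need to be cleaned up first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser stops at the first well-formedness error.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // HTML 4.01 style: <hr> is never closed, so an XML parser rejects it.
        System.out.println(isWellFormed("<html><body><hr><p>text</p></body></html>"));
        // XHTML style: <hr/> is self-closed, so the same content parses.
        System.out.println(isWellFormed("<html><body><hr/><p>text</p></body></html>"));
    }
}
```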

I don't know the first one offhand, and the third one was just answered by Alex, but the second one is /html/body/a[1] (XPath positions start at 1, not 0).
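A note on positional predicates, since they are easy to get wrong: XPath counts from 1, so a[1] is the first a element and a[0] never matches anything. A quick sketch with the JDK's XPath API, using an invented two-link document and illustrative names:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class PositionDemo {
    // Hypothetical body with two links directly under <body>.
    static final String PAGE =
        "<html><body>"
      + "<a href=\"/first\">first</a>"
      + "<a href=\"/second\">second</a>"
      + "</body></html>";

    /** Returns the string value of the first matched node, or "" if nothing matches. */
    static String evaluate(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + evaluate(PAGE, "/html/body/a[0]") + "]"); // prints []
        System.out.println("[" + evaluate(PAGE, "/html/body/a[1]") + "]"); // prints [first]
        System.out.println("[" + evaluate(PAGE, "/html/body/a[2]") + "]"); // prints [second]
    }
}
```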

ristonj 2009-09-25 15:09:36

Further to ristonj's response, there are also numerous HTML sanitizers out there for Ruby, Java, [you name it] that will convert SGML documents (like HTML 4.01) to XML. You could run one of those first if you want to scrape pages programmatically.

Alex Marshall 2009-09-25 15:11:59

Yes, Marshall. I am scraping pages with a Java program. I first fetch the HTML source of the page and then want to use either "regex" or "xpath" to extract the desired information. How can I use an HTML sanitizer to convert that HTML source, which I hold as a String, into an XML document? Is there an external library for that? If so, can you please tell me the download URL of the jar file? My main concern is the speed of the program.

Yatendra Goel 2009-09-25 15:18:41

@Yatendra Goel: I've used the WebHarvest library (http://web-harvest.sourceforge.net) with great success in past projects. I'd recommend that you start there. It lets you declaratively define scrapers in config files that it then runs, rather than you having to "manually" scrape pages in code you write yourself. You can then store the scraped values in variables and retrieve them for use in your code, which is much easier than what you're doing at the moment.

Alex Marshall 2009-09-25 17:19:24
