I am new to XPath. I have the HTML source of the web page

http://london.craigslist.co.uk/com/1233708939.html

Now I want to extract the following data from the above page:

  1. Full Date
  2. Email - just below the date

I also want to check whether the button "Reply to this post" exists on the page

http://sfbay.craigslist.org/sfc/w4w/1391399758.html

Can anyone help me write the three XPath expressions for the three items above?

+3  A: 

You don't need to write these yourself, or even figure them out yourself. If you use the Firebug plugin, go to the page, right-click on the element you want, and click 'Inspect element'; Firebug will pop up the HTML in a viewer at the bottom of your browser. Right-click on the desired element in the HTML viewer and click 'Copy XPath'.

That said, the XPath expression you're looking for (for #3) is:

/html/body/div[4]/form/button

...obtained via the method described above.

Alex Marshall
+1  A: 

For your first page, it's just not possible, because that is not the way XPath works. For an XPath expression to select something, that "something" must be a node (such as an element).
The second page is fairly easy, but you need an "id" attribute to do that (or anything else that makes sure your button is unique). For example, if you are sure the text "Reply to this post" correctly identifies the button, just do it with
//button[normalize-space()="Reply to this post"]
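To run this kind of check from Java, the predicate has to compare the button's text; a bare string literal inside a predicate is always true, so it would match every button. Below is a minimal sketch using the JDK's `javax.xml.xpath` package. It assumes the page has already been turned into well-formed XML, and the class name, method name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ReplyButtonCheck {

    // Returns true if the well-formed markup contains a <button> whose text,
    // after whitespace normalization, equals the given label.
    // Assumption: the label contains no single quotes.
    public static boolean hasButton(String xml, String label)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "boolean(//button[normalize-space(.) = '" + label + "'])";
        InputSource source = new InputSource(new StringReader(xml));
        return (Boolean) xpath.evaluate(expr, source, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed stand-in for the real page.
        String page = "<html><body><form>"
                + "<button> Reply to this post </button>"
                + "</form></body></html>";
        System.out.println(hasButton(page, "Reply to this post")); // prints true
        System.out.println(hasButton(page, "Flag this post"));     // prints false
    }
}
```

Matching on the visible text is more fragile than matching on an "id" attribute, so it is probably worth switching to the latter if the page provides one.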

phunehehe
+3  A: 

I noticed that the DTD for the first link is HTML 4.01 Transitional and not XHTML, so there's no guarantee that this is a well-formed XML document, and it may not be loaded correctly by an XML parser. In fact, I see several tags that aren't properly closed (e.g. <hr>).

I don't know the first one offhand, and the third one was just answered by Alex, but the second one is /html/body/a[1] (XPath positions start at 1, not 0).
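One detail worth checking when indexing like this: XPath positions are 1-based, so `a[1]` is the first link under the body and `a[0]` never matches anything. The sketch below illustrates this with the JDK's `javax.xml.xpath`; the sample markup, the e-mail address and the class and method names are invented for the example, and it assumes the e-mail link is the first `<a>` directly under `<body>`, which may not hold for the real page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathIndexDemo {

    // Evaluates the expression against well-formed markup and returns the
    // string value of the first matching node, or "" if nothing matches.
    public static String select(String xml, String expr)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed stand-in for the real page.
        String page = "<html><body>"
                + "<a href=\"mailto:someone@example.com\">someone@example.com</a>"
                + "<a href=\"http://example.com/\">other link</a>"
                + "</body></html>";

        // Position 1 is the first <a> child of <body>.
        System.out.println(select(page, "/html/body/a[1]/@href"));
        // Position 0 does not exist in XPath, so this yields an empty string.
        System.out.println(select(page, "/html/body/a[0]/@href").isEmpty());
    }
}
```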

ristonj
Further to ristonj's response, there are also numerous HTML sanitizers out there for Ruby, Java, [you name it] that will convert SGML documents (like HTML 4.01) to XML, which you could run first if you want to scrape pages programmatically.
Alex Marshall
Yes, Marshall, I am scraping pages with a Java program. First I get the HTML source of the page, and then I want to use either regex or XPath to scrape the desired information. How can I use an HTML sanitizer to convert that HTML source, which I have as a String, into an XML document? Is there an external library for that? If so, can you please tell me the download URL of the jar file? My main concern is the speed of the program.
Yatendra Goel
@Yatendra Goel: I've used the WebHarvest library (http://web-harvest.sourceforge.net) with great success in past projects, and I'd recommend that you start there. It lets you declaratively define scrapers in config files that it then runs, rather than you having to "manually" scrape pages in code you write yourself. You can then store the scraped values in variables and retrieve them for use in your code, which is much easier than what you're doing at the moment.
Alex Marshall