I would like to scrape some dynamically loaded data from a website.

On the site, there are a few links at the top labeled "1", "2", "3", and "next". If a numbered link is pressed, it dynamically loads some data into a content div. If "next" is pressed, it goes to a page with the labels "4", "5", "6", and "next", and the data for page 4 is shown.

I want to scrape the data from the content div for every numbered link (I don't know how many there are; the site just shows 3 at a time plus "next").

The data in the content div is laid out the same way on every page (only the text changes).

I have tried capturing the AJAX requests, thinking that I could grab the raw request once and then just change something like a "pagenum" POST parameter to load a new page. It turns out the site does some funky stuff with ASP that involves very long hex-string POST parameters that change on each request. I believe I could eventually get this to work, but it would be incredibly dirty and would break if the smallest thing changed.

My thinking is that I could use something like Selenium to click the links and load the pages for me, sending back the contents of the content div each time. The problem is that I don't know how many times I need to press the "next" button, so I can't just script pressing it X times. Is this something Selenium can handle? If so, can you point me to a tutorial on using Selenium to scrape like this? Most tutorials I have seen focus on using it for testing (which I know is its intended purpose).

A: 

I know the IRobotSoft web scraper can do this easily. See their demo, which scrapes PubMed data, here: http://www.irobotsoft.com/help/
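
If you would rather stay with Selenium, the unknown number of "next" presses does not have to be scripted in advance: a loop can keep clicking until neither the next numbered link nor "next" can be found. Below is a minimal, untested-against-your-site sketch of that idea in Python. The names `scrape_all_pages` and `run_with_selenium`, the element id `"content"`, the 10-second timeout, and the `max_pages` safety cap are all assumptions made for illustration; you would need to adjust the locators to match the real page.

```python
def scrape_all_pages(get_content, click_link, max_pages=1000):
    """Collect the content of pages 1, 2, 3, ... until no link is left.

    get_content() returns the current text of the content div.
    click_link(label) clicks the link with that visible text and
    returns False if no such link exists on the page.
    """
    results = []
    page = 1
    while page <= max_pages:  # safety cap in case "next" never disappears
        if not click_link(str(page)):
            # The numbered link is not shown, so try the next group.
            # Per the question, "next" already displays the first
            # page of the new group (e.g. page 4).
            if not click_link("next"):
                break  # neither link exists: assume we are done
        results.append(get_content())
        page += 1
    return results


def run_with_selenium(url):
    # Imported here so the loop above can be tested without Selenium.
    from selenium import webdriver
    from selenium.common.exceptions import (
        NoSuchElementException,
        TimeoutException,
    )
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)

        def get_content():
            # Assumed id of the content div; change to match the site.
            return driver.find_element(By.ID, "content").text

        def click_link(label):
            try:
                link = driver.find_element(By.LINK_TEXT, label)
            except NoSuchElementException:
                return False
            before = get_content()
            link.click()
            try:
                # Wait for the dynamically loaded text to change.
                WebDriverWait(driver, 10).until(
                    lambda d: get_content() != before
                )
            except TimeoutException:
                pass  # e.g. page 1 was already displayed
            return True

        return scrape_all_pages(get_content, click_link)
    finally:
        driver.quit()
```

Calling `run_with_selenium("http://example.com/")` (placeholder URL) would return a list with one string per page. Waiting for the text to change is only a heuristic: if two consecutive pages happen to contain identical text, the wait simply times out and the loop moves on.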

seagulf