Background: The page has a table of data. Several hyperlinks, when clicked, replace the data in the table with new data. The page is an ASPX page.

Goal: I want to scrape the data in the table for each hyperlink clicked.

I have looked at what is going on via Firebug: when a hyperlink is clicked, the browser sends an HTTP POST back to the server via AJAX. The problem is that a lot of what looks like garbage POST parameters are sent along. I assume this is because ASP.NET does some session-state bookkeeping, and that even if I copied the exact parameters my browser sent, most of them would no longer be valid later anyway.
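
For context, an ASP.NET page typically carries its state in hidden form fields such as __VIEWSTATE and __EVENTVALIDATION, and a postback echoes those back along with __EVENTTARGET naming the clicked control. A minimal sketch of replaying such a postback with freshly fetched values rather than stale captured ones (the URL and control ID below are hypothetical placeholders):

    import requests
    from bs4 import BeautifulSoup

    URL = "http://example.com/data.aspx"  # hypothetical page URL

    session = requests.Session()
    page = session.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # Collect the current values of the hidden state fields instead of
    # replaying the ones captured earlier in Firebug.
    form_data = {
        field["name"]: field.get("value", "")
        for field in soup.select("input[type=hidden]")
        if field.has_attr("name")
    }

    # __EVENTTARGET names the control that "clicked"; this ID is a
    # hypothetical placeholder for one of the page's hyperlinks.
    form_data["__EVENTTARGET"] = "ctl00$Main$lnkPage2"
    form_data["__EVENTARGUMENT"] = ""

    response = session.post(URL, data=form_data)
    # response.text now holds the HTML with the replaced table data.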

How do people usually write HTTP scripts that deal with this kind of thing?

A: 

The fool-proof method I use is to just interpret the JS from the page in my scraping script and let it fill in all these parameters itself. The quickest way to do this is to use a ready-made engine, like WebKit, and build your scraper on top of it.

A harder but more flexible way is to use Google's V8 or Mozilla's SpiderMonkey JS engine and provide your own DOM context to it.
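
As a rough illustration of that second approach, here is a minimal sketch assuming the PyV8 binding for V8; the document stub is a hypothetical stand-in for a real DOM implementation:

    import PyV8

    class Document(PyV8.JSClass):
        """Hypothetical stub standing in for a real DOM document."""
        def getElementById(self, element_id):
            print("page script asked for element: %s" % element_id)
            return None

    class Global(PyV8.JSClass):
        """Global object exposed to the page's scripts as its context."""
        def __init__(self):
            self.document = Document()

    # Run the page's JavaScript against the hand-rolled context so it
    # can compute the postback parameters itself.
    ctxt = PyV8.JSContext(Global())
    ctxt.enter()
    ctxt.eval("document.getElementById('__VIEWSTATE')")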

Daniel Kluev
Can you expand more on "provide your own DOM context to them", please?
Boris Yeltz
A: 

Most of the time I use WatiN for simple scrapes. Only rarely do I write custom parsers/scrapers anymore.

BioBuckyBall
A: 

I would use the IRobotSoft web scraper to do this. It should be very simple.

seagulf
A: 

Here is a Python example that uses WebKit to run the JavaScript in a web page and give you the final HTML.
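
The code itself was not included above; a minimal sketch of the idea, assuming PyQt4's QtWebKit bindings (the URL is a hypothetical placeholder):

    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    class Render(QWebPage):
        """Load a URL in a headless WebKit page and capture the final HTML."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._load_finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()  # block until loadFinished fires

        def _load_finished(self, result):
            # Grab the HTML after the page's JavaScript has run.
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    render = Render("http://example.com/data.aspx")  # hypothetical target
    print(render.html)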

Plumo