If you visit this link right now, you will probably get a VBScript error.

On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.

The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.

My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.

I suspect the answer lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?

A: 

You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.

I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.

Yancy
+6  A: 

It always comes down to the request/response model. You just have to craft a series of HTTP requests such that you get the desired responses. In this case, you need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking the session. It could be a number of things, of which cookies, hidden inputs, form actions or post data, and query strings are the most common. But if I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the second request.
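If it is a cookie, the send/save/resend sequence can be sketched roughly like this. This uses the Python 3 standard library (urllib.request plus http.cookiejar) rather than mechanize, and it assumes the server sets an ordinary session cookie on the first response. The function name and the timeout default are made up for illustration; the two URLs are whatever the frame page and the main page turn out to be.

```python
import http.cookiejar
import urllib.request


def fetch_in_session(first_url, second_url, timeout=30):
    """Request first_url, then second_url, reusing any cookies the server set.

    Hypothetical helper: assumes the session is tracked by a cookie.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # First request: the server is assumed to hand back a session cookie,
    # which the cookie processor stores in the jar.
    with opener.open(first_url, timeout=timeout) as resp:
        resp.read()
    # Second request: the same opener sends the stored cookie back.
    with opener.open(second_url, timeout=timeout) as resp:
        return resp.read()
```

The key point is that both requests go through the same opener, so whatever the first response puts in the jar goes out with the second request. Two separate plain urlopen calls would not share any state.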

It could also be that the initial page will have buttons and links that get you to the second page. Those links will look like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&amp;LogNumber=0197D0820&amp;t=Traffic%20Hazard&amp;l=3358%20MYRTLE&amp;b=">, where a lot of the gobbledygook is generated by the first page.

The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" encodes some session information that you must get from the first page.
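To pull that query string out of the first page, something along these lines might work. It is only a sketch using the standard library's html.parser and urllib.parse. The class and function names are invented for the example, and filtering on "iiqr.asp" is an assumption based on the sample link above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and
        # unescapes entities such as &amp; in attribute values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def session_links(html, page="iiqr.asp"):
    """Return hrefs whose path ends with `page` (assumed target page)."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if urlsplit(h).path.endswith(page)]


def link_params(href):
    """Split a link's query string into a dict of value lists."""
    return parse_qs(urlsplit(href).query, keep_blank_values=True)
```

Calling session_links on the HTML of the first page gives the generated URLs to request next, and link_params shows which pieces (Center, LogNumber and so on) the first page filled in.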

And, of course, you might even need to do both.

Joel Coehoorn
Felt the need to edit this most excellent answer to include the URL session tracking as well as the cookie session tracking.
S.Lott
Thanks for pushing me in the right direction. The approach you outlined with cookie handling was exactly the right solution, and for me the answer was to manually handle cookies with mechanize [as outlined here][1]. [Et voila!][2]

[1] http://wwwsearch.sourceforge.net/mechanize/doc.html
[2] http://twitter.com/humboldtCHP
hanksims