The site I'm trying to spider uses this JavaScript:

request.open("POST", url, true);

to pull in extra information over AJAX that I also need to spider. I've tried various permutations of:

r = mechanize.urlopen("https://site.tld/dir/" + url, urllib.urlencode({'none' : 'none'}))

to get Mechanize to fetch the page, but I always get the login HTML back, which indicates that something is wrong. According to Firebug, Firefox doesn't add any data to the POST, so I'm adding a dummy field to force urlopen to use "POST" instead of "GET", hoping the site ignores the field. I thought Mechanize's urlopen did include cookies, but because the site uses HTTPS it's hard to inspect the transaction with Wireshark to debug it.

Is there a better way?

Also there doesn't seem to be decent API documentation for Mechanize, just examples. This is annoying.

+1  A: 

This was what I came up with:

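import mechanize

# A non-None data argument (here a single space) makes this a POST.
req = mechanize.Request("https://www.site.com/path/" + url, " ")
req.add_header("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7")
req.add_header("Referer", "https://www.site.com/path")
# cj is the cookie jar from the logged-in session (created elsewhere).
cj.add_cookie_header(req)
res = mechanize.urlopen(req)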

What's interesting is that the " " passed to mechanize.Request forces it into "POST" mode. Obviously the site didn't choke on a single space :)
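To illustrate the GET/POST switch without hitting the site, here is a minimal sketch using Python 3's standard urllib.request.Request as a stand-in. That mechanize.Request follows the same urllib2-style rule (any data that is not None means POST) is an assumption on my part, and the URL is just a placeholder.

```python
# Stand-in for mechanize.Request: the stdlib Request picks the HTTP method
# from whether a data argument was supplied (assumed to match mechanize).
from urllib.request import Request

url = "https://www.site.com/path/page"  # placeholder URL, never fetched

get_req = Request(url)               # no data at all -> GET
post_req = Request(url, data=b" ")   # single-space body -> POST
empty_req = Request(url, data=b"")   # empty, but not None -> also POST

for r in (get_req, post_req, empty_req):
    print(r.get_method())
```

If the same rule holds for mechanize, an empty string should work as well as the single space, since only None selects GET.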

It needed the cookies as well. I debugged the headers using:

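import logging
import sys
import mechanize

# Handlers for plain and TLS connections, both with header debugging on.
hh = mechanize.HTTPHandler()
hsh = mechanize.HTTPSHandler()
hh.set_http_debuglevel(1)
hsh.set_http_debuglevel(1)
opener = mechanize.build_opener(hh, hsh)
# Send the debug output to stdout via the root logger.
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.NOTSET)
mechanize.install_opener(opener)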

I then compared that output against what Firebug was showing.
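For anyone who wants to see what cj.add_cookie_header(req) actually puts on the request, this is a rough offline sketch using the standard library's http.cookiejar and urllib.request in place of mechanize. The cookie name and value ("session", "abc123") are made up for illustration; in the real case the jar would be filled by the login response.

```python
# Offline sketch: a cookie jar attaching its cookies to an outgoing request.
# Stdlib classes stand in for mechanize; cookie name/value are hypothetical.
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request

cj = CookieJar()
cj.set_cookie(Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="www.site.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=True,      # only sent back over https
    expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

# Same shape as the request above: a POST with a single-space body.
req = Request("https://www.site.com/path/page", data=b" ")
cj.add_cookie_header(req)   # adds a Cookie header if any cookie matches

cookie_header = req.get_header("Cookie")
print(cookie_header)
```

If the printed header is None for a real request, the jar has no cookie matching that host, path, or scheme, which would explain getting the login page back.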

fret