I want to do some web scraping with GAE (of the Infinite Campus Student Information Portal, FYI). This service requires you to log in to get into the website. I had some code that worked using mechanize in normal Python. When I learned that I couldn't use mechanize in Google App Engine, I ended up using urllib2 + ClientForm. I couldn't get it to log in to the server, so after a few hours of fiddling with cookie handling I ran the exact same code in a normal Python interpreter, and it worked. I found the log file and saw a ton of messages about stripping out the 'host' header in my request. I then found the source file on Google Code: the host header was in an 'untrusted' list and was removed from all requests made by user code.

Apparently GAE strips out the Host header, which I.C. requires to determine which school system to log you in to, and that is why it appeared that I couldn't log in.

How would I get around this problem? I can't specify anything else in my fake form submission to the target site. Why would this be a "security hole" in the first place?

+2  A: 

App Engine does not strip out the Host header: it forces it to be an accurate value based on the URI you are requesting. Assuming that URI is absolute, the server isn't even allowed to consider the Host header anyway, per RFC 2616 (section 5.2):

  1. If Request-URI is an absoluteURI, the host is part of the Request-URI. Any Host header field value in the request MUST be ignored.

...so I suspect you're misdiagnosing the cause of your problem. Try directing the request to a "dummy" server that you control (e.g. another very simple App Engine app of yours) so you can look at all the headers and the body of the request as it arrives from your GAE app, compared with how it arrives from your "normal Python interpreter". What do you observe this way?
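As a rough illustration of such a "dummy" server, here is a minimal sketch using only the Python 3 standard library rather than any App Engine API. It echoes back the method, path, headers, and body of whatever request it receives as JSON. The names `EchoHandler`, `start_echo_server`, and `fetch_echo` are made up for this example, and the sketch assumes you can host it somewhere your GAE app can reach.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: reply with the request exactly as received."""

    def _echo(self):
        # Read the body only if the client declared a length.
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length).decode("utf-8", "replace")
        payload = json.dumps({
            "method": self.command,
            "path": self.path,
            "headers": dict(self.headers.items()),
            "body": body,
        }, indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = _echo
    do_POST = _echo

    def log_message(self, format, *args):
        pass  # keep the console quiet; the echo is in the response


def start_echo_server(port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_echo(url, data=None, headers=None):
    """Send a request (POST if data is given) and return the echoed JSON."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, data=data, headers=headers or {})
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    server = start_echo_server()
    url = "http://127.0.0.1:%d/login" % server.server_address[1]
    print(json.dumps(fetch_echo(url, data=b"user=demo"), indent=2))
    server.shutdown()
```

Sending the same login request to this endpoint from both environments and comparing the two JSON dumps should show which headers or cookies actually differ, including the Host value each client ends up sending.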

Alex Martelli
http://webappecho.appspot.com/ is a good test for that.
Nick Johnson
nice pointer, thanks Nick!
Alex Martelli
Thanks! I'll try that!
Josh Patton