Here is a piece of code that I use to fetch a web page HTML source (code) by its URL using Google App Engine:

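from google.appengine.api import urlfetch

url = "http://www.google.com/"
result = urlfetch.fetch(url)  # plain GET, no credentials sent
if result.status_code == 200:
    # Minimal CGI-style response: header line, blank line, then the body.
    print "content-type: text/plain"
    print
    print result.content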

This works fine, but sometimes I need the HTML source of a page on a site where I am registered, and that page is only accessible after I submit my ID and password. (It could be any site, really: a mail provider such as Yahoo, https://login.yahoo.com/config/mail?.src=ym&.intl=us, or any other site where users register for free accounts.) Can I do this in Python (through "Google App Engine")?

+3  A: 

You can check for an HTTP status code of 401, "authorization required", and provide the kind of HTTP authorization (basic, digest, whatever) that the site is asking for -- see e.g. here for more details (there's not much that's GAE specific here -- it's a matter of learning HTTP details and obeying them!-).

Alex Martelli
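
As a rough illustration of the approach in this answer: a 401 response normally carries a WWW-Authenticate header naming the scheme the server expects, and for the Basic scheme the client replies with an Authorization header containing the base64 encoding of "user:password". The helper names below (auth_scheme, basic_auth_header) and the variables in the commented usage are made up for this sketch; only the header format comes from the HTTP specification, and the urlfetch lines are left as comments because they only run inside App Engine.

```python
import base64


def auth_scheme(www_authenticate):
    """Return the scheme named in a WWW-Authenticate header, lower-cased.

    Hypothetical helper: it only looks at the first challenge in the header.
    """
    parts = www_authenticate.strip().split(None, 1)
    return parts[0].lower() if parts else ""


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic authentication."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}


# On App Engine the header could then be passed to urlfetch, for example:
#   result = urlfetch.fetch(url)
#   if result.status_code == 401:
#       challenge = result.headers.get("www-authenticate", "")
#       if auth_scheme(challenge) == "basic":
#           result = urlfetch.fetch(url, headers=basic_auth_header(user, pw))
```

This only helps with sites that actually answer with a 401 challenge; a site that shows an HTML login form instead returns 200 and needs a different approach.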
Alex, thank you for your answer again, but I just don't understand: (1) "You can check for an HTTP status code of 401" - where do I need to check this HTTP status code? I looked through the HTML source of the Yahoo page that I mentioned in my question and didn't find anything there related to an HTTP status code; (2) "provide the kind of HTTP authorization (basic, digest, whatever) that the site is asking for" - again, how do I do it?
brilliant
(3) The link you gave me leads to some kind of documentation for a robot (CheckUpDown robot). Are you suggesting that I use that robot? If yes, then I am afraid I won't be able to use GAE in this case.
brilliant
I just posted the first question (on checking HTTP status code) here: http://stackoverflow.com/questions/1901701/how-to-check-for-an-http-status-code-of-401 So, if you want, you can answer there.
brilliant
@brilliant, I was just pointing you to a short doc on what 401 means. I have answered the other question with more details, including pointers on how to use urllib2 to provide basic authentication (urllib2 can do digest authentication, too, if that's what the domain you're visiting requires).
Alex Martelli
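
A minimal sketch of the urllib2 route mentioned in the comment above, assuming the target site uses HTTP Basic authentication. The URL, user name, password, and the function name make_basic_auth_opener are placeholders invented for the example. The import fallback is there because the same classes live in urllib2 on Python 2 and in urllib.request on Python 3.

```python
try:  # Python 3 names
    from urllib.request import (HTTPBasicAuthHandler,
                                HTTPPasswordMgrWithDefaultRealm,
                                build_opener)
except ImportError:  # Python 2 names (urllib2)
    from urllib2 import (HTTPBasicAuthHandler,
                         HTTPPasswordMgrWithDefaultRealm,
                         build_opener)


def make_basic_auth_opener(top_level_url, username, password):
    """Return an opener that answers Basic challenges for top_level_url."""
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL".
    password_mgr.add_password(None, top_level_url, username, password)
    return build_opener(HTTPBasicAuthHandler(password_mgr))


# Placeholder URL and credentials; nothing is fetched until open() is called.
opener = make_basic_auth_opener("http://example.com/", "user", "secret")
# html = opener.open("http://example.com/private/page.html").read()
```

For a site that asks for digest authentication, HTTPDigestAuthHandler from the same module can be given the same password manager.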
(1) "I was just pointing you to a short doc on what 401 means." - I see. (2) "I have answered the other question with more details" - yes, thank you. I am now studying all those materials.
brilliant
+1  A: 

As Alex said, you can check the status code and see what type of authorization the site wants, but this does not generalize: some sites give no hint at all, or only allow login through a non-standard form. In those cases you may have to automate the login process using forms; for that you can use a library like twill (http://twill.idyll.org/) or code a specific form submission for each site.

Anurag Uniyal
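
For the form-based case described in this answer, a hand-coded form submission might look roughly like the sketch below. It is not twill and not any particular site's login: the URL and the field names ("username", "password") are hypothetical and would have to be copied from the input elements of the real login form. The function name build_login_request is also invented for the example.

```python
try:  # Python 3 names
    from http.cookiejar import CookieJar
    from urllib.parse import urlencode
    from urllib.request import HTTPCookieProcessor, Request, build_opener
except ImportError:  # Python 2 names
    from cookielib import CookieJar
    from urllib import urlencode
    from urllib2 import HTTPCookieProcessor, Request, build_opener


def build_login_request(login_url, fields):
    """Return a POST request carrying the login form fields.

    `fields` is a list of (name, value) pairs taken from the site's form.
    """
    body = urlencode(fields).encode("ascii")
    return Request(login_url, data=body)  # supplying data makes it a POST


# Hypothetical URL and field names, for illustration only.
request = build_login_request(
    "https://example.com/login",
    [("username", "me"), ("password", "secret")],
)

# An opener with a cookie jar keeps any session cookie set by the login
# response, so later requests through the same opener can reuse it.
opener = build_opener(HTTPCookieProcessor(CookieJar()))
# opener.open(request)                                   # submit the form
# html = opener.open("https://example.com/inbox").read()  # fetch a page
```

Real login forms often include hidden fields or tokens that must be sent back as well, which is one reason a per-site implementation, or a tool such as twill that parses the form, may be needed.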
Hello Anurag Uniyal! Thank you for your response. I think I am missing some basics here: (1) "As Alex said you can check for status code and see what type of authorization it wants" - I don't know how to do it.
brilliant
(2) "...you may have to automate the log-in process using forms, for that you can use library like twill..." - Can this be done on "Google App Engine"? I mean, won't using twill conflict with "Google App Engine"?
brilliant
I just posted the first question (on checking HTTP status code) here: http://stackoverflow.com/questions/1901701/how-to-check-for-an-http-status-code-of-401 So, if you want, you can answer there.
brilliant
I am not sure, but if twill is pure Python then you can use it on GAE; otherwise you can still post forms using urllib.
Anurag Uniyal
I see. Thank you very much.
brilliant