Right now, I can crawl regular pages using urllib2.

import urllib2, random

# agents is a pool of User-Agent strings to choose from (example values)
agents = ['Mozilla/5.0 (X11; Linux x86_64)', 'Opera/9.80']
request = urllib2.Request('http://stackoverflow.com')
request.add_header('User-Agent', random.choice(agents))
response = urllib2.urlopen(request)
htmlSource = response.read()
print htmlSource

However, I would like to simulate a POST request (or fake a session) so that I can log in to Facebook and crawl it. How do I do that?

+1  A: 

You can do POST requests by first encoding the data using urllib, and then sending the request using urllib2 just as you are doing now.

This is explained in this article.
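For example, a minimal sketch of that approach (the login URL and form field names below are placeholders, not the real ones for any particular site):

import urllib, urllib2

# urlencode the form fields, then pass the result as the request body
data = urllib.urlencode({'email': 'me@example.com', 'pass': 'secret'})
request = urllib2.Request('http://example.com/login', data)
response = urllib2.urlopen(request)
print response.read()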

Justin Standard
+7  A: 

You'll need to keep the cookie your site of choice sends you when you log in; that's what keeps your session. With urllib2, you do this by creating an Opener object that supports cookie processing:

import urllib2, cookielib
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

With this opener, you can do requests, either GET or POST:

content = opener.open(urllib2.Request(
    "http://social.netwo.rk/login",
    "user=foo&pass=bar")
).read()

Because you're passing a second parameter to urllib2.Request (the request body), it will be a POST request; if that parameter is None, you end up with a GET request instead. You can also add HTTP headers, either with .add_header() or by handing the constructor a dictionary of headers. Read the documentation for urllib2.Request for more information.
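For instance, reusing the opener built above (the URL, form data, and header values are placeholders):

req = urllib2.Request(
    "http://social.netwo.rk/login",
    "user=foo&pass=bar",
    {"User-Agent": "Mozilla/5.0"})  # headers handed to the constructor as a dictionary
req.add_header("Referer", "http://social.netwo.rk/")  # or added one at a time
content = opener.open(req).read()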

That should get you started! Good luck.

(ps: If you don't need read access to the cookies, you can just omit creating the cookie jar yourself; the HTTPCookieProcessor will do it for you.)
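A minimal sketch of that shorter form:

# HTTPCookieProcessor creates its own CookieJar when you don't pass one in
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())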

AKX
+2  A: 

The Mechanize library is an easy way to emulate a browser in Python.
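As a rough sketch, assuming the third-party mechanize package is installed (the URL and form field names here are made up):

import mechanize

br = mechanize.Browser()
br.open("http://example.com/login")
br.select_form(nr=0)        # select the first form on the page
br["user"] = "foo"          # placeholder field names
br["pass"] = "bar"
response = br.submit()      # cookies and redirects are handled for you
print response.read()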

Walter
+1  A: 

Or you may use PyCurl as an alternative...
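A rough sketch of a POST with PyCurl's cookie engine enabled (the URL and form data are placeholders):

import pycurl
from StringIO import StringIO

buf = StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/login")
c.setopt(pycurl.POSTFIELDS, "user=foo&pass=bar")  # POSTFIELDS makes this a POST
c.setopt(pycurl.COOKIEFILE, "")                   # an empty string turns on cookie handling
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print buf.getvalue()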

pounds