I would like to write a Python script to crawl a social network website. The aim of the script is to retrieve a piece of the social graph (the friendship relationships).

The website does not provide any API.

The problem is: how can I crawl, in Python, a website that requires a login session to access the contact pages (for example, http://www.anobii.com/junemiller/friends )? I have my username and password and would like to use them to log in, but I don't know how to log in from Python and establish a session so that I can access those pages. Any suggestions for Python modules or methods?

Thanks, Jacopo

+2  A: 

First of all, you should check if the social network provides an API to do this. Also, check if what you want to do is allowed in the terms of service, or you'll risk being blocked/banned.

If there is no API and you're allowed to crawl the system this way, look into tools such as mechanize or twill, which simulate browser, cookie, and session behaviour and help with the scraping itself.

Alternatively, implement this yourself using lxml.html, urllib2, the cookielib module and so on.
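As a minimal sketch of the do-it-yourself route: in Python 3 `urllib2` became `urllib.request` and `cookielib` became `http.cookiejar`, and the code below uses those names. The login URL and the form field names (`username`, `password`) are assumptions for illustration; inspect the site's real login form to find the actual action URL and field names.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an (opener, cookie_jar) pair that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form.

    The field names "username" and "password" are hypothetical; use the
    names found in the site's actual login form.
    """
    form = {"username": username, "password": password}
    data = urllib.parse.urlencode(form).encode("utf-8")
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=data)


def fetch(opener, request_or_url):
    """Open a URL or Request through the cookie-aware opener and return the text."""
    with opener.open(request_or_url) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (not run here; the login URL is a placeholder):
# opener, jar = make_session()
# fetch(opener, build_login_request("http://www.example.com/login", "me", "secret"))
# html = fetch(opener, "http://www.anobii.com/junemiller/friends")
```

Because every request goes through the same opener, the session cookie set by the login response is sent automatically with the later page requests.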

Ivo van der Wijk
Ivo, thanks for your answer. I forgot to specify that there is no API, so I must simulate browser/cookie/session behaviour.
trampj
A: 

You should investigate Mechanize. From the documentation:

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.

Alternatively, you can roll your own using urllib2 and other built-in Python modules.
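Whichever route you pick, once pages can be fetched the crawl itself is a breadth-first walk over friends pages. Here is a rough sketch using only built-in modules. It assumes, hypothetically, that each friend appears on the page as a link of the form `/<username>/`; the real markup will differ and the pattern would need adapting (it would also match unrelated top-level links). The `fetch_friends_page` argument is any function that returns the HTML of a user's friends page.

```python
import re
from collections import deque
from html.parser import HTMLParser

# Hypothetical pattern: a profile link looks like /<username> or /<username>/
PROFILE_HREF = re.compile(r"^/([A-Za-z0-9_.-]+)/?$")


class FriendLinkParser(HTMLParser):
    """Collect usernames from <a href="/<username>/"> links."""

    def __init__(self):
        super().__init__()
        self.usernames = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PROFILE_HREF.match(href)
        if match:
            self.usernames.add(match.group(1))


def extract_friends(html):
    """Return the set of usernames linked from one friends page."""
    parser = FriendLinkParser()
    parser.feed(html)
    return parser.usernames


def crawl_graph(start_user, fetch_friends_page, max_users=100):
    """Breadth-first crawl; returns {user: set_of_friends}."""
    graph = {}
    queue = deque([start_user])
    while queue and len(graph) < max_users:
        user = queue.popleft()
        if user in graph:
            continue  # already visited
        friends = extract_friends(fetch_friends_page(user)) - {user}
        graph[user] = friends
        queue.extend(f for f in friends if f not in graph)
    return graph
```

The `max_users` cap keeps the crawl to a piece of the graph rather than the whole site; pausing between requests inside `fetch_friends_page` is a sensible courtesy to the server.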

As @Ivo said, first check whether the site has an API that does this for you. Facebook, for instance, has the Graph API, which does pretty much what you described.

Manoj Govindan
+1  A: 

You can also use Scrapy, which already handles cookies and web sessions.

There's an example of how to perform a login in the official documentation: http://doc.scrapy.org/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login

Scrapy is implemented using asynchronous I/O, so it can have many requests in flight at once and should be faster than Mechanize or twill when crawling many pages.

Pablo Hoffman