I am writing a crawler. After the crawler logs into a website, I want it to stay logged in permanently. How can I do that? Can a client (a browser, a crawler, etc.) make a server obey such a rule? This scenario matters when the server only allows a limited number of logins per day.
"Logged-in state" is usually represented by cookies. So what your have to do is to store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).
See also this question:
http://stackoverflow.com/questions/1016765/how-to-use-cookielib-with-httplib-in-python
A more comprehensive article on how to implement it:
http://www.voidspace.org.uk/python/articles/cookielib.shtml
The simplest examples are at the bottom of this manual page:
http://www.python.org/doc/2.6.4/library/cookielib.html
You can also use a regular browser (like Firefox) to log in manually. Then you can save the cookie from that browser and use it in your crawler. But such cookies are usually only valid for a limited time, so this is not a long-term, fully automated solution. It can be quite handy for downloading content from a website once, however.
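In that case you just attach the cookie header yourself; a sketch, where the cookie name and value are placeholders for whatever your browser actually shows:

```python
import urllib2

# Copy the real cookie name/value pair from the browser's cookie viewer.
req = urllib2.Request('http://example.com/members-only')
req.add_header('Cookie', 'sessionid=abc123')
page = urllib2.urlopen(req).read()
```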
UPDATE:
I've just found another interesting tool in a recent question: Scrapy, a crawling framework that can also do this kind of cookie-based login:
http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin
The question I mentioned is here:
http://stackoverflow.com/questions/1804694/scrapy-domainname-for-spider
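A rough sketch of what such a login spider looks like, based on the FormRequest pattern from the linked docs; the spider name, URLs, and form fields are placeholders:

```python
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'example'
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # Submit the login form; Scrapy keeps the session cookie
        # for all requests made by this spider afterwards.
        return FormRequest.from_response(
            response,
            formdata={'username': 'me', 'password': 'secret'},
            callback=self.after_login)

    def after_login(self, response):
        # From here on, crawl pages that require the logged-in session.
        pass
```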
Hope this helps.