views: 210

answers: 1

I am writing a crawler. Once the crawler logs into a website, I want it to stay logged in permanently. How can I do that? Can a client (like a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.

+5  A: 

"Logged-in state" is usually represented by cookies. So what your have to do is to store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).

See also this question:

http://stackoverflow.com/questions/1016765/how-to-use-cookielib-with-httplib-in-python

A more comprehensive article on how to implement it:

http://www.voidspace.org.uk/python/articles/cookielib.shtml

The simplest examples are at the bottom of this manual page:

http://www.python.org/doc/2.6.4/library/cookielib.html
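The core idea can be sketched like this (using Python 3's http.cookiejar, the successor of the cookielib module linked above; the cookie name, value and file path are made up for illustration):

```python
import http.cookiejar
import os
import tempfile

# A Mozilla-format cookie file lets the crawler persist its session
# cookies between runs (path here is just an example).
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
jar = http.cookiejar.MozillaCookieJar(path)

# Normally the server's Set-Cookie response fills the jar; we add a
# session cookie by hand only to demonstrate the save/restore round trip.
cookie = http.cookiejar.Cookie(
    version=0, name="sessionid", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)
jar.set_cookie(cookie)

# Persist the session cookie so the next crawler run can reuse it.
jar.save(ignore_discard=True)

# A later crawler run reloads the jar and keeps the logged-in state.
restored = http.cookiejar.MozillaCookieJar(path)
restored.load(ignore_discard=True)
# the restored jar now contains the "sessionid" cookie
```

In a real crawler you would attach the jar to an opener built with urllib.request.HTTPCookieProcessor, so that the login response fills the jar and every subsequent request sends the cookies back automatically.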

You can also use a regular browser (like Firefox) to log in manually. Then you can save the cookie from that browser and use it in your crawler. But such cookies are usually valid only for a limited time, so this is not a long-term, fully automated solution. It can be quite handy for downloading content from a Web site once, however.
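If you export the browser's cookies in Mozilla cookies.txt format, reusing them could look roughly like this (the file name and URL are placeholders, and the actual network calls are commented out):

```python
import http.cookiejar
import urllib.request

# Cookie jar backed by a cookies.txt file exported from the browser
# (the file name is a placeholder).
jar = http.cookiejar.MozillaCookieJar("cookies.txt")
# jar.load(ignore_discard=True, ignore_expires=True)

# Build an opener that sends the stored cookies with every request
# and records any new ones the server sets.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# Every request through `opener` now carries the browser's session:
# html = opener.open("https://example.com/members-only").read()
```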

UPDATE:

I've just found another interesting tool in a recent question:

http://www.scrapy.org

It can also do such cookie-based login:

http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin

The question I mentioned is here:

http://stackoverflow.com/questions/1804694/scrapy-domainname-for-spider

Hope this helps.

fviktor
+1: And send the cookie back again.
Aiden Bell
Also, he might have to add sporadic activity to the session to stop it from expiring.
Aiden Bell
The session can expire due to a server-side limit on session lifetime, even if you add sporadic activity. So the long-term solution is to let the crawler log in again when needed. But using a cookie saved from a browser after logging in manually and keeping it alive is simpler, indeed, as long as the server allows sessions of (essentially) unlimited lifetime.
fviktor
@fviktor - How do I know whether the server allows sessions of unlimited lifetime? Are you referring to the "Keep-alive" header? Can you be a little more specific?
Vadi
@Aiden Bell -- Can you explain the "sporadic activity"?
Vadi
I think there is no way to figure that out, since the server can delete the server-side session information even before the cookie expires in your browser. That deletion can be prevented by the sporadic activity. I think Aiden Bell meant periodic dummy requests to the given server even while your crawler is idle.
fviktor
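Such sporadic activity could be sketched as periodic dummy requests (the interval and URL are guesses to be tuned to the server's session timeout; any opener-like object with an open() method works):

```python
import time

def ping_session(opener, url):
    """Send one cheap keep-alive request; return False when the
    session appears to be gone (so the caller can log in again)."""
    try:
        opener.open(url).read()
        return True
    except Exception:
        return False

def keep_session_alive(opener, url, interval=300):
    """Ping the server every `interval` seconds while the crawler is
    idle. The 300-second default is a guess; stay well under the
    server's session timeout."""
    while ping_session(opener, url):
        time.sleep(interval)
```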
Cookies also have a lifetime on the client side, but if you keep the cookie forever in Python, then that lifetime no longer matters.
fviktor