I want to crawl some data out of a phpBB forum I'm a member of, but login is required for that. I can log in using cURL, but when I try to crawl the data afterwards, the page still says that I need to log in before viewing it. Is it possible to log in using cURL and keep that session for further requests?

Another thing: after logging in, the forum usually shows a confirmation page and then, after 5 seconds, automatically redirects to the index page. If I log in using cURL, my script also follows that Location header and shows me that page.

Is there any workaround for this?

+1  A: 

This is what usually works for me


$timeout=5;
$file='cookies.jar';
$this->handle=curl_init('');
curl_setopt($this->handle, CURLOPT_COOKIEFILE,  $file);
curl_setopt($this->handle, CURLOPT_COOKIEJAR,   $file);
curl_setopt($this->handle, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($this->handle, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($this->handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 (.NET CLR 3.5.30729)");
curl_setopt($this->handle, CURLOPT_TIMEOUT, round($timeout,0));
curl_setopt($this->handle, CURLOPT_CONNECTTIMEOUT, round($timeout,0));

And I generally use it like this:


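// grab_first_page(), not_logged_in(), send_login_info() and
// end_of_script_with_error() are placeholders for your own functions
$now=grab_first_page();
if(not_logged_in($now)) {
   send_login_info();
   $now=grab_first_page(); // fetch the page again, now with the login cookies
}
if(not_logged_in($now)) { end_of_script_with_error(); }
// rest of script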

This way the cookies are kept across sessions and the script does not have to log in every time it does something.
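A minimal sketch of what the not_logged_in() check could look like. This is an assumption, not part of the original snippet: it supposes that pages served to a logged-in user contain a logout link, and the default marker 'mode=logout' is only a guess that has to be adjusted to what your forum actually sends.

```php
<?php
// Hypothetical helper: decides from the fetched HTML whether we are logged out.
// Assumption: logged-in pages contain a logout link. The default marker is a
// guess; save a page to disk and check what your forum really returns.
function not_logged_in(string $html, string $logoutMarker = 'mode=logout'): bool
{
    // stripos() is case-insensitive and returns false when the marker is absent.
    return stripos($html, $logoutMarker) === false;
}
```

If the forum marks the logged-in state differently, pass your own marker as the second argument.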

--- explanation for the comment below ----

I'm using an object, but you can replace $this->handle with a simple variable named $mycurl. The lines will then look like this:


$mycurl=curl_init('');
curl_setopt($mycurl, CURLOPT_COOKIEFILE, $file);

What the code above does:
- It initializes "a curl instance" (to keep it simple) (3rd line).
- 4th and 5th lines: save cookies to a file. Curl works just like a browser, so when you log in to a page with curl, it keeps the cookies with the authentication data in memory. I'm telling it to save them to a file so that the second time I run the script it will have the same cookies and will not need to authenticate again. Or you can have multiple scripts using the same cookie file, and just one for login that you run every 24 hours or whenever you're logged out...
- Other settings:
  * followlocation - when curl receives an HTTP redirect, it should return the page it was redirected to, not the redirect response
  * useragent - curl presents itself as Firefox
  * timeout - how long it should wait, in seconds: CURLOPT_CONNECTTIMEOUT limits establishing the connection and CURLOPT_TIMEOUT limits the whole request; 5 or 10 is usually more than enough
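The idea of several scripts sharing one cookie file can be sketched roughly as below. The function names make_handle() and fetch() are hypothetical (they are not part of the class mentioned in this answer), and the sketch assumes the cURL extension is loaded.

```php
<?php
// Sketch: one helper shared by every script, so they all use one cookie file.
// make_handle() and fetch() are hypothetical names chosen for this example.
function make_handle(string $url, string $cookieFile, int $timeout = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_COOKIEFILE     => $cookieFile, // load cookies saved by an earlier run
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is closed
        CURLOPT_RETURNTRANSFER => true,        // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_CONNECTTIMEOUT => $timeout,
    ]);
    return $ch;
}

// Fetches $url; a non-empty $post array turns the request into a POST.
function fetch(string $url, string $cookieFile, array $post = []): string
{
    $ch = make_handle($url, $cookieFile);
    if ($post) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    $page = curl_exec($ch);
    // Dropping the last reference closes the handle, which is when libcurl
    // writes the cookie jar to disk.
    unset($ch);
    return $page === false ? '' : $page;
}
```

A login script would call fetch() once with the login URL and the form fields as $post, and any other script would then call fetch() with the same $cookieFile and no $post. The form field names depend on the forum and are not shown here.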

I have put a simple class I use here: http://pastebin.com/Rfpc103X

You can use it like this:



// -- initialize curl
$ec=new easyCurl;

// -- set some options
//if the file you are in right now is named file_a.php it will create a file_a.jar cookie file
$ec->start(str_replace('.php','.jar',__FILE__));
$ec->headersPrepare(false);
$ec->prepareTimeOut(20);

$url='http://www.google.com/';

// --- set url
$ec->curlPrepare($url);

// --- get the actual data
$page=$ec->grab();

echo $page;

// to send GET data
$get_data=array('id'=>10);
$ec->curlPrepare($url,$get_data);

// and to post data
$post_data=array('user'=>'blue','password'=>'black');
$ec->curlPrepare($url,array(),$post_data);

It automatically handles the settings for POST/GET and the other options I usually encounter. I hope the examples above will be useful to you. Good luck.

vlad b.
Thanks for the reply, but can you please explain them? What exactly are you doing with the cookies.jar file? And I think you're using some framework for the second script, aren't you?
Bibhas
I added an explanation to the answer above, along with a simple-to-use class. If you have more questions, feel free to ask. What usually helps is to save every page you grab to the hard drive as a txt file and look at what response you are getting.
vlad b.
Thank you very much for the explanation. Am gonna try it right away. :)
Bibhas
Works like a charm.. :) Will post further queries, if any, later. :)
Bibhas
A: 

Yes, you have to save the cookies. To do that, you can create a cookie jar on login, that you reuse whenever you access the forum later.

curl --cookie-jar cjar -d "somelogindata" http://example.com/phpbb/login.php

That creates a cjar cookie jar file, which you then can reuse in later requests:

curl --cookie-jar cjar --cookie cjar "http://example.com/phpbb/viewforum.php?foobar"

The --cookie-jar option specifies a file where cookies are saved; to use them, you use the --cookie option. To update cookies, you should always provide the --cookie-jar option as well.

poke