I'm using cURL to scrape a page behind a login, and I'm at my wits' end. I've scraped two other sites with little or no trouble, but I just can't log into this one. cURL fetches every page I ask for, but always as a logged-out visitor, which doesn't help. Maybe someone can spot a mistake I've missed?

The code is:

$url_to = 'http://fastorder.newrock.es/store2009/index.php/customer/account/loginPost/';
$url_from = 'http://fastorder.newrock.es/store2009/index.php/customer/account/login/';
$url_get = 'http://fastorder.newrock.es/store2009/index.php/';
$name_pass = 'login%5Busername%5D=*****&login%5Bpassword%5D=*****&send=';

function login($link,$user,$from) {
    $fp = fopen("cookie.txt", "w");
    fclose($fp);
    $log = curl_init();
    curl_setopt($log, CURLOPT_REFERER, $from);
    curl_setopt($log, CURLOPT_URL, $link);
    curl_setopt($log, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($log, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($log, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6");
    curl_setopt($log, CURLOPT_TIMEOUT, 40);
    curl_setopt($log, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($log, CURLOPT_HEADER, TRUE);
    curl_setopt($log, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($log, CURLOPT_POST, TRUE);      
    curl_setopt($log, CURLOPT_POSTFIELDS, $user);
    $data = curl_exec($log);
    curl_close($log);
}

login($url_to,$name_pass,$url_from);

function get($url) {
    $get = curl_init();
    curl_setopt($get, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($get, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($get, CURLOPT_URL, $url);
    return curl_exec ($get);
    curl_close ($get);
}

$html = get($url_get);
echo $html;

This is more or less the same script that worked on the other two sites, where it logs in fine. What threw me off at the start are the escape codes in $name_pass. It turns out the site names its username and password input fields login[username] and login[password]. Why, I've no idea, but I've tried sending them both percent-encoded and with literal brackets, and neither helped.

Live HTTP Headers is giving me the following for the page:

http://fastorder.newrock.es/store2009/index.php/customer/account/loginPost/

POST /store2009/index.php/customer/account/loginPost/ HTTP/1.1
Host: fastorder.newrock.es
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://fastorder.newrock.es/store2009/index.php/customer/account/login/
Cookie: frontend=6tjul97q4mvn0046ier0k79li8
Content-Type: application/x-www-form-urlencoded
Content-Length: 81
login%5Busername%5D=*****&login%5Bpassword%5D=*****&send=
HTTP/1.1 302 Found
Date: Fri, 26 Feb 2010 12:29:19 GMT
Server: Apache/2.0.63 (CentOS)
X-Powered-By: PHP/5.2.10
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: http://fastorder.newrock.es/store2009/index.php/customer/account/
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

I've tried to copy everything I could into the cURL script, thinking there's some obscure way of blocking the script from logging in. But right now I'm totally stuck and have no idea what to do next. I've dug through a lot of tutorials, and they all give advice that worked like a charm for the first two sites.

Halp?

A: 

It may be this:

login%5Busername%5D=*****&login%5Bpassword%5D=*****&send=

I'm no curl guru, but your script seems to be OK, so maybe you should not escape the characters.

I would run local tests with curl against this kind of login form. Maybe you can debug what's wrong from there. If I'm right, the fields will arrive empty.

metrobalderas
Yeah, I thought about that too. Thing is, it's the website that's escaping them, not me. I got that from Live HTTP Headers, so I checked the site's form. It's exactly as written up there: "login[username]" and "login[password]". Which would mean that the variable for them should be $_POST['login[username]']. Hence the escape. That's just bad coding, but I'm a site user, not the owner.
Xipe_Totec
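A note on the bracketed field names: PHP treats login[username] as array syntax, so on the server the value arrives as $_POST['login']['username'] rather than $_POST['login[username]'], and the %5B/%5D escapes are just the ordinary URL encoding of the brackets. The sketch below builds the same POST body with http_build_query() and decodes it again with parse_str(). The credentials are hypothetical placeholders, since the real ones are masked in the question.

```php
<?php
// Hypothetical credentials; the real values are masked in the question.
$fields = [
    'login' => [
        'username' => 'user@example.com',
        'password' => 'secret',
    ],
    'send' => '',
];

// http_build_query() percent-encodes the brackets in the nested keys.
$postBody = http_build_query($fields);
echo $postBody, PHP_EOL;

// parse_str() decodes the body the same way PHP fills $_POST:
// bracketed names become a nested array.
parse_str($postBody, $decoded);
echo $decoded['login']['username'], PHP_EOL;
```

Passing the resulting string to CURLOPT_POSTFIELDS should be equivalent to the hand-written $name_pass, so the encoding itself is unlikely to be the problem.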
A: 

Suggestion: Use Fiddler (www.fiddler2.com) to diff the request traffic, CURL vs your browser.

EricLaw -MSFT-
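The same comparison can be roughed out in PHP without a proxy. The script's outgoing request can be captured by setting CURLINFO_HEADER_OUT to true with curl_setopt() and reading curl_getinfo($ch, CURLINFO_HEADER_OUT) after curl_exec(). The sketch below then lists header names the browser sent that the script did not. The function names parseHeaders and missingHeaders are made up for this example, the browser headers are taken from the Live HTTP Headers capture in the question, and the script request is a hypothetical first run with an empty cookie jar.

```php
<?php
// Turn a raw header block into a lowercase-name => value map.
function parseHeaders(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n|\r/', trim($raw)) as $line) {
        // Skip the request line ("POST /path HTTP/1.1") and anything without a colon.
        if (preg_match('#^[A-Z]+ \S+ HTTP/#', $line) || strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return $headers;
}

// Header names present in the browser request but absent from the script's.
function missingHeaders(string $browserRaw, string $scriptRaw): array
{
    return array_values(array_diff(
        array_keys(parseHeaders($browserRaw)),
        array_keys(parseHeaders($scriptRaw))
    ));
}

// Subset of the browser request shown in the question.
$browser = "POST /store2009/index.php/customer/account/loginPost/ HTTP/1.1\r\n"
    . "Host: fastorder.newrock.es\r\n"
    . "Referer: http://fastorder.newrock.es/store2009/index.php/customer/account/login/\r\n"
    . "Cookie: frontend=6tjul97q4mvn0046ier0k79li8\r\n"
    . "Content-Type: application/x-www-form-urlencoded\r\n";

// Hypothetical script request on a first run, before any cookie exists.
$script = "POST /store2009/index.php/customer/account/loginPost/ HTTP/1.1\r\n"
    . "Host: fastorder.newrock.es\r\n"
    . "Referer: http://fastorder.newrock.es/store2009/index.php/customer/account/login/\r\n"
    . "Content-Type: application/x-www-form-urlencoded\r\n";

echo implode(', ', missingHeaders($browser, $script)), PHP_EOL;
```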
A: 

There is something broken with that store's registration/login. The activation email said to just log in to activate the account. I've tried logging in multiple times, but I get the error "This account is not activated." every time.

Below is a quick change that prints the returned login page.

$url_to = 'http://fastorder.newrock.es/store2009/index.php/customer/account/loginPost/';
$url_from = 'http://fastorder.newrock.es/store2009/index.php/customer/account/login/';
$url_get = 'http://fastorder.newrock.es/store2009/index.php/';
$name_pass = 'login%5Busername%5D=*****&login%5Bpassword%5D=*****&send=';

function login($link,$user,$from) {
    $fp = fopen("cookie.txt", "w");
    fclose($fp);
    $log = curl_init();
    curl_setopt($log, CURLOPT_REFERER, $from);
    curl_setopt($log, CURLOPT_URL, $link);
    curl_setopt($log, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($log, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($log, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6");
    curl_setopt($log, CURLOPT_TIMEOUT, 40);
    curl_setopt($log, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($log, CURLOPT_HEADER, TRUE);
    curl_setopt($log, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($log, CURLOPT_POST, TRUE);
    curl_setopt($log, CURLOPT_POSTFIELDS, $user);
    $data = curl_exec($log);
    curl_close($log);
    return $data;
}

echo login($url_to,$name_pass,$url_from);

function get($url) {
    $get = curl_init();
    curl_setopt($get, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($get, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($get, CURLOPT_URL, $url);
    $html = curl_exec($get);
    curl_close($get); // release the handle before returning
    return $html;
}

$html = get($url_get);
echo $html;

Edit:
Is the cookie data being written to the cookie file (cookie.txt)? If not...

  1. Check the file permissions and make sure it's writable.

  2. A bug in earlier versions of php5 caused the cookies file option to be ignored.

Details on the bug are here: http://bugs.php.net/bug.php?id=33475
Solution: Add unset($log) after curl_close($log);

It's hard to debug this script without being able to test it.

John Himmelman
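The permissions check in point 1 can be scripted so the failure is loud instead of silent. This is a minimal sketch, not a definitive fix: cookieJarPath is a hypothetical helper name, and the example uses the system temp directory only so it runs anywhere. In the actual script you would more likely pass __DIR__, which also gives curl an absolute path to the jar.

```php
<?php
// Returns an absolute, writable path for the cookie jar, or throws.
function cookieJarPath(string $dir, string $name = 'cookie.txt'): string
{
    $real = realpath($dir);
    if ($real === false || !is_dir($real)) {
        throw new RuntimeException("Directory not found: $dir");
    }
    $path = $real . DIRECTORY_SEPARATOR . $name;
    // touch() creates the file if it does not exist and returns false on failure.
    if (!@touch($path) || !is_writable($path)) {
        throw new RuntimeException("Cookie jar is not writable: $path");
    }
    return $path;
}

// Demo with the temp directory; swap in __DIR__ for a jar next to the script.
$jar = cookieJarPath(sys_get_temp_dir());
echo $jar, PHP_EOL;
```

The returned path can then be handed to both CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE.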
Yes, very broken. And not just code-wise. They advertise registration on a pretty subjective, private-user level... and then they manually open it just for business users. So, uhm, yeah. Re: edited code. Interesting. It would seem the problem's in the cookies, because it prints out the "please enable cookies" page. Now I only need to figure out what exactly is wrong with them. :/
Xipe_Totec
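One thing that may explain the "please enable cookies" page: in the Live HTTP Headers capture the browser already sends Cookie: frontend=... with the login POST, so it picked up a session cookie from the login page first, while the script posts with an empty jar. The sketch below is a guess at a fix under that assumption, not a confirmed solution. It fetches the login page before posting and reuses one handle so cookies carry over between requests. The function names newSession, fetch and loginAndFetch are made up for this example.

```php
<?php
// One handle for the whole session, so cookies set by one response
// are sent with the next request.
function newSession(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_COOKIEJAR      => $cookieFile, // written when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // turns on curl's cookie engine
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 40,
    ]);
    return $ch;
}

// GET by default; POST when a body is supplied.
function fetch($ch, string $url, ?string $postBody = null, ?string $referer = null)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($referer !== null) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    if ($postBody !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // reset the method after a POST
    }
    return curl_exec($ch);
}

function loginAndFetch(string $loginPage, string $loginPost, string $target,
                       string $postBody, string $cookieFile)
{
    $ch = newSession($cookieFile);
    fetch($ch, $loginPage);                        // 1. pick up the session cookie
    fetch($ch, $loginPost, $postBody, $loginPage); // 2. post credentials with it
    $html = fetch($ch, $target);                   // 3. request the page after login
    curl_close($ch);
    return $html;
}
```

With the variables from the question, the call would look like echo loginAndFetch($url_from, $url_to, $url_get, $name_pass, __DIR__ . '/cookie.txt');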
Yeah, that's right. I forgot that. You must specify an absolute path to the cookies file. Use getcwd().
metrobalderas
Added getcwd(). to all cookie lines. Still no dice. :(
Xipe_Totec
Xipe, it could be a bug in php5, see revisions to my answer above.
John Himmelman
Yeah, it's not being written at all. Even if I use the unset(). And I know it's hard to debug the script with no login data, but considering this is the company's login, I can't give it out. I've tried making a dummy registration, but they don't allow it, so that's pretty much out of the picture.
Xipe_Totec
A quick question. I managed to get it to write down cookie info. And it starts off: #HttpOnly_.fastorder.newrock.es TRUE /store2009 TRUE ... So, uhm, that means we're talking about HTTPOnly cookies, which invalidates the whole effort of cURLing the site? Or is there a workaround?
Xipe_Totec
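On the HttpOnly question: the #HttpOnly_ prefix is how curl records the HttpOnly flag in its Netscape-format cookie file. The flag only stops JavaScript in a browser from reading the cookie, and curl still sends such cookies, so it should not block scraping. The sketch below parses one jar line into its seven tab-separated fields. parseCookieLine is a hypothetical helper, and in the sample line only the domain, path, the two TRUE flags and the cookie name come from this thread, while the expiry and value are made up.

```php
<?php
// Parse one line of a Netscape-format cookie file as written by curl.
// Returns null for blank lines, real comments and malformed lines.
function parseCookieLine(string $line): ?array
{
    $line = rtrim($line, "\r\n");
    $httpOnly = false;
    if (strpos($line, '#HttpOnly_') === 0) {
        // Not a comment: this prefix marks an HttpOnly cookie.
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null;
    }
    $parts = explode("\t", $line);
    if (count($parts) !== 7) {
        return null;
    }
    [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $parts;
    return [
        'domain'            => $domain,
        'includeSubdomains' => $subdomains === 'TRUE',
        'path'              => $path,
        'secure'            => $secure === 'TRUE',
        'expires'           => (int) $expires,
        'name'              => $name,
        'value'             => $value,
        'httpOnly'          => $httpOnly,
    ];
}

// Expiry (0) and value are invented for illustration.
$sample = "#HttpOnly_.fastorder.newrock.es\tTRUE\t/store2009\tTRUE\t0\tfrontend\texamplevalue";
$cookie = parseCookieLine($sample);
echo $cookie['name'], ' secure=', var_export($cookie['secure'], true), PHP_EOL;
```

If the fourth field in the real file is TRUE, as the excerpt suggests, the cookie is flagged secure, and curl sends secure cookies only over HTTPS. That would be worth checking, since the script's URLs all use http://.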
Xipe, use a network analyzer tool, such as Wireshark (http://wireshark.org) to debug the curl requests by comparing the browser's requests to your scripts. Since you'll be using the same tool to look at both requests (in their entirety), it will be much easier to find any inconsistencies.
John Himmelman