So I'm scraping a site that I have access to via HTTPS, I can login and start the process but each time I hit a new page (URL) the cookie Session Id changes. How do I keep the logged in Cookie Session Id?

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;

my $un = 'username';
my $pw = 'password';

my $url = 'https://subdomain.url.com/index.do';

my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);

$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");

print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";

my $searchURL='https://subdomain.url.com/search.do';
$agent->get($searchURL);    

print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";

The output:

After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0

After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0

Also, I think the site requires a CERT (well, in the browser it does). Would this be the correct way to add it?

$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...

Also, for the CERT I'm using the first option in this list; is this correct?

X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)
A: 

Set up the cookie jar, something akin to this:

my $cookie = HTTP::Cookies->new(file => 'cookie',autosave => 1,);
my $mech = WWW::Mechanize->new(cookie_jar => $cookie, ....);
hpavc
This cookie_jar => {} keeps the cookies in memory; shouldn't that work?
Phill Pafford
The session cookie won't be written to disk anyway, so this does nothing.
rjh
A: 

If your session cookie changes every page load, then likely you are not logging in correctly. But you could try forcing the JSESSIONID to be the same for each request. Construct your own cookie jar and tell WWW::Mechanize to use it:

my $cookie_jar = HTTP::Cookies->new(file => 'cookies', autosave => 1, ignore_discard => 1);
my $agent = WWW::Mechanize->new(cookie_jar => $cookie_jar, autocheck => 0);

The ignore_discard => 1 means that even session cookies are saved to disk (normally they are discarded for security reasons).

Then, after logging in, call:

$cookie_jar->save;

Then, after each request:

$cookie_jar->revert;  # re-loads the saved cookies from disk

Alternately, you could sub-class HTTP::Cookies and override the set_cookie method to reject re-setting the session cookie if it already exists.
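The subclass could look something like this rough sketch (untested; `StickyJar` is just an illustrative name). Once a JSESSIONID is in the jar, later attempts to replace it are ignored:

```perl
package StickyJar;
use strict;
use warnings;
use parent 'HTTP::Cookies';

# Refuse to overwrite an existing JSESSIONID, so the session id from the
# login sticks even if the server tries to hand out a new one per page.
sub set_cookie {
    my ($self, @args) = @_;
    my $key = $args[1];    # cookie name is the second argument
    if (defined $key && $key eq 'JSESSIONID') {
        my $seen;
        $self->scan(sub { $seen = 1 if $_[1] eq 'JSESSIONID' });
        return $self if $seen;    # keep the original session cookie
    }
    return $self->SUPER::set_cookie(@args);
}

package main;
use strict;
use warnings;

my $jar = StickyJar->new;
# then hand it to Mech as usual:
# my $agent = WWW::Mechanize->new(cookie_jar => $jar, autocheck => 0);
```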


Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it?

Some browsers (Internet Explorer for example) prompt for a security certificate even if one is not needed. If you are not getting any errors and the response content looks good, you probably don't need to set one.

If you do have a certificate file, check the POD for Crypt::SSLeay. Your certificate is PEM-encoded, so yes, you want to set $ENV{HTTPS_CERT_FILE} to the path of your cert. You might also want to set $ENV{HTTPS_DEBUG} = 1 to see what's happening.
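Concretely, Crypt::SSLeay is configured through environment variables set before the first HTTPS request. The file paths below are placeholders for wherever you exported the PEM files from your browser:

```perl
use strict;
use warnings;

# Crypt::SSLeay reads these environment variables; set them near the top
# of the script, before any HTTPS request is made.
$ENV{HTTPS_CERT_FILE} = '/path/to/client-cert.pem';   # the client certificate (PEM)
$ENV{HTTPS_KEY_FILE}  = '/path/to/client-key.pem';    # its private key, if in a separate file
$ENV{HTTPS_DEBUG}     = 1;                            # dump SSL handshake details
```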

rjh
Could I do the revert like this: $agent->cookie_jar->revert;
Phill Pafford
I think this is working, still testing!!!
Phill Pafford
`$agent->cookie_jar->revert` will work, yes.
rjh
+1  A: 

When your user-agent isn't doing something you think it should be doing, compare its requests with those of an interactive browser. Firefox plugins are handy for this sort of thing.

You're probably missing part of the process that the server expects; you may not be logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.

When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.

In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:

 use LWP::Debug qw(+); 
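Newer versions of LWP also let you attach a handler that dumps each outgoing request, which makes it easy to diff against what the browser sends. A sketch, assuming a recent enough LWP::UserAgent (which Mech inherits from):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $agent = WWW::Mechanize->new(autocheck => 0);

# Print every outgoing request (method, headers, body) to STDERR so it
# can be compared side by side with the interactive browser's requests.
$agent->add_handler(
    request_send => sub {
        my ($request) = @_;
        print STDERR $request->as_string, "\n";
        return;    # returning undef lets the request proceed normally
    }
);
```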

rjh already answered the certificate part of your question.

brian d foy
He's already using LWP::Debug :)
rjh
Ah, missed that. Then we should get to see the output.
brian d foy
So, what did you find the problem was, and which part of this answer helped you find it? :)
brian d foy