tags:

views:

659

answers:

4

This is my first post on SO, so be gentle. I'm not even sure if this belongs here, but here goes.

I want to access some information on one of my personal accounts. The website is poorly written and requires me to manually input the date I want the information for. It is truly a pain. I have been looking for an excuse to learn more Perl so I thought this would be a great opportunity. My plan was to write a Perl script that would login to my account and query the information for me. However, I got stuck pretty quickly.

my $ua = LWP::UserAgent->new;
my $url = url 'https://account.web.site';
my $res = $ua->request(GET $url);

The resulting web page basically says that my web browser is not supported. I tried a number of different values for

$ua->agent("");

but nothing nothings seems to work. Google-ing around suggests this method, but it also says that perl is used for malicious reasons on web sites. Do web sites block this method? Is what I am trying to do even possible? Is there a different language that would be more appropriate? Is what I'm trying to do even legal or even a good idea? Maybe I should just abandon my efforts.

Note that to prevent giving away any private information, the code I wrote here is not the exact code I am using. I hope that was pretty obvious, though.

EDIT: In FireFox, I disabled JavaScript and CSS. I logged in just fine without the "Incompatible browser" error. It doesn't seem to be JavaScript issue.

+3  A: 

You should probably look at WWW::Mechanize, which is a subclass of LWP::UserAgent that is oriented towards that sort of website automation. In particular, see the agent_alias method.

Some websites do block connections based on the User-Agent, but you can set that to whatever you want using Perl. It's possible that a website might also look for other request headers normally generated by a particular browser (like the Accept header) and refuse connections that don't include them, but you can add those headers too, if you figure out what it's looking for.

In general, it's impossible for a website to prevent a different client from impersonating a supported browser. No matter what it's looking for, you can eventually duplicate it.

It's also possible that it's looking for JavaScript support. In that case, you might look at WWW::Scripter, which is a subclass of WWW::Mechanize that adds JavaScript support. It's fairly new and I haven't tried it yet.

cjm
+2  A: 

Getting a different webpage with scraping

We have to make one assumption, the web-server will return the same output if given the same input. With this assumption we inescapably come to the conclusion we're not giving it the same input. There are two browsers, or http clients in this scenario: the one that is giving you the result you want (ex., Firefox, IE, Chrome, or Safari), and the one that is not giving you the result you want (ex., LWP, wget, or cURL).

Kill off the easy possibilities first

Before, continuing firstly make sure the simple UserAgents are the same, you can do this by browsing to whatsmyuseragent.com and setting the UserAgent string in the header of the other browser to whatever that website returns. You can also use Firefox's Web Developer's Toolbar to disable CSS, and JavaScript, Java, and meta-redirects: this will help you track down the problem by killing off the really simple stuff.

Now attempt to duplicate the working browser

Now with Firefox you can use FireBug to analyze the REQUEST that is sent. You can do this under the NET tab in FireBug, different browsers should have tools that can do what FireBug does with FireFox; however, if you don't know the tool in question you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate because they work at a lower level which at least in my experience leaves less room for error. For example, you'll see things like meta-redirects the browser is doing which sometimes FireBug can lose track of.

After you understand the first web-request that works, do your best to set the second web-request to that of the first. By this I mean setting the request-headers properly and other request elements. If this still doesn't work you have to know what the second browser is doing to see what is wrong.

Troubleshooting

In order to troubleshoot this, we must have a total understanding of the requests from both browsers. The second browser is usually tricker, these are often libraries and non-interactive command line browsers that lack the ability to check the request. If they have the ability to dump the request you might still opt to simply check them anyway. To do this I suggest the wireshark and tshark suite. Immediately, you should be warned that because these operate below the browser. By default, you'll see the actual network (IP) packets, and data-link frames. You can filter out what you need specifically with a command like this.

sudo tshark -i <interface> -f tcp -R "http.request" -V |
perl -ne'print if /^Hypertext/../^Frame/'

This will capture all of the TCP packets, display-filter only the http.requests, then perl filter for only layer 4 HTTP stuff. You might want to add to the display filter to only grab a single web server too -R "http.request and http.host == ''"

You're going to want to check everything to see if the two requests are in line, cookies, GET url, user-agent, etc. Make sure the site doesn't do something goofy.

Updated Jan 23 2010: Based on the new information I would suggest setting Accept, and Accept-Language, Accept-Charset and Accept-Encoding. You can do that with through $ua->default_headers(). If what you demand is a lot more functionality out of your useragent, you can always subclass it. I took this aproach for my GData API, you can find my example on of a UserAgent subclass on github.

Evan Carroll
I disabled CSS and JavaScript and the page loads fine, although a bit funny looking with no CSS. I'll use FireBug and wireshark to find out what LWP is doing differently than my browser and get back to you.
SaulBack
I think I may have bit off a bit more than I can chew. Wireshark catches the packets, but I am not quite sure how to decipher them. Where are the Requests exactly? Wireshark records 49 frames. Some of these are the ack syn handshake, others handle the sSL stuff, and I'm just not sure where I need to look.
SaulBack
With the tshark command I accessed the web page using firefox and the LWP. These were from the firefox output but not the LWP output. Are they configurable in Perl? Cache-Control: max-age=0\r\n Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n Accept-Encoding: gzip,deflate,sdch\r\n Accept-Language: en-US,en;q=0.8\r\n Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n
SaulBack
I've updated the question. You're still leaving out the url, (which could be wrong), including the protocol (http/https), and the get/post key values, which could be encoded wrong or different. This is also just one request, you could be having a problem with a redirect. You should really run this for one whole request cycle with firefox, and then one whole request cycle with LWP, then paste the headers in question. An HTTP server is very hard to make non-REST, the same input twice should give you the same output. Chances are your input is wrong but we don't have the data to answer this right.
Evan Carroll
You shouldn't really need to capture and interpret packets for this. A modern web browser such as Safari or Firefox (with plugins) will show you everything it did.
brian d foy
I'm saying that most of your advice is bad. You don't need to mess with TCP packets, which is causing problems here.
brian d foy
This isn't your answer this is mine. You're welcome to submit your own, SO encourages it, rather than nagging me and the questioner with your inane opinion. I'm using the packet introspection to check the request LWP is sending. And, that command I gave is direct, and filters out exactly what you need to know.
Evan Carroll
tcpdump is very unlikely to be the right tool.
darch
I didn't suggest tcpdump, and it isn't anywhere in my answer -- the answer you most probably didn't read.
Evan Carroll
A: 

I tried a number of different values for

$ua->agent("");

but nothing nothings seems to work.

Well, would you like to tell us what those things you tried were?

What I normally do is type

javascript:prompt('your agent string is',navigator.userAgent)

into my regular browser's URL bar, hit enter, and cut and paste what it tells me. Surely using wireshark and monitoring actual packets is overkill? The website you're trying to get to has no way of knowing you're using Perl. Just tell it whatever it expects to hear.

AmbroseChapel
A: 

Tools: Firefox with TamperData and LiveHTTPHeaders, Devel::REPL, LWP.

Analysis: In the browser, turn off Javascript and Java, delete any cookies from the target web site, start TamperData logging, log in to web site. Stop TamperData logging and look back through the many requests you likely placed during the login process. Find the first request (the one you made on purpose) and look at its details.

Experimentation: Start re.pl, and start recreating the browser's interaction.

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
  agent      => $the_UA_of_the_browser,
  cookie_jar => HTTP::Cookies->new(hide_cookie2 => 1),
);
$ua->default_headers(HTTP::Headers->new(
  %the_headers_sent_by_the_browser,
));

my $r = $ua->get($the_URL);
$r->content($r->decoded_content); print $r->as_string;

So that's step one. If you get mismatched responses at any point, you did something wrong. You can usually[1] find out what by looking at $r->request and comparing with the request Firefox sent. The important thing is to remember that there is no magic and that you know everything the server knows. If you can't get the same response to what appears to be the same request, you missed something.

Getting to the first page is usually not enough. You'll likely need to parse forms (with HTML::Form), follow redirects (as configured above, UA does that automatically, but sometimes it pays to turn that off and do it by hand), and try to reverse engineer a weirdly-hacked-together login sequence from the barest of hints. Good luck.

[1]: Except in the case of certain bugs in LWP's cookies implementation that I won't detail here. And even then you can spot it if you know what you're looking for.

darch
But using this method you can never through the same interface tell for sure where the mismatch is between requests. With my answer, using tshark, the same command on both requests should produce the same output if everything is the same.
Evan Carroll