I want to programmatically download a webpage that requires a login to view. Is there any sane way of doing this? By looking at the HTTP headers and such, I can see the username/password being passed as POST data, but requesting the page with this info attached isn't good enough. I think cookies are involved too, and they appear to contain some kind of encrypted authorisation data.

Is there any way of faking this? The language isn't too important here, but something like Perl that can be run on Linux with relative ease would be nice. Or maybe a command-line browser could be scripted?

+1  A: 

PHP's cURL bindings would do it. Also check here whether this solution is right for you.

Iznogood
+1  A: 

Yes, you can do this with the curl command-line tool or the libcurl library. You need to figure out what's supposed to be in the cookies, and then pass them with curl's -b option or the equivalent libcurl API.
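A minimal sketch of that cookie flow with the curl command-line tool. The URL and form-field names here are placeholders; substitute whatever you saw in the POST data when you watched the HTTP headers:

```shell
# Log in once: -d sends the credentials as POST data, -c saves whatever
# cookies the server sets (including any encrypted authorisation token).
curl -c cookies.txt -d 'username=me' -d 'password=secret' \
     'https://timetable.example.edu/login'

# Fetch the protected page, sending the saved cookies back with -b.
curl -b cookies.txt -o timetable.html \
     'https://timetable.example.edu/timetable'
```

Adding -L is often useful too, since login pages tend to redirect after a successful POST.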

You can also perform HTTP Basic authentication with curl.
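If the site happens to use Basic auth rather than a login form, a single flag is enough (URL again a placeholder):

```shell
# curl builds the Authorization header itself from -u user:password.
curl -u 'me:secret' 'https://timetable.example.edu/'
```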

If the page is really sophisticated, you'll have to do HTML parsing or even JavaScript interpretation to extract the cookie data beforehand. That's still doable, but not with curl alone.
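A common case of this is a hidden token embedded in the login form that must be POSTed back with the credentials. A rough sketch of scraping it in the shell ("csrf_token" is a made-up field name; use whatever the real form contains):

```shell
# Fetch the login page, keep its cookies, and pull the hidden token
# out of the form HTML with grep/sed.
token=$(curl -s -c cookies.txt 'https://timetable.example.edu/login' \
        | grep -o 'name="csrf_token" value="[^"]*"' \
        | sed 's/.*value="\([^"]*\)".*/\1/')

# Log in, echoing the token back alongside the credentials.
curl -b cookies.txt -c cookies.txt \
     -d "csrf_token=$token" -d 'username=me' -d 'password=secret' \
     'https://timetable.example.edu/login'
```

For anything more involved than a single hidden field, a real HTML parser beats grep/sed.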

As a general note, anything a web browser can do can be scripted - Turing-completeness and all that. "Unscriptable" captive portals like the ones BlueSocket sells are a load of bunk; they're basically just obfuscated web pages. They'll slow you down, but they can never stop you - they have to give you the keys for the system to work at all!

Borealid
It's my university's timetable info I need this for, which they took down about five minutes after I posted the question (until term starts). I'll have to wait until then to try it, but this looks like it should work. Thank you.
tsv