views:

1770

answers:

10

I would like to do the following: log into a website, click a couple of specific links, then click a download link. I'd like to run this as either a scheduled task on Windows or a cron job on Linux. I'm not picky about the language I use, but I'd like this to run without putting a browser window up on the screen, if possible.

A: 

I once did that using the Internet Explorer ActiveX control (WebBrowser, MSHTML). You can instantiate it without making it visible.

This can be done with any language that supports COM (Delphi, VB6, VB.NET, C#, C++, ...).

Of course this is a quick-and-dirty solution and might not be appropriate in your situation.

DR
+1  A: 

Except for the automatic download of the file (as that is a dialog box), a WinForms form with an embedded web browser control will do this.

You could look at WatiN and WatiN Recorder. They may help with C# code that can log in to your website, navigate to a URL, and possibly even help automate the file download.

YMMV though.

Wayne
+1  A: 

If the links are known (i.e., you don't have to search the page for them), then you can probably use wget. I believe that it will do the state management across multiple fetches; cookies can be carried between invocations with its --save-cookies, --load-cookies, and --keep-session-cookies options.

If you are a little more enterprising, then I would delve into the new goodies in Python 3.0. They redid the interface to their HTTP stack and, IMHO, have a very nice interface that is well suited to this type of scripting.
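As a rough illustration of the Python 3 HTTP stack mentioned above, here is a minimal sketch that logs in with a form POST, keeps the session cookie, visits a few pages, and saves a file. The function names, form field names, and any URLs passed in are assumptions made for illustration; a real site may also need hidden form fields or tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode form fields into bytes suitable for a POST body."""
    return urllib.parse.urlencode(fields).encode("ascii")


def login_and_download(opener, login_url, fields, page_urls, file_url, dest):
    """Log in, visit the intermediate pages, then save the file to dest."""
    # Passing data= makes this a POST; the cookie jar captures the session.
    opener.open(login_url, data=encode_form(fields)).close()
    for url in page_urls:  # the "click a couple of links" steps
        opener.open(url).close()
    with opener.open(file_url) as response, open(dest, "wb") as out:
        out.write(response.read())


# Building the session needs no network access.
opener, jar = make_session()
```

A scheduled task or cron job could then call login_and_download(opener, ...) with the site's real login URL, field names, and download URL. No browser window is involved at any point.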

D.Shawley
A: 

You can use Watir with Ruby or WatiN with Mono.

Paco
A: 

You can also use Live HTTP Headers (a Firefox extension) to record the headers that are sent to the site (Login -> Links -> Download Link) and then replicate them with PHP using fsockopen. The only thing you'll probably need to vary is the cookie value that you receive from the login page.
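The same replay idea can be sketched in Python with urllib instead of PHP's fsockopen. This is only an illustration: the recorded header dump, the helper names, and the cookie values below are all made up, and a real recording would come from the browser extension.

```python
import urllib.request


def parse_recorded_headers(raw):
    """Turn a 'Name: value' header dump into a dict, skipping the request line."""
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line or line.startswith(("GET ", "POST ")):
            continue
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


def build_replay_request(url, raw_headers, session_cookie):
    """Build a request that reuses the recorded headers with a fresh cookie."""
    headers = parse_recorded_headers(raw_headers)
    headers.pop("Host", None)  # urllib sets Host from the URL
    headers["Cookie"] = session_cookie  # the one value that changes per login
    return urllib.request.Request(url, headers=headers)


# Hypothetical recording of the final download request.
RECORDED = """\
GET /files/report.zip HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0
Referer: http://example.com/reports
Cookie: session=OLD
"""

request = build_replay_request(
    "http://example.com/files/report.zip", RECORDED, "session=NEW"
)
```

Passing the resulting request to urllib.request.urlopen would send it; the cookie string would be whatever the login response handed back.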

Alekc
A: 

libcurl could be used to create something like this.

Adam Pierce
A: 

Can you not just use a download manager?

There are better ones, but FlashGet has browser integration and supports authentication. You can log in, click a bunch of links, queue them up, and schedule the download.

You could write something that, say, acts as a proxy that catches specific links and queues them for later download, or a JavaScript bookmarklet that rewrites links to point to "http://localhost:1234/download_queuer?url=" + $link.href and have that queue the downloads. But you'd be reinventing the download-manager wheel, and with authentication it can get more complicated.

Or, if you want the "login, click links" bit to be automated as well, look into screen scraping. Basically, you load the page via an HTTP library, find the download links, and download them.

Slightly simplified example, using Python:

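# Python 2, with BeautifulSoup 3
import urllib
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

# HTTP basic auth: the credentials are embedded in the URL
url = "http://%s:%s@example.com" % ("username", "password")
src = urllib.urlopen(url)
soup = BeautifulSoup(src)

for link_tag in soup.findAll("a", href=True):  # skip anchors without an href
    link = urljoin(url, link_tag["href"])  # resolve relative links
    filename = link.split("/")[-1]  # get everything after last /
    if filename:  # skip links that end in "/"
        urllib.urlretrieve(link, filename)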

That would download every link on example.com, after authenticating with the username/password of "username" and "password". You could, of course, find more specific links using BeautifulSoup's HTML selectors (for example, you could find all links with the class "download", or URLs that start with http://cdn.example.com).

You could do the same in pretty much any language.
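To illustrate the more specific selection, here is a small sketch that keeps only links with the class "download" or URLs starting with a given prefix. It uses Python 3's standard html.parser rather than BeautifulSoup so that it has no dependencies; the class name DownloadLinkParser and the sample page are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DownloadLinkParser(HTMLParser):
    """Collect hrefs of <a> tags with class "download" or a given URL prefix."""

    def __init__(self, base_url, url_prefix):
        super().__init__()
        self.base_url = base_url
        self.url_prefix = url_prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return  # anchors without an href cannot be downloaded
        url = urljoin(self.base_url, href)  # resolve relative links
        classes = (attrs.get("class") or "").split()
        if "download" in classes or url.startswith(self.url_prefix):
            self.links.append(url)


# Hypothetical page content standing in for the fetched HTML.
PAGE = """
<a href="/about">About</a>
<a class="download" href="/files/report.zip">Report</a>
<a href="http://cdn.example.com/data.csv">Data</a>
<a name="top">No href</a>
"""

parser = DownloadLinkParser("http://example.com/", "http://cdn.example.com")
parser.feed(PAGE)
print(parser.links)
```

Only the report and the CDN file end up in parser.links; the "About" link and the anchor without an href are skipped.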

dbr
A: 

.NET contains System.Windows.Forms.WebBrowser. You can create an instance of this, send it to a URL, and then easily parse the HTML on that page. You could then follow any links you found, etc.

I have worked with this object only minimally, so I'm no expert, but if you're already familiar with .NET then it would probably be worth looking into.

theprise
+5  A: 

Check out the HtmlUnit project, implemented in Java:

http://htmlunit.sourceforge.net/

It is exactly what you're looking for: a headless browser, complete with JavaScript support. You can use it in .NET too if you want, although you'll need to perform an IKVM conversion.

Nathan Ridley
+1, HTMLUnit's JS support is a big plus
orip
+2  A: 

Check out twill, a very convenient scripting language for precisely what you're looking for. From the examples:

setlocal username <your username>
setlocal password <your password>

go http://www.slashdot.org/
formvalue 1 unickname $username
formvalue 1 upasswd $password
submit

code 200     # make sure form submission is correct!

There's also a Python API if you're looking for more flexibility.

orip