I need to migrate our website from a proprietary CMS that uses Active Server Pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site.

An additional challenge is that the site uses SSL and is protected with forms-based authentication. I have the necessary credentials, and I can grab the cookie that validates the session, but I'm not sure where to go from here and I don't want to reinvent the wheel if existing tools can help me.

EDIT - I'm using Windows.

+1  A: 
wget --http-user=username --http-password=password -r http://yoursite.com

This will fetch the entire site recursively. If you're on Windows, you'll want to install Cygwin or something similar to use it, though I believe there are also native Windows builds of wget that you can download.
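If getting wget running on Windows is a hassle, here is a minimal sketch of the same HTTP-auth fetch using only Python's standard library. The URL and credentials are placeholders, and it grabs a single page rather than mirroring the whole site recursively, so wget is still the easier route for a full mirror.

import urllib.request

# Placeholders: swap in the real site URL and credentials.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://yoursite.com/", "username", "password")
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))

# Fetch one page over HTTPS with basic auth.
with opener.open("https://yoursite.com/") as resp:
    print(resp.read()[:200])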

Lance Kidwell
+1  A: 

If you know Perl, you might like WWW::Mechanize. It depends on the level of automation you are trying to achieve; wget would probably do just fine for some cases.

zoul
+1  A: 

You have a lot of options. One thing to consider is how complex the authentication is. Besides wget, you can look at curl (a very robust option with bindings for many different languages), Python's urllib, Apache HttpClient, WWW::Mechanize, etc.
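For example, here is a minimal sketch of the forms-based login flow with Python's urllib and http.cookiejar. The login URL and the form field names ("user", "pass") are assumptions; substitute whatever the CMS's login form actually posts.

import http.cookiejar
import urllib.parse
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Post the login form; the session cookie lands in the cookie jar.
login_data = urllib.parse.urlencode({"user": "username", "pass": "password"}).encode()
with opener.open("https://example.com/login.asp", login_data) as resp:
    resp.read()  # drain the response; the jar now holds the session cookie

# Subsequent requests through the same opener send the cookie automatically.
with opener.open("https://example.com/protected/page.asp") as resp:
    print(resp.read()[:200])

From there you would still need your own crawl loop, or wget/curl fed with the saved cookie, to walk the rest of the site.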

Matthew Flaschen
+3  A: 

wget may be a good tool for you to use

wget --load-cookies cookies.txt --mirror --page-requisites http://example.com/

Add --convert-links if you wish to make it more suitable for a local archive, rather than something you can re-upload somewhere.

A Windows version of wget is available from the GnuWin32 project on SourceForge: http://gnuwin32.sourceforge.net/packages/wget.htm
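If you only have the session cookie's name and value (say, copied from your browser), here is a minimal sketch that writes it out in the Netscape cookies.txt format that --load-cookies expects, using Python's standard library. The cookie name, value, and domain below are placeholders; use whatever your real session cookie shows.

import http.cookiejar

jar = http.cookiejar.MozillaCookieJar("cookies.txt")
jar.set_cookie(http.cookiejar.Cookie(
    version=0,
    name="ASPSESSIONID",          # placeholder cookie name
    value="ABC123SESSIONVALUE",   # placeholder cookie value
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=True,                  # the site is served over SSL
    expires=2147483647,           # arbitrary far-future expiry so wget keeps it
    discard=False,
    comment=None, comment_url=None,
    rest={},
))
jar.save(ignore_discard=True, ignore_expires=True)

Then the wget command above, with --load-cookies cookies.txt, should send the authenticated session cookie with every request.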

JensenDied