I am trying to use httrack (http://www.httrack.com/) to download a single page, not the entire site. For example, when using httrack to download www.google.com, it should only download the HTML found under www.google.com, along with all stylesheets, images and JavaScript, and not follow any links to images.google.com, labs.google.com, www.google.com/subdir/, etc.

I tried the -w option but that did not make any difference.

What would be the right command?

EDIT

I tried using httrack "http://www.google.com/" -O "./www.google.com" "http://www.google.com/" -v -s0 --depth=1, but then it doesn't copy any images.

What I basically want is to download just the index file of that domain along with all of its assets, but not the content of any external or internal links.

A: 

The purpose of HTTrack is to follow links. Try setting --ext-depth=0.
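
Untested, but combined with a depth limit the full invocation would presumably look something like this (assuming your build accepts the long-option spellings):

httrack "http://www.google.com/" -O "./www.google.com" -v --depth=1 --ext-depth=0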

Gregory Pakosz
A: 

Looking at the example:

httrack "http://www.all.net/" -O "/tmp/www.all.net" "+*.all.net/*" -v

The last part is a filter (an httrack scan rule using wildcards, not a true regex). Just make a filter that matches everything under the host you want.

httrack "http://www.google.com.au/" -O "/tmp/www.google.com.au" "+*.google.com.au/*" -v ---depth=2 --ext-depth=2

I had to use the localised domain, otherwise I got a redirect page. You should localise to whichever Google domain you get redirected to.
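
If you only want the front page plus its assets, it might be enough to drop the depth to 1 and add -n (--near), which the manual describes as fetching non-HTML files (such as images) located "near" a downloaded page; untested:

httrack "http://www.google.com.au/" -O "/tmp/www.google.com.au" -v --depth=1 --near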

Nazarius Kappertaal
That helped, but was not quite right. Could you please see my edit?
Max
This seems to copy the images and the JS.
Nazarius Kappertaal
A: 

Could you use wget instead of httrack? wget -p will download a single page and all of its “prerequisites” (images, stylesheets).
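
For example, something like this (the extra flags are optional: -k rewrites links for offline viewing, -H allows fetching requisites hosted on other domains, -E adds .html extensions where needed):

wget -p -k -H -E "http://www.google.com/"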

Kevin Reid
wget would be my fallback solution if httrack can't do the job.
Max