I would like to save a web page programmatically.

I don't mean merely saving the HTML. I would also like to store all associated files automatically (images, CSS files, maybe embedded SWF, etc.), and ideally rewrite the links for local browsing.

The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.

+5  A: 

Take a look at wget, specifically the -p flag

-p  --page-requisites
This option causes Wget to download all the files
that are necessary to properly display
a given HTML page. This includes such
things as inlined images, sounds, and
referenced stylesheets.

The following command:

wget -p http://<site>/1.html

will download 1.html and all the files it requires.
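Since the question asks for something programmatic, here is a minimal Python sketch of driving wget from a bookmarks application. It assumes wget is installed and on the PATH. The function names `build_wget_command` and `save_page` and the `dest_dir` parameter are made up for illustration. In addition to `--page-requisites` (the long form of `-p`), it passes `--convert-links`, which asks wget to rewrite links for local browsing, and `--directory-prefix` to choose where the files go.

```python
import subprocess


def build_wget_command(url, dest_dir):
    """Return the argv list for saving one page plus its requisites.

    Hypothetical helper: only the wget flags themselves are real.
    """
    return [
        "wget",
        "--page-requisites",   # same as -p: fetch images, CSS, etc.
        "--convert-links",     # rewrite links so the copy works offline
        "--directory-prefix", dest_dir,
        url,
    ]


def save_page(url, dest_dir):
    """Run wget and return its exit status (0 means no problems)."""
    # No check=True here: one missing image makes wget exit non-zero
    # even though the rest of the page may have been saved.
    return subprocess.run(build_wget_command(url, dest_dir)).returncode
```

Calling `save_page("http://<site>/1.html", "cache")` would place the downloaded files under `cache/`. Whether the saved copy renders like the original still depends on what wget can discover from the markup.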

Josh
And why did someone downvote me? I mean the -1 doesn't bother me so much as I'd like to correct any issues there might be with my answer...
Josh
This looks pretty good, except sometimes the output doesn't look the same as the page that I copied. For example, I tried 'wget -p' on http://ffffound.com/image/3d3795b5447291980a40f3719dea4b5b15ff3ec9. However, the related images, which are laid out as a horizontal list on the original page, now become a long vertical list, one per line. Why?
Joseph Turian
+2  A: 

On Windows, you can run IE as a COM object and pull everything out of it.

Alternatively, you can start from the Mozilla source.

In Java, there is Lobo.

Or use commons-httpclient and write a lot of code.

bmargulies
+1 if you need stuff like background images referenced in stylesheets and CSS imports, you need a real-world HTML and CSS parser. That's half a browser there already, so you might as well just do it with a real browser. Easiest to embed IE, or work as a Firefox extension.
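To illustrate what is hiding in the stylesheets, here is a rough Python sketch that pulls `url(...)` and `@import` references out of CSS text with regular expressions. This is deliberately not a real CSS parser: it will be fooled by comments, escapes, and unusual quoting, which is exactly why a real browser engine is the safer route. The function name `css_references` and the example URLs are made up for illustration.

```python
import re
from urllib.parse import urljoin

# url(foo.png), url('foo.png') or url("foo.png")
_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# @import "foo.css" or @import 'foo.css' (the url(...) form is caught above)
_IMPORT_RE = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)


def css_references(css_text, base_url):
    """Return absolute URLs referenced by a stylesheet (approximate)."""
    raw = [m.group(2) for m in _URL_RE.finditer(css_text)]
    raw += [m.group(2) for m in _IMPORT_RE.finditer(css_text)]

    found = []
    for ref in raw:
        ref = ref.strip()
        # Skip empty matches and inline data: URIs, which need no download.
        if not ref or ref.lower().startswith("data:"):
            continue
        absolute = urljoin(base_url, ref)  # resolve relative to the CSS file
        if absolute not in found:
            found.append(absolute)
    return found
```

Each imported stylesheet would have to be fetched and scanned the same way, and relative references resolve against the stylesheet's URL, not the page's.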
bobince
A: 

You could try the MHTML format (which is what IE uses). http://en.wikipedia.org/wiki/MHTML

In other words, you'd be downloading each object (image, css, etc.) to your computer, and then "embedding" them, via Base64, into a single file.
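As a minimal sketch of that embedding step, Python's standard `email` package can assemble such a file, since MHTML is a `multipart/related` MIME message (RFC 2557) in which each part carries a `Content-Location` header. This assumes the HTML and the resources have already been downloaded. The function name `build_mhtml` and the shape of the `resources` argument are made up for illustration.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, resources):
    """Pack a page and its resources into one MHTML document.

    resources: dict mapping absolute URL -> (mime_type, raw bytes).
    Returns the archive as bytes, ready to write to a .mht file.
    """
    archive = MIMEMultipart("related", type="text/html")

    # The first part is the HTML page itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every other object becomes a Base64-encoded part, identified by
    # the URL the page uses to refer to it.
    for url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

The result can be written with `open("page.mht", "wb").write(...)`. How well a given browser opens the file is a separate question that this sketch does not address.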

Michael Todd
How do I program it?
Joseph Turian
What programming language do you want to use?
Michael Todd
Here's one that uses VB: http://www.codeproject.com/KB/aspnet/aspnethtml2mht.aspx
Michael Todd