views:

269

answers:

3

Hi,

I've been struggling to find an example of some C# code (I'm using C# in Visual Studio 2008 Express) that can programmatically save an entire web page (given a URL), including the images and formatting (e.g. CSS). The intention is that in a subsequent step I'd ship this off (not sure how yet) so it could be viewed later in a browser.

Does anyone have an example of the simplest approach (leveraging .NET Framework methods) to save an entire web page? Saving as one page with a subdirectory for images, or otherwise. Basically the same as what you get in a browser when you choose "save entire web page".

Thanks

+1  A: 

From CodeProject: ZetaWebSpider

Tzury Bar Yochay
Greg
+2  A: 

The simplest way is probably to add a WebBrowser control to your application and point it at the page you want to save using the Navigate() method.

Then, when the document has loaded, call the ShowSaveAsDialog method. The user can then save the page as a single file, or a file with images in a subdirectory.
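As a rough illustration of the two steps above, here is a minimal sketch of a Windows Forms application that navigates a WebBrowser control and opens the Save As dialog once the page has loaded. The class name SavePageForm and the example.com URL are placeholders I made up, not part of the original suggestion.

```csharp
using System;
using System.Windows.Forms;

// Minimal sketch: a form hosting a WebBrowser control that opens the
// "Save As" dialog once the top-level document has finished loading.
public class SavePageForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();

    public SavePageForm(string url)
    {
        browser.Dock = DockStyle.Fill;
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate(url);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted can also fire for frames; wait for the main document.
        if (e.Url != browser.Url) return;

        browser.DocumentCompleted -= OnDocumentCompleted;
        browser.ShowSaveAsDialog(); // the user picks the file name and format here
    }

    [STAThread] // the WebBrowser control requires a single-threaded apartment
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new SavePageForm("http://example.com/")); // placeholder URL
    }
}
```

The dialog still needs the user to confirm the file name and format, so this is not a fully unattended save.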

[Update]

Having now noticed "programmatically" in your question, the above approach is not ideal, as it requires either user involvement or delving into the Windows API to send input using SendKeys or similar.

There is nothing built into the .NET Framework that does all of what you ask.

So my revised approach would be:

  • Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy).
  • Load this into an HtmlAgilityPack document, where you can easily query the document to get lists of all image elements, stylesheet links, etc.
  • Then make a separate web request for each of these files and save them to a subdirectory.
  • Finally, update all relevant links in the main page to point to the items in the subdirectory.

In effect you would be implementing a very simple web browser. You may run into issues with pages that use JavaScript to dynamically alter or request page content, but for most pages this should give acceptable results.
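The four steps above could be sketched roughly as follows. This is only an outline under some assumptions: it uses WebClient (a convenience wrapper over the web request classes) rather than HttpWebRequest directly, it needs a reference to the third-party HtmlAgilityPack assembly, and the names PageSaver, SavePage, SaveAssets, "page_files" and "index.html" are hypothetical. It handles only img and stylesheet link elements, ignores images referenced from inside the CSS, and keeps whatever text encoding WebClient assumes.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack; // third-party library, referenced separately

public static class PageSaver
{
    public static void SavePage(string url, string outputDir)
    {
        Uri pageUri = new Uri(url);
        string assetDirName = "page_files"; // hypothetical subdirectory name
        string assetDir = Path.Combine(outputDir, assetDirName);
        Directory.CreateDirectory(assetDir);

        using (WebClient client = new WebClient())
        {
            // Step 1: fetch the main HTML document as a string.
            string html = client.DownloadString(pageUri);

            // Step 2: parse it so images and stylesheets can be queried.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Steps 3 and 4: download each asset and rewrite its link.
            int counter = 0;
            counter = SaveAssets(doc, "//img[@src]", "src",
                                 pageUri, client, assetDir, assetDirName, counter);
            SaveAssets(doc, "//link[@rel='stylesheet'][@href]", "href",
                       pageUri, client, assetDir, assetDirName, counter);

            doc.Save(Path.Combine(outputDir, "index.html"));
        }
    }

    private static int SaveAssets(HtmlDocument doc, string xpath, string attribute,
                                  Uri pageUri, WebClient client,
                                  string assetDir, string assetDirName, int counter)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return counter; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in nodes)
        {
            // Resolve relative links against the page URL.
            Uri assetUri;
            if (!Uri.TryCreate(pageUri, node.GetAttributeValue(attribute, ""), out assetUri))
                continue;

            // Skip anything that is not a plain web resource (e.g. data: URIs).
            if (assetUri.Scheme != Uri.UriSchemeHttp && assetUri.Scheme != Uri.UriSchemeHttps)
                continue;

            // Prefix with a counter so files with the same name do not collide.
            string fileName = counter + "_" + Path.GetFileName(assetUri.LocalPath);
            counter++;

            try
            {
                client.DownloadFile(assetUri, Path.Combine(assetDir, fileName));
                node.SetAttributeValue(attribute, assetDirName + "/" + fileName);
            }
            catch (WebException)
            {
                // Download failed: leave the original link untouched.
            }
        }
        return counter;
    }
}
```

A call such as PageSaver.SavePage("http://example.com/", @"C:\temp\saved") would then write index.html plus a page_files subdirectory, assuming the output directory is writable.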

Ash
Almost there...Now we just have to do a SendKeys to fill in the file name and push the OK button. Yuck.
Robert Harvey
+1 I would go with this approach. One thing you may want to consider is whether to save it as an MHTML (MIME HTML) document. It takes some additional work, but there is no need for subdirectories.
Rohit
Thanks for the pointer to "HTMLAgilityPack" - I'll dig into this. It seems to be a very widely used library - is it the de facto standard for working with HTML for .NET developers?
Greg
A: 

It's definitely not elegant, but you could navigate a System.Windows.Forms.WebBrowser to the URL and then call its ShowSaveAsDialog() method to save the page.

STW
@Yoooder, try to read other answers before posting an answer identical to an existing one!
Ash
I did. I had mine in-the-works and was double-checking the method name on MSDN, didn't see yours until after I'd posted.
STW