I have an ASP.NET website which contains a few pages whose generated content I'd like to export and send to another service for archiving.

The best way I can fathom doing this is to grab the stream and dump it to a file, which is easy enough to do. My main challenge would be following the external resources and including them in the zip file. I would like to include stylesheets and images, as well as images referenced inside the stylesheets. I need the stream at request time because the generated stream is dependent on things like the current session.
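
For context, this is roughly what I mean by grabbing the stream - a minimal sketch using a Response.Filter wrapper (CapturingFilter is just an illustrative name, not an existing class):

    using System;
    using System.IO;
    using System.Text;

    public class CapturingFilter : Stream
    {
        private readonly Stream _inner;
        private readonly MemoryStream _copy = new MemoryStream();

        public CapturingFilter(Stream inner) { _inner = inner; }

        // The generated markup, available once the page has finished rendering.
        public string CapturedHtml
        {
            get { return Encoding.UTF8.GetString(_copy.ToArray()); }
        }

        public override void Write(byte[] buffer, int offset, int count)
        {
            _copy.Write(buffer, offset, count);  // keep a copy for archiving
            _inner.Write(buffer, offset, count); // still send the bytes to the client
        }

        public override void Flush() { _inner.Flush(); }

        // Remaining Stream plumbing; this filter is write-only.
        public override bool CanRead { get { return false; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return true; } }
        public override long Length { get { throw new NotSupportedException(); } }
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }
        public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
        public override void SetLength(long value) { throw new NotSupportedException(); }
    }

    // Wired up early in the request, e.g. in Page_Init or an HttpModule:
    //   var filter = new CapturingFilter(Response.Filter);
    //   Response.Filter = filter;
    //   // after rendering, filter.CapturedHtml holds the page as the user received it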

I'm also wondering if perhaps all these locations should be normalized, in other words, rerouting the references to the same directory where the main document resides.

I can guarantee that all external resources will be located on the same server.

Is this something that can be done with the HtmlAgilityPack? It seems I may be able to do a lot of the manual work with this utility, but am I going to be able to use it to find images referenced in stylesheets?

Trying to do some discovery on this topic while completing some other tasks.

Thanks.

A: 

May I propose an approach you might consider, especially if your goal is to have a record of what the user saw in the browser rather than the actual markup their browser was served?

The System.Windows.Forms.WebBrowser class is designed to allow embedding of a browser within a Windows form. Once the control has rendered a page, you can extract it as a bitmap using the DrawToBitmap() method.

If you were to store the page response in the archive, you would also have to worry about the version of each externally referenced resource (images, CSS files, etc.) that existed at the time the page archive was made. Ugh.

Maybe you could host the WebBrowser in an invisible form created by a Windows service? You would then simply queue the URL of each page to be archived to this service, which would render the page and add the bitmap to your archive.
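
Something along these lines is what I have in mind - a rough sketch only, assuming an STA thread with its own message loop (CapturePage and the 1024x768 size are just placeholders):

    using System;
    using System.Drawing;
    using System.Threading;
    using System.Windows.Forms;

    static class PageCapture
    {
        public static void CapturePage(string url, string outputPath)
        {
            var worker = new Thread(() =>
            {
                using (var browser = new WebBrowser())
                {
                    browser.ScrollBarsEnabled = false;
                    browser.Size = new Size(1024, 768);
                    browser.DocumentCompleted += (s, e) =>
                    {
                        // Render the finished page into a bitmap and save it.
                        using (var bitmap = new Bitmap(browser.Width, browser.Height))
                        {
                            browser.DrawToBitmap(bitmap,
                                new Rectangle(0, 0, browser.Width, browser.Height));
                            bitmap.Save(outputPath);
                        }
                        Application.ExitThread(); // stop the message loop below
                    };
                    browser.Navigate(url);
                    Application.Run(); // pump messages until DocumentCompleted fires
                }
            });
            worker.SetApartmentState(ApartmentState.STA); // WebBrowser requires an STA thread
            worker.Start();
            worker.Join();
        }
    }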

Canoehead
I need the source; it will be parsed by the service I'm passing it to
Dave
A: 

The easiest way to do this is to use an external application to scrape your site and convert all the pages to flat HTML files. It will not only follow links, but also grab all images/CSS/JavaScript files and change any references to them to be document-relative. This means you'll have a folder of HTML/CSS/JS files that are browsable locally. The app I used is called HTTrack - http://www.httrack.com/. I found it works pretty well.

JonoW
I'm not sure this will work because the page that's rendered is dependent on the current session. Can I pass this a stream at request time?
Dave
Sorry, I think I've been dumb. You need to do this dynamically, don't you? My solution above does it as a manual task, i.e. to make a one-off "copy" of your site.
JonoW
No worries. I elaborated on my original post a bit because of your response. :)
Dave
A: 

I've checked my source in on GitHub if you would like to see how I did this.

My solution isn't perfect, but it works for what I need it to do. Some problems that might arise are in the normalization step. HtmlAgilityPack does not emit XHTML, just HTML, so I only used it to find the src and href attributes that I wanted to replace, and then I replaced the found values in the original source with my normalized paths.
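
The gist of that step is something like this (simplified; normalizePath stands in for however you map a reference to its new location):

    using System;
    using HtmlAgilityPack;

    static class ReferenceNormalizer
    {
        public static string Normalize(string originalHtml, Func<string, string> normalizePath)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(originalHtml);

            string result = originalHtml;
            var nodes = doc.DocumentNode.SelectNodes("//*[@src or @href]");
            if (nodes == null) return result;

            foreach (var node in nodes)
            {
                foreach (var attrName in new[] { "src", "href" })
                {
                    string value = node.GetAttributeValue(attrName, string.Empty);
                    if (string.IsNullOrEmpty(value)) continue;

                    // Replace the value in the original markup rather than re-emitting
                    // the document, since HtmlAgilityPack outputs HTML, not XHTML.
                    result = result.Replace(value, normalizePath(value));
                }
            }
            return result;
        }
    }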

I've also encountered a bug with zip archiving, but I'm not sure what that issue is yet. If anyone has some improvements they would like to add, let me know.

Thanks

Dave