views:

174

answers:

3

Hi, after some Google searching, I did not find anything that fits my need. I want to save the current web page exactly as it is. Many web pages have JavaScript executed and CSS changed, so after some user interaction the page may differ from the one that was first loaded into the browser. I want to save the current state of the page to the server and render it on the server. Is there any JavaScript library for this task? Thanks!

A: 

Serializing a complete web page is as simple as:

var serialized = document.body.innerHTML;

If you really need the full document, including the head, then:

var serialized =
    '<head>' +
        document.getElementsByTagName('head')[0].innerHTML +
    '</head><body>' +
        document.body.innerHTML +
    '</body>';

Now all you need to do is submit it via AJAX.
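The two steps above (serialize, then submit) can be sketched roughly as follows. This is only an illustration under some assumptions: the names `serializePage` and `savePage`, the `/save-page` endpoint, and the JSON payload shape are hypothetical, `fetch` stands in for whatever AJAX helper you already use, and `outerHTML` is assumed to be available in the target browser.

```javascript
// Serialize the live DOM, including the doctype and the <html> element.
// `doc` is normally `document`; it is a parameter so the function can be
// exercised outside a browser.
function serializePage(doc) {
  var doctype = doc.doctype ? '<!DOCTYPE ' + doc.doctype.name + '>\n' : '';
  return doctype + doc.documentElement.outerHTML;
}

// POST the serialized markup to the server as JSON.
// `url` is a hypothetical endpoint such as '/save-page';
// `send` defaults to the global fetch and can be swapped for another helper.
function savePage(doc, url, send) {
  send = send || fetch;
  return send(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ html: serializePage(doc) })
  });
}

// In a browser, after the user has interacted with the page:
//   savePage(document, '/save-page');
```

Relative URLs for stylesheets, scripts, and images are sent as they appear in the page, so the server-side renderer needs to resolve them against the original page address.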

As for server-side rendering, it depends on what you mean by rendering. I'm currently using wkhtmltopdf to implement a 'save as PDF' feature on my site. It uses WebKit to render the HTML before generating the PDF, so it fully supports CSS and JavaScript.

And if you need to save it as an image instead of a PDF file, you can always use Ghostscript to print the PDF to a JPG/PNG file.

slebetman
Ah, but does element.innerHTML contain the style information of the element?
Yang Bo
Added method to get content of `<head>` as well.
slebetman
But what if some elements' styles have been changed by JavaScript after some user interaction? This is really annoying...
Yang Bo
Then just read innerHTML **after** the user interaction. innerHTML asks the browser to serialize its current DOM, so it is not the same as View Source.
slebetman
wkhtmltopdf can convert any web page to a PDF file. Is there a similar tool that can convert a web page into a PNG/JPG image?
Yang Bo
Not that I'm aware of. However, you can use Ghostscript to 'print' the PDF file to JPG: `gs -sDEVICE=jpeg -o out-image.jpg webpage.pdf`
slebetman
A: 

Even simpler:

var serialized = document.documentElement.innerHTML;

outerHTML instead of innerHTML would be better, because it also includes the `<html>` tag itself, but it doesn't work in Firefox before version 11.

Let's test it.

>>> document.body.style.color = 'red';
>>> document.documentElement.innerHTML
...
<body style="color: red;">
...
NV
I think this solution has the same problem: when some elements' styles are changed by JavaScript, innerHTML/outerHTML can't reflect those changes...
Yang Bo
No, they do reflect them.
NV
I ran `document.body.color='red';document.documentElement.innerHTML;` in the Chrome console, and I got `<head></head><body></body>`. Did you use Firefox or IE?
Yang Bo
`document.body.style.color`, not `document.body.color`. My bad.
NV
Yeah, it works in Chrome/Firefox/IE, thanks!
Yang Bo
A: 

I'm working on something rather similar and wanted to share a summary of what I'm noticing with innerHTML in IE8, FF3.6, and Chrome 5.0:

IE

  • Strips the quotes from around many of the element attributes
  • Singleton nodes (such as `<br>` or `<img>`) aren't self-closed
  • If the values on the elements change after the HTML has been loaded, it picks up the new values

FF, Chrome

  • Singleton nodes (such as `<br>` or `<img>`) aren't self-closed
  • If the values on the elements change after the HTML has been loaded, it does NOT pick up the new values. It only picks up the default values set in the HTML upon initial rendering.
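One possible workaround for the FF/Chrome behaviour, assuming the "values" in question are form-field values: copy each control's live state into its attributes before reading innerHTML. The sketch below is illustrative only; `syncFormState` and `toggleAttr` are hypothetical helper names, not part of any library.

```javascript
// Copy the live state of form controls into attributes/markup so that a
// later innerHTML/outerHTML read includes it. `root` is normally `document`.
function syncFormState(root) {
  var fields = root.querySelectorAll('input, textarea, select');
  for (var i = 0; i < fields.length; i++) {
    var el = fields[i];
    var tag = el.tagName.toLowerCase();
    if (tag === 'textarea') {
      // A textarea's serialized value is its text content.
      el.textContent = el.value;
    } else if (tag === 'select') {
      for (var j = 0; j < el.options.length; j++) {
        toggleAttr(el.options[j], 'selected', el.options[j].selected);
      }
    } else if (el.type === 'checkbox' || el.type === 'radio') {
      toggleAttr(el, 'checked', el.checked);
    } else {
      // Note: this also writes password fields into the markup.
      el.setAttribute('value', el.value);
    }
  }
}

// Add or remove a boolean attribute to match the given state.
function toggleAttr(el, name, on) {
  if (on) {
    el.setAttribute(name, '');
  } else {
    el.removeAttribute(name);
  }
}

// In a browser:
//   syncFormState(document);
//   var serialized = document.documentElement.innerHTML;
```

Because this changes the attributes of the live page, it alters what a form reset would restore; if that matters, run it on a clone of the DOM instead.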
Zoey