Our web analytics package includes detailed information about users' activity within a page, and we show click/scroll/interaction visualizations in an overlay atop the web page. Currently this is an iframe containing a live rendering of the page.

Since pages change over time, older data no longer corresponds to the current layout of the page. We would like to run a spider to occasionally take snapshots of the pages, allowing us to maintain a record of interactions with various versions of the page.

We have a working implementation of this (Linux), but the snapshot process is a hideous Python/JavaScript/HTML hack which opens a Firefox window, then screenshots, scrolls, merges the pieces, and saves the result to a file. This requires us to install the X stack on our normally headless servers, and takes over a minute per page.

We would prefer a headless implementation with performance closer to that of the rendering time in a regular web browser, but haven't found anything.

There's some movement towards building something using Mozilla source as a starting point, but that seems like overkill to me, as well as a maintenance nightmare if we try to keep it up to date.

Suggestions?

+1  A: 

An article on Digital Inspiration points towards CutyCapt, which is cross-platform and uses the WebKit rendering engine, as well as IECapt, which uses the present IE rendering engine and requires Windows, natch. Nothing comes to mind off the top of my head which uses Gecko, Firefox's rendering engine.

I doubt you're going to be able to get away from X, however. Since CutyCapt requires Qt, it requires either X or a Windows installation. And, similarly, IECapt will require Windows (or Wine if you want to try to run it under Linux, and then you're back to needing X). I doubt you'll be able to find a rendering engine which doesn't require Qt, Gtk, GDI, or Cocoa, and therefore requires a full install of display libraries.

Conspicuous Compiler
It works with Xvfb.
jrockway
@jrockway: I'm not sure what your antecedent is, but I think you might be missing the point. The objection here isn't that a physical screen is needed (it isn't), but that a massive number of additional libraries supporting graphical interfaces must be installed on a machine which is otherwise used only for terminal services.
Conspicuous Compiler
A: 

Why not store the HTML that is sent out to the client? You could then redisplay it in a web browser to show what the page looked like.

Using your web analytics data about user actions, you could then default the combo boxes, fields, etc. to the values the client would have had, and even change the CSS on buttons, etc., to mark them as being pushed.

As a benefit, you don't need the X stack, don't need to do any crawling or storing of images.

EDIT (Re Andrew Moore):

This is where you store the current CSS/images under a version number. Place an easily parsable version number in a comment in the HTML. If you change your CSS/images and reuse the existing names, increment the version number in the HTML output sent out.

The system that stores the HTML will know that it needs to grab a new copy and store under a new number. When redisplaying, it simply uses the version number to determine which CSS/image set to use.
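
To make the versioning idea concrete, here is a minimal Python sketch. The comment format (`<!-- asset-version: N -->`) and the names `asset_version` and `SnapshotStore` are made up for illustration, not part of any existing system, and the store is kept in memory only to keep the example short.

```python
import re

# Hypothetical marker format, e.g. <!-- asset-version: 12 -->
VERSION_RE = re.compile(r"<!--\s*asset-version:\s*(\d+)\s*-->")


def asset_version(html):
    """Return the version number embedded in the page, or None if absent."""
    match = VERSION_RE.search(html)
    return int(match.group(1)) if match else None


class SnapshotStore:
    """Keeps one copy of the page HTML per asset version (in memory here)."""

    def __init__(self):
        self.pages = {}

    def record(self, html):
        """Store html if its version is new; return the version it belongs to."""
        version = asset_version(html)
        if version is not None and version not in self.pages:
            self.pages[version] = html
        return version
```

Each recorded interaction would then only need to carry the version number returned by `record`, which is enough to pick the matching CSS/image set when replaying.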


We currently have a system here that uses a very similar approach so we can track users' actions and provide better support when they call our help desk, as the help desk can bring up the user's session and follow what they did, even somewhat live.

You can even code it to auto-censor sensitive fields when it is stored.
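
As a rough illustration of censoring at storage time, the sketch below uses Python's standard `html.parser` to re-emit a page with the `value` of sensitive `<input>` elements masked. The field names in `SENSITIVE_NAMES` and the `[redacted]` placeholder are assumptions for the example, and it deliberately ignores other places a secret could live (for instance `<textarea>` content or inline scripts).

```python
from html import escape
from html.parser import HTMLParser

SENSITIVE_NAMES = {"password", "ssn", "card_number"}  # hypothetical field names


class Redactor(HTMLParser):
    """Re-emits the HTML it is fed, masking values of sensitive inputs."""

    def __init__(self):
        # Keep character references as written so text round-trips unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _tag(self, tag, attrs, closer):
        lookup = dict(attrs)
        sensitive = tag == "input" and (
            (lookup.get("type") or "").lower() == "password"
            or (lookup.get("name") or "").lower() in SENSITIVE_NAMES
        )
        parts = [tag]
        for name, value in attrs:
            if sensitive and name == "value":
                value = "[redacted]"
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def redact(html):
    """Return html with sensitive input values replaced by a placeholder."""
    parser = Redactor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running `redact` on the HTML just before it is written to the store keeps the raw values out of the database entirely, rather than relying on hiding them at display time.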

Dan McGrath
That works until the day they change their layout and their css/images drastically.
Andrew Moore
Considering your edit. Now you have the added problem of parsing the files and correcting any relative/absolute paths so they display properly. The image route is simply the easiest.
Andrew Moore
That is correct, but it is not that difficult. I fail to see how rendering a page and taking an image of it is really the easiest way. At worst, you could store all the CSS with each user session, and just make sure that if you change an image, you also change its name. Or just make sure you reference everything via a relative path in the first place, which means you don't need to change the paths in the HTML at all if you serve it correctly. We did it here and, aside from some initial DB issues, it works like a charm.
Dan McGrath
+3  A: 

I use wkhtmltopdf for this. It needs an X server, but Xvfb suffices, so it is technically headless.
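
For anyone wanting to script this, a minimal Python sketch follows. It assumes `xvfb-run` (the wrapper script shipped with Xvfb on many distributions) and `wkhtmltopdf` are both on `PATH`; the function names and the 120-second timeout are arbitrary choices for the example.

```python
import shutil
import subprocess


def snapshot_command(url, out_path):
    """Build the argv for rendering url to a PDF under a throwaway X server."""
    # -a asks xvfb-run to pick a free display number automatically.
    return ["xvfb-run", "-a", "wkhtmltopdf", url, out_path]


def take_snapshot(url, out_path, timeout=120):
    """Render url to out_path; raises CalledProcessError if rendering fails."""
    if shutil.which("xvfb-run") is None or shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("xvfb-run and wkhtmltopdf must be on PATH")
    subprocess.run(snapshot_command(url, out_path), check=True, timeout=timeout)
```

A spider would call something like `take_snapshot("http://example.com/", "page.pdf")` once per page; no real display or desktop session is involved, though the X libraries still have to be installed.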

jrockway
We worked up a hack prototype based on this code, and it looks like it will work. Thanks for the pointer!
ryandenki
A: 

Depending on the specifics of your needs, perhaps you could get away with using one of the many free webpage thumbnail services? snapcasa, for example, lets you generate thousands per month at no charge and with no advertising. (I have never used it; I just googled 'free thumbnail service' to find this.)

Just a thought.

Scott Evernden