Real World Problem:

I have my app hosted on Heroku, which (to my knowledge) cannot run a Headless (GUI-less) Browser - such as HTMLUnit - for generating HTML Snapshots so that Googlebot can index my AJAX content.

My Proposed Solution:

If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.

Imagine I have:

  • a Sinatra app hosted on Heroku on the domain http://example.com
  • the app has tabs along the top of the page TabA, TabB and TabC
  • under each tab is SubTab1, SubTab2, SubTab3
  • on load, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, then client-side JavaScript reads location.hash and loads the TabA, SubTab3 content via AJAX.

Note: the Hash Bang (#!) is part of Google's spec.
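To make the mapping concrete, here is a minimal sketch (in Java, since HTMLUnit is a Java library) of the transformation Googlebot applies to a hash-bang URL. The class and method names are mine, and the escaping is simplified to the two characters that matter in this example ('%' and '&'); the real spec escapes a few more:

```java
public class EscapedFragment {
    // Simplified version of the escaping Google's spec applies to the
    // fragment: '%' and '&' are percent-encoded ('=' is left as-is,
    // which is why tab=TabA survives but the '&' becomes %26).
    // '%' must be escaped first so already-escaped bytes aren't doubled.
    static String escapeFragment(String fragment) {
        return fragment.replace("%", "%25").replace("&", "%26");
    }

    // Turn a "pretty" #! URL into the "ugly" URL Googlebot will actually fetch.
    static String toCrawlerUrl(String prettyUrl) {
        int bang = prettyUrl.indexOf("#!");
        if (bang < 0) return prettyUrl;            // no AJAX state to expose
        String base = prettyUrl.substring(0, bang);
        String state = prettyUrl.substring(bang + 2);
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + escapeFragment(state);
    }

    public static void main(String[] args) {
        System.out.println(toCrawlerUrl("http://example.com#!tab=TabA&subtab=SubTab3"));
        // → http://example.com?_escaped_fragment_=tab=TabA%26subtab=SubTab3
    }
}
```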

I would like to build a simple "web service" hosted on Google App Engine (GAE) that:

  1. Accepts a URL param, e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (the url param should be URL-encoded).
  2. Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and execute the client-side JavaScript on the server.
  3. HTMLUnit returns the DOM once everything is complete (or after something like 45 seconds has passed).
  4. The returned content could be sent back via JSON/JSONP, or alternatively a URL is returned to a file generated and stored on the Google App Engine server (for file-based "cached" results)... I'm open to suggestions here. If a URL to a file were returned, you could cURL it to get the source code (a.k.a. an HTML Snapshot).
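A sketch of how the service in steps 1-3 might begin, assuming a plain Java entry point. The servlet plumbing is omitted, the names are mine, and the HTMLUnit calls are shown only as comments since that library isn't part of the JDK:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class SnapshotService {
    // Step 1: decode the URL-encoded "url" query parameter back into the
    // target hash-bang URL the service should render. "snapshotParam" is a
    // hypothetical name for the raw value of ?url=...
    static String decodeTargetUrl(String snapshotParam) throws UnsupportedEncodingException {
        return URLDecoder.decode(snapshotParam, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String target = decodeTargetUrl(
                "http%3A%2F%2Fexample.com%23%21tab%3DTabA%26subtab%3DSubTab3");
        System.out.println(target);  // http://example.com#!tab=TabA&subtab=SubTab3

        // With HTMLUnit on the classpath, steps 2-3 would look roughly like:
        //   WebClient client = new WebClient();
        //   client.setAjaxController(new NicelyResynchronizingAjaxCallController());
        //   HtmlPage page = client.getPage(target);
        //   client.waitForBackgroundJavaScript(45000); // the ~45-second budget
        //   String snapshot = page.asXml();            // the DOM after AJAX ran
        // ...then return `snapshot` (or a URL to a stored copy) as JSON/JSONP.
    }
}
```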

My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:

  1. Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
  2. Send a request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (the url param should be URL-encoded).
  3. Render the returned HTML Snapshot to the frontend.
  4. Google Indexes the content and we rejoice!
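Step 1 above is the reverse of the crawler's transformation: unescape the _escaped_fragment_ value and rebuild the #! URL to forward to the snapshot service. A minimal sketch (names are mine; only %26 and %25 are unescaped, matching the two characters the example URLs actually use):

```java
public class FragmentCatcher {
    // Reverse the crawler's escaping: turn the _escaped_fragment_ value
    // back into the original fragment. %26 is handled before %25 so a
    // literal '%' isn't created early and then re-interpreted.
    static String unescapeFragment(String escaped) {
        return escaped.replace("%26", "&").replace("%25", "%");
    }

    // Given the query string Googlebot sent, rebuild the #! URL that the
    // snapshot service should be asked to render. "baseUrl" is assumed to
    // be the page's own address without query or fragment.
    static String toPrettyUrl(String baseUrl, String queryString) {
        String marker = "_escaped_fragment_=";
        int i = queryString.indexOf(marker);
        if (i < 0) return baseUrl;             // a normal, non-crawler request
        String state = queryString.substring(i + marker.length());
        return baseUrl + "#!" + unescapeFragment(state);
    }

    public static void main(String[] args) {
        System.out.println(toPrettyUrl("http://example.com",
                "_escaped_fragment_=tab=TabA%26subtab=SubTab3"));
        // → http://example.com#!tab=TabA&subtab=SubTab3
    }
}
```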

I don't have any experience with Google App Engine or Java or HTMLUnit.

I might be able to figure it out... and will post my results if I do.

Otherwise I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.

This will introduce more people to the excellent (and free!) Google App Engine. Also it will undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!

As Google's specification gains more acceptance the "hurdle" of setting up a Headless Browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).

Hit me up on twitter @i_chris_jacob if you would like to discuss solutions.

+1  A: 

I have successfully used HTMLunit on AppEngine. My GWT code to do this is available in the gwt-platform project; the results I got were similar to those of the HTMLunit-AppEngine test application by Amit Manjhi.

It should be relatively easy to use GWTP's current HTMLunit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that AppEngine requests have a 30-second timeout, so you can't serve a page that takes HTMLunit longer than that to process.

Philippe Beaudoin
So HtmlUnit works on GAE now? Any caveats which you know of?
Matt H
I still have a problem accessing my own application with HTMLunit, which makes it hard for an app to serve itself to the crawler. The details of my issue are subtle but I describe them here (http://bit.ly/bViIMr). I haven't tested this in a while, so maybe the problem went away.
Philippe Beaudoin
On Amit Manjhi's test app it seems to work fine with the same URL. Maybe this has fixed itself, or maybe it depends on a multitude of factors.
Matt H
Could be. I wondered for a while if it wasn't some limitation of the free AppEngine account that wouldn't spawn two Servlets in frequent succession.
Philippe Beaudoin
Thanks Philippe and Matt. I will investigate GWTP and see what results I can come up with. If either of you are interested in working with me on this, I think a Headless Browser as a Web Service is an interesting project (and shouldn't be very hard with GAE and GWTP). Today I've been spec'ing out my own solution for building crawlable, accessible, deep-linked AJAX web apps, leveraging a Headless Browser to generate HTML Snapshots when the client does not have JavaScript... "Headless AJAX" I think I'll call it ;). More info soon.
Chris Jacob
Chris, there is an open issue on GWTP regarding a module to make App Engine based GWTP apps crawlable. It's blocked on the bug I described above but my latest idea, following the proposal here, is to cut the Gordian knot by providing an easy way to build your own Web Service. Maybe you'd like to contribute on this? (Issue and discussion is at: http://code.google.com/p/gwt-platform/issues/detail?id=1)
Philippe Beaudoin
A: 

My question is: how do you do step one of your solution, i.e. catch the Googlebot call? I'm not using a framework; I'm running an AJAX application on Apache. The only thing I've come up with so far is to run a reverse proxy and check every incoming URL. Is there an easier way?

Brian McMahan
See http://code.google.com/web/ajaxcrawling/docs/getting-started.html - read the step-by-step guide, point 2: "Set up your server to handle requests for URLs that contain _escaped_fragment_". The crawler will modify each AJAX URL such as www.example.com/ajax.html#!key=value to temporarily become www.example.com/ajax.html?_escaped_fragment_=key=value. On your end, all you need to do is handle "?_escaped_fragment_=" requests.
Chris Jacob
That doesn't answer my question... the last sentence in your post, "On your end all you need to do is handle '?_escaped_fragment_=' requests", is exactly the 'how' I am asking about. The only two solutions I can come up with are integrating server-side code (PHP) or a reverse proxy with Apache. I'm just curious how other people went about handling it.
Brian McMahan