
Hi, I'm reading this article today. To be honest, I'm really interested in the point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET".

I want to check whether I've understood it correctly :)

I create a PHP script (gethtmlsnapshot.php) that includes the server-side AJAX page (getdata.php) and escapes the parameters (for security). Then I add it at the end of the static HTML page (index-movies.html). Right? Now...

1 - Where do I put that gethtmlsnapshot.php? In other words, I (or rather, the crawler) need to call that page. But if there is no link to it on the main page, the crawler can't reach it :O How can the crawler call the page with the _escaped_fragment_ parameters? It can't know them if I don't specify them somewhere :)

2 - How can the crawler call that page with the parameters? As before, I need links to that script with the parameters, so the crawler can browse each page and save the content of the dynamic result.

Can you help me? And what do you think about this technique? Wouldn't it be better if the crawler developers built their bots some other way? :)

Let me know what you think. Cheers

+2  A: 

I think you got something wrong, so I'll try to explain what's going on here, including the background and alternatives, as this is a very important topic that most of us stumble upon (or at least something similar) from time to time.

Using AJAX, or rather asynchronous incremental page updating (because most pages actually don't use XML but JSON), has enriched the web and provided a great user experience.

It has however also come at a price.

The main problem was clients that didn't support the XMLHttpRequest object, or JavaScript at all. In the beginning you had to provide backwards compatibility. This was usually done by providing normal links, capturing the onclick event, and firing an AJAX call instead of reloading the page (if the client supported it).

Today almost every client supports the necessary functions.

So the problem today is search engines, because they don't. Well, that's not entirely true, because they partly do (especially Google), but for other purposes. Google evaluates certain JavaScript code to prevent black-hat SEO (for example, a link pointing somewhere but with JavaScript opening some completely different page, or HTML keyword content that is invisible to the client because it is removed by JavaScript, or the other way round).

But to keep it simple, it's best to think of a search engine crawler as a very basic browser with no CSS or JS support (it's the same with CSS: it's partly parsed, but only for special reasons).

So if you have "AJAX links" on your website, and the web crawler doesn't support following them using JavaScript, they just don't get crawled. Or do they? Well, the answer is that JavaScript links (like document.location = ...) do get followed; Google is often intelligent enough to guess the target. But AJAX calls are not made, simply because they return partial content, and no meaningful whole page can be constructed from it, as the context is unknown and the unique URI doesn't represent the location of the content.

So there are basically 3 strategies to work around that.

  1. have an onclick event on the links, with a normal href attribute as fallback (imo the best option, as it solves the problem for clients as well as search engines)
  2. submit the content pages via your sitemap so they get indexed, but completely apart from your site links (usually pages provide a permalink to these URLs so that external pages can link to them for PageRank)
  3. the AJAX crawling scheme

The idea is to have your JavaScript XMLHttpRequest calls paired with corresponding href attributes that look like this: www.example.com/ajax.php#!key=value

So the link looks like:

<a href="http://www.example.com/ajax.php#!page=imprint" onclick="handleajax()">go to my imprint</a>

The function handleajax() could evaluate the document.location variable to fire the incremental asynchronous page update (and return false so the browser doesn't follow the href). It's also possible to pass an id or URL or whatever.

The crawler, however, recognises the AJAX crawling scheme format and automatically fetches http://www.example.com/ajax.php?_escaped_fragment_=page=imprint instead of http://www.example.com/ajax.php#!page=imprint. The query string then contains the hash fragment, from which you can tell which partial content has been requested. So you just have to make sure that http://www.example.com/ajax.php?_escaped_fragment_=page=imprint returns a full page that looks exactly like the page should look to the user after the XMLHttpRequest update has been made.
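
As a rough sketch of how that server side could look (the renderFullPage()/renderPartial() helpers are made-up names, not from this answer), ajax.php might branch on the escaped fragment like this:

    <?php
    // ajax.php - minimal sketch of the AJAX crawling scheme on the server side.
    // Google's scheme: the crawler rewrites  ajax.php#!page=imprint
    // into                                   ajax.php?_escaped_fragment_=page=imprint

    // Hypothetical helpers: build the partial fragment and the full page snapshot.
    function renderPartial($page) {
        return '<div id="content">' . htmlspecialchars($page) . ' content</div>';
    }
    function renderFullPage($page) {
        return '<html><body><h1>My site</h1>' . renderPartial($page) . '</body></html>';
    }

    if (isset($_GET['_escaped_fragment_'])) {
        // Crawler request: parse the fragment and return the FULL page,
        // exactly as it would look after the client-side update.
        parse_str($_GET['_escaped_fragment_'], $params);   // e.g. array('page' => 'imprint')
        $page = isset($params['page']) ? $params['page'] : 'home';
        echo renderFullPage($page);
    } else {
        // Normal XMLHttpRequest from the browser: return only the partial content.
        $page = isset($_GET['page']) ? $_GET['page'] : 'home';
        echo renderPartial($page);
    }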

A very elegant solution is also to pass the a element itself to the handler function, which then fetches the same URL as the crawler would have fetched, via AJAX but with additional parameters. Your server-side script then decides whether to deliver the whole page or just the partial content.
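
A minimal sketch of that variant (the "partial" flag is again a made-up name for illustration): the onclick handler requests the same ajax.php URL but appends something like partial=1, and the script branches on it:

    <?php
    // ajax.php variant - one script serves both the crawler and the onclick handler.
    // Crawler:  ajax.php?_escaped_fragment_=page=imprint    -> full page
    // Browser:  ajax.php?page=imprint&partial=1 (via AJAX)  -> fragment only

    if (isset($_GET['_escaped_fragment_'])) {
        parse_str($_GET['_escaped_fragment_'], $params);
        $page = isset($params['page']) ? $params['page'] : 'home';
    } else {
        $page = isset($_GET['page']) ? $_GET['page'] : 'home';
    }

    // The crawler never sends the hypothetical "partial" flag, so it always gets the
    // full page, while the browser's asynchronous call only receives the updated fragment.
    echo isset($_GET['partial'])
        ? renderPartial($page)      // helpers as in the sketch above
        : renderFullPage($page);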

It's a very creative approach indeed, and here comes my personal pro/con analysis:

pro:

  • partially updated pages receive a unique identifier, at which point they are fully qualified resources in the semantic web
  • partially updated pages receive a unique identifier that can be presented by search engines

con:

  • it's just a fallback solution for search engines, not for clients without JavaScript
  • it provides opportunities for black-hat SEO, so Google for sure won't adopt it fully, or rank pages using this technique highly, without proper verification of the content

conclusion:

  • just using normal links with working legacy href attributes, plus an onclick handler, is a better approach, because it provides functionality for old browsers.

  • the main advantage of the AJAX crawling scheme is that partially updated pages get a unique URI, and you don't have to create duplicate content that somehow serves as the indexable and linkable counterpart.

  • you could argue that an AJAX crawling scheme implementation is more consistent and easier to implement. I think this is a question of your application design.

Joe Hopfgartner
OHHH! Now it's clear how it works! :) Thanks a lot, man!! I didn't understand how the crawler got the links from the pages. Now I understand: I put them in href attributes. Yeah, that should work! That's a great workaround!! But I see it as a nice trick, not a real solution for SEO with web 2.0. :) But OK, it seems to work! Only one last thing: this will only work if the browser evaluates the onclick event before the href (but I think they all do). Thanks Joe, you really helped me :)
markzzz
And sorry for my crap English :)
markzzz