views:

44

answers:

4

Hi,

I have a site with some restricted content. I want my site to appear in search results, but I do not want the restricted content to become publicly accessible.

Is there a way to allow crawlers to crawl my site while preventing the content itself from becoming public?

The closest solution I have found is Google First Click Free, but even that requires me to show the content on the first click.

+3  A: 

Why do you want to allow people to search for a page that they can't access when they click the link? It's technically possible to make it difficult (check in your authentication code whether the user agent contains 'googlebot', though there is nothing stopping people from faking this user agent if they want your content badly enough), but it is largely pointless.
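
As a minimal sketch of the user-agent check mentioned above (assuming a Flask-style app; the route, template, and user_is_logged_in helper are hypothetical placeholders, not anything from the question):

    # Hypothetical view: serve full content when the user agent claims to be
    # Googlebot, require login otherwise. Note the header is trivially spoofed.
    from flask import Flask, abort, render_template, request

    app = Flask(__name__)

    def user_is_logged_in():
        return False  # placeholder for your real authentication check

    @app.route("/articles/<int:article_id>")
    def article(article_id):
        ua = request.headers.get("User-Agent", "").lower()
        if "googlebot" in ua or user_is_logged_in():
            return render_template("article.html", article_id=article_id)
        abort(403)  # everyone else is denied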

Also, Google's official line (IIRC; I can't find the source right now) is that you may be penalized for deliberately showing Googlebot different content from what human users see.

tobyodavies
+1 on the google policy note. They consider it spam (or worse) to show content to google that users don't see. "Don't deceive your users or present different content to search engines than you display to users". (from http://www.google.com/support/webmasters/bin/answer.py?answer=35769#quality )
Sean Reilly
+1  A: 

Not really.

You could set a cookie for requests coming from known search engines and allow those requests to access your content; however, that will not prevent people from spoofing their requests, or from using something like Google Translate to proxy the information out.
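
If you did try the known-search-engine approach, simple user-agent spoofing (though not the proxying problem) can be reduced with the reverse-and-forward DNS verification that Google documents for Googlebot. A rough sketch using Python's standard socket module (the function name is mine, not from the answer):

    import socket

    def is_real_googlebot(ip_address):
        """Check a claimed Googlebot request by reverse DNS, then forward DNS.

        Per Google's guidance, the reverse lookup should end in googlebot.com
        or google.com, and resolving that host should return the same IP.
        """
        try:
            host = socket.gethostbyaddr(ip_address)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip_address in socket.gethostbyname_ex(host)[2]
        except (socket.herror, socket.gaierror):
            return False

    # Only set the "trusted crawler" cookie when this check passes.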

Michael MacDonald
+1  A: 

Google Custom Search Engine has its own index (http://www.google.com/cse/manage/create), so you could basically push all your pages to Google Custom Search via on-demand indexing (http://www.google.com/support/customsearch/bin/topic.py?hl=en&topic=16792) and shortly thereafter block the real Googlebot from accessing them again and/or kick them out via Google Webmaster Tools.

But that would be a lot of hacking, and your content will probably escape into the wild at some point (or you will end up kicking it out of the on-demand index at some point).

And/or you could buy your own little Google (called Google Enterprise Search): http://www.google.com/enterprise/search/index.html. Then your own Google can access the content, but it won't become publicly available.

But reading your question again: that is probably not what you want, is it?

Franz
+2  A: 

You're pretty much locked into Google First Click Free. Your only other option is to risk violating their webmaster guidelines.

If you do use Google First Click Free, you can still protect some of your content. One way is to paginate longer articles or forum threads and disallow crawling of the additional pages. Users looking for the rest of the content can then be prompted to register for your site.

A more advanced way is to allow all of your content to be crawled and indexed. Through analytics, identify your most valuable content; then let Google know that you don't want the "additional" or ancillary pages crawled anymore (via rel= attributes, meta robots tags, X-Robots-Tag headers, etc.). Make sure you also noarchive those pages so people can't back-door access the content via the Google Cache. You effectively allow users to get the main content, but if they want to read more they'll have to register to gain access.
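
As a rough illustration of that last step (a Flask-style sketch with hypothetical routes and templates, not an exact recipe): the main pages stay indexable, while ancillary continuation pages get an X-Robots-Tag header with noindex, noarchive. The same directives could instead go in a <meta name="robots"> tag in the page markup.

    from flask import Flask, make_response, render_template

    app = Flask(__name__)

    @app.route("/articles/<int:article_id>")
    def article(article_id):
        # Main content: crawlable and indexable as usual.
        return render_template("article.html", article_id=article_id)

    @app.route("/articles/<int:article_id>/page/<int:page>")
    def article_page(article_id, page):
        # Continuation pages: tell crawlers not to index or cache (archive) them.
        response = make_response(
            render_template("article_page.html", article_id=article_id, page=page)
        )
        response.headers["X-Robots-Tag"] = "noindex, noarchive"
        return response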

This could be viewed as gray hat, since you're not really violating any of the webmaster guidelines, but you are creating an implementation that's not common. You're not serving up different content to users; you're explicitly telling Google what you do and do not want crawled, and you're protecting the value of your site at the same time.

Of course a system like this isn't that easy to automate, but if you look around, you'll see publications or certain forums / message boards doing something similar.

phaithful