I am bouncing between posting this here and on Superuser. Please excuse me if you feel this does not belong here.

I am observing the behavior described here: Googlebot is requesting random URLs on my site, such as aecgeqfx.html or sutwjemebk.html. I am sure that I do not link to these URLs from anywhere on my site.

I suspect this may be Google probing how we handle non-existent content. To quote an answer to the linked question:

 [google is requesting random urls to] see if your site correctly 
 handles non-existent files (by returning a 404 response header)

We have a custom page for non-existent content: a styled page saying "Content not found, if you believe you got here by error, please contact us", with a few internal links, served (naturally) with a 200 OK. The requested URL is served directly (there is no redirect to a single error URL).

I am afraid this may hurt the site's standing with Google: they may not interpret the user-friendly page as a 404 Not Found, and may conclude that we are trying to fake something or are serving duplicate content.

How should I proceed to ensure that Google does not think the site is bogus, while still showing a user-friendly message to users who follow dead links by accident?

+2  A: 

You can still send a 404 status and provide user-friendly messages for dead links in the same response. Even "normal users" should get the 404 status even if the page doesn't look like your typical failure page. How you intercept the request depends on your webserver. That's going to be a lot easier than detecting the user-agent and doing something different for Googlebot.
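As a rough illustration of sending the 404 status and the friendly page in the same response, here is a minimal Python sketch built on the standard library's http.server. The names KNOWN_PAGES, build_response, FriendlyHandler and serve are hypothetical and exist only for this example; a real site would do the equivalent in its own web server or framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical set of pages that really exist (an assumption for this sketch).
KNOWN_PAGES = {"/": "<h1>Home</h1>"}

# User-friendly body, using the wording quoted in the question.
NOT_FOUND_HTML = (
    "<h1>Content not found</h1>"
    "<p>If you believe you got here by error, please contact us.</p>"
    '<p><a href="/">Back to the home page</a></p>'
)


def build_response(path):
    """Return (status_code, html) for a request path."""
    if path in KNOWN_PAGES:
        return 200, KNOWN_PAGES[path]
    # Unknown URL: same friendly page, but with a real 404 status.
    return 404, NOT_FOUND_HTML


class FriendlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, html = build_response(self.path)
        body = html.encode("utf-8")
        self.send_response(status)  # the status goes into the response header
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the sketch locally; the port number is arbitrary."""
    HTTPServer(("127.0.0.1", port), FriendlyHandler).serve_forever()
```

Calling serve() and then requesting any made-up path should show the friendly page in a browser, while the response status is 404 rather than 200.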

brian d foy
+1  A: 

Use the ErrorDocument directive in Apache:

ErrorDocument 500 http://foo.example.com/cgi-bin/tester
ErrorDocument 404 /cgi-bin/bad_urls.pl
ErrorDocument 401 /subscription_info.html
ErrorDocument 403 "Sorry can't allow you access today"

The error document can be whatever you like. For example, if you are using PHP, you can create a file called error404.php like this:

<?php
// Send the 404 status line before any output.
header("HTTP/1.0 404 Not Found");

// The body can be any user-friendly content.
echo 'Hi, this page does not exist...<img src="nice-logo.png" alt="logo" />';
?>

The only important thing is that the response includes a correct 404 status code in the header, whether it is sent by Apache, PHP, or any other dynamic script.
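One way to check that the header really carries a 404 is to request a random, non-existent URL yourself, much like the probe described in the question. The following Python sketch does that with urllib from the standard library; the helper names random_path, status_of and handles_missing_pages are made up for this example, and the base URL you pass in is your own site.

```python
import random
import string
import urllib.error
import urllib.request


def random_path(length=10):
    """Build a random path such as /sutwjemebk.html that should not exist."""
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return "/" + name + ".html"


def status_of(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is on the exception.
        return err.code


def handles_missing_pages(base_url):
    """True if a random, non-existent URL under base_url answers with 404."""
    return status_of(base_url.rstrip("/") + random_path()) == 404
```

If handles_missing_pages returns False for your site, the custom error page is most likely still being served with a 200 OK.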

Example of a funny 404 page: http://www.northernbrewer.com/brewing/weekly_fermenterd

PHP_Jedi
+5  A: 

The best practice would be to return the user friendly 404 page with a 404 response code, not a 200. Your web server should handle this for you relatively easily.

JacobM
Thanks, I did not know about this. I am going to learn how to return a 404 while still serving some content.
Marek
How to return 404 status code while serving content in ASP.NET MVC: Response.StatusCode = 404; Response.TrySkipIisCustomErrors = true; return View();
Marek