views: 157
answers: 3

One section of our website presents paged, randomized content. The first time a new user visits the site, she is assigned a new random seed, which is passed in the URLs and, for persistence, also stored in a cookie. The problem is that the seed in the URLs confuses Googlebot (and other indexing services); it complains that there are too many URLs pointing to the same content. It would be possible for us not to pass the seed in the URLs, but even if we used only cookies, it seems to me that at some point we would still have to decide whether the visitor is an indexing spider or a human in order to present the content in a non-randomized fashion.
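For illustration, here is a minimal sketch of the seed handling described above, written in Python with Flask purely as an example stack; the cookie name, parameter name, and page layout are assumptions, not details from the actual site:

    # Hypothetical sketch: reuse the visitor's seed from a cookie or a "seed"
    # query parameter, otherwise assign a new random one and persist it.
    import random

    from flask import Flask, request, make_response

    app = Flask(__name__)

    ITEMS = ["item-%d" % i for i in range(100)]  # placeholder content
    PAGE_SIZE = 10

    @app.route("/browse")
    def browse():
        seed = request.cookies.get("seed") or request.args.get("seed")
        if seed is None:
            seed = str(random.randrange(2 ** 31))

        # Shuffle deterministically for this visitor, then serve one page.
        items = list(ITEMS)
        random.Random(seed).shuffle(items)
        page = int(request.args.get("page", 1))
        start = (page - 1) * PAGE_SIZE

        resp = make_response("\n".join(items[start:start + PAGE_SIZE]))
        resp.set_cookie("seed", seed)  # persistence across visits, as described
        return resp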

My main question is: how bad would it be in this case to detect the most common indexing spiders and serve them the content in a non-randomized fashion? I know that the number one rule of search engine optimization is not to optimize and, if anything, to optimize for users and make sure that the content is the same for everybody. But in this case, we would not actually be changing the content or hiding anything.

Has anybody faced the same problem? What are the best practices for dealing with this issue?

+1  A: 

It depends on the site structure, but you might benefit from just editing your robots.txt file to keep bots away from potentially confusing URLs. One more option is generating a Google Sitemap (when we say search engine, we normally mean Google).
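For example, assuming the seed travels in a query parameter named seed (a hypothetical name; substitute whatever the site actually uses), a robots.txt along these lines would keep well-behaved crawlers away from the seeded URLs and point them at a sitemap:

    User-agent: *
    Disallow: /*?seed=
    Disallow: /*&seed=

    Sitemap: http://www.example.com/sitemap.xml

Note that wildcard matching in Disallow rules is an extension honored by the major engines rather than part of the original robots.txt specification, so smaller crawlers may ignore it.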


Álvaro G. Vicario
+1  A: 

My main question is: how bad would it be in this case to detect the most common indexing spiders and serve them the content in a non-randomized fashion?

Most (legitimate/search engine) bots set their user-agent correctly, so it is very easy to do something like this: you just need to check the User-Agent HTTP request header and react accordingly...
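A minimal sketch of such a check in Python (the crawler tokens below are an illustrative, incomplete sample, and the helper name is made up; a real deployment needs a maintained list):

    # Hypothetical helper: decide whether a request comes from a known crawler
    # by looking for well-known tokens in the User-Agent header.
    KNOWN_CRAWLER_TOKENS = (
        "Googlebot",      # Google
        "Bingbot",        # Bing
        "Slurp",          # Yahoo
        "DuckDuckBot",    # DuckDuckGo
        "Baiduspider",    # Baidu
        "YandexBot",      # Yandex
    )

    def is_indexing_spider(user_agent):
        """Return True if the User-Agent string contains a known crawler token."""
        ua = (user_agent or "").lower()
        return any(token.lower() in ua for token in KNOWN_CRAWLER_TOKENS)

    # Example: give crawlers a fixed seed so the ordering they see is stable.
    # seed = "0" if is_indexing_spider(request.headers.get("User-Agent", "")) else visitor_seed

Keep in mind that the User-Agent header can be spoofed; for serving a stable ordering that is harmless, and the major engines document ways to verify their crawlers (e.g. reverse DNS lookups) if stricter checks are ever needed.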

Whether this solution is the best one, I'm not qualified to debate.

List of User Agents.

Matthew Scharley
+1  A: 

You are emulating cookie behavior using a query parameter. I thought that practice ended a long time ago. Best practice now is to use cookies for users who require a session and to let other users browse your site anonymously.

Or perhaps you are running a site with a lot of paranoid users who don't want to be tracked and have therefore turned off cookies; they probably don't want to be tracked via the URL either.

If a user is logged in, they must have cookies enabled, no exceptions. If a user is not logged in, they may look at your content but should not be tracked.

One trouble with having the session in your URLs is that users copy and paste links far more now than they used to, so even if you detect search engines, you might still end up with inbound links that carry this session information.

If you really want to solve the problem, adding an XML sitemap and detecting spiders might be acceptable solutions, but spider detection requires a lot of work to keep up to date:

"Why are we not included in Bing?" - ooh I forgot to add that search engine.

"Why are we not included in Google anymore" - ohh I didn't know google had a new datacenter.

Erwin
As I said above, we can live without the seed in the URL since we keep it in cookies anyway. It is in the URL more for historical reasons (it was there first), and I guess the reason for keeping it around now is to allow users to send each other links pointing to the same content. But again, it's not anything crucial.
Jan Zich
So the query parameter changes some of the content on the page? It is not just there to track users, then. That is probably why search engines have trouble identifying it as a session parameter; query parameters detected as session parameters are usually ignored. I would think it's a good business decision to drop this legacy behavior. If you decide to keep the URLs, then I would try the XML sitemap solution.
Erwin
Yes, I agree - we will probably do it. But it won’t solve the original problem. We will need to somehow detect spiders and make sure that they get non-randomized content.
Jan Zich