views: 312
answers: 4

I'm doing very rudimentary tracking of page views by logging URLs, referral codes, sessions, times, etc., but I'm finding it's getting bombarded with robots (Google, Yahoo, etc.). What is an effective way to filter these out, or to avoid logging them in the first place?

I've experimented with robot IP lists etc but this isn't foolproof.

Is there some kind of robots.txt, .htaccess, server-side PHP, JavaScript or other method that can "trick" robots or ignore non-human interaction?

A: 

Well, the robots will all use a specific user-agent, so you can just disregard those requests.

But also, if you just use a robots.txt and deny them from visiting, that will work too.
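
For instance, a minimal PHP sketch of skipping the logging call when the user-agent matches a few well-known crawler strings (the log_page_view() helper and the exact signature list are assumptions; adjust to your own tracker):

<?php
// Substrings found in the user-agent headers of the major crawlers (example list).
$botSignatures = array('Googlebot', 'Slurp', 'msnbot', 'Baiduspider');

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$isBot = false;
foreach ($botSignatures as $signature) {
    if (stripos($userAgent, $signature) !== false) {
        $isBot = true;
        break;
    }
}

// Only record the hit for (probable) humans.
if (!$isBot) {
    log_page_view($_SERVER['REQUEST_URI'], session_id()); // hypothetical: your existing tracking code
}
?>

The list will inevitably need topping up as new crawlers appear, but it catches the big offenders.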

Noon Silk
Most robots obey robots.txt, but others ignore it, and I also want robots to index my pages but not my scripts. The user-agent lists also need to be complete and kept up to date to stay accurate.
Peter
+1  A: 

It depends on what you want to achieve. If you want search bots to stop visiting certain paths/pages, you can list them in robots.txt. The majority of well-behaved bots will stop hitting them.

If you want bots to index these paths but you don't want to see them in your reports, then you need to implement some filtering logic. All major bots send a very distinctive user-agent string (e.g. Googlebot/2.1), and you can use these strings to filter their hits out of your reporting.
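
As a rough illustration of the second approach, assuming the raw user-agent is stored with each logged hit (the page_views table and column names below are made up), the report query could simply exclude the known bot strings:

<?php
// Assumed schema: a page_views table with url and user_agent columns (names made up).
$botSignatures = array('Googlebot', 'Slurp', 'msnbot', 'Baiduspider');

$conditions = array();
foreach ($botSignatures as $signature) {
    $conditions[] = "user_agent NOT LIKE '%" . $signature . "%'";
}

// Exclude crawler hits from the report without changing what gets logged.
$sql = "SELECT url, COUNT(*) AS views FROM page_views WHERE "
     . implode(' AND ', $conditions)
     . " GROUP BY url";
?>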

DmitryK
Have a look here: http://www.useragentstring.com/pages/Crawlerlist/ They have a good list of user-agent strings used by search bots.
DmitryK
I want bots to visit all the pages as usual, so filtering by user-agent might be the simplest method. http://www.user-agents.org/ is another source. I guess these lists still need regular updating and a simple way of filtering through them.
Peter
A: 

Don't reinvent the wheel!

Any statistical tool these days filters out robot requests. You can install AWStats (open source) even on shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding just a script at the end of your pages. Both solutions are very good. That way, you only have to log your errors yourself (500, 404 and 403 are enough).
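
If you go that route, here is a rough sketch of logging only errors, assuming Apache with .htaccess lines such as ErrorDocument 404 /log_error.php and a handler along these lines:

<?php
// log_error.php - assumed to be wired up in .htaccess with lines such as:
//   ErrorDocument 403 /log_error.php
//   ErrorDocument 404 /log_error.php
//   ErrorDocument 500 /log_error.php
// Apache passes the original request through the REDIRECT_* variables.
$status = isset($_SERVER['REDIRECT_STATUS']) ? $_SERVER['REDIRECT_STATUS'] : 'unknown';
$url    = isset($_SERVER['REDIRECT_URL'])    ? $_SERVER['REDIRECT_URL']    : '';

// Append a timestamped line per failed request.
file_put_contents('errors.log', date('Y-m-d H:i:s') . "\t" . $status . "\t" . $url . "\n", FILE_APPEND);

echo 'Sorry, something went wrong (' . htmlspecialchars($status) . ').';
?>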

backslash17
I am already using AWStats and Google Analytics, but want some alternative options, as Analytics has a 1+ day delay in showing stats and I want to track more specific activity not provided by other tools.
Peter
+1  A: 

Just to add: a technique you can employ within your interface is to use JavaScript to encapsulate the actions that lead to certain user-interaction view/counter increments. For a very rudimentary example, a robot will not (and cannot) follow:

<a href="javascript:viewItem(4)">Chicken Farms</a>

function viewItem(id)
{
    // Redirect with a marker so the server can tell the hit came from a real user click.
    window.location.href = 'http://www.example.com/items?id=' + id + '&from=userclick';
}

Clicks handled that way are easy to track, because each one yields a request such as

www.example.com/items?id=4&from=userclick

That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
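
On the server side, the counter increment could then be restricted to requests carrying that parameter; a minimal sketch, where record_click() is a hypothetical helper for whatever storage you use:

<?php
// Only count the hit when it arrived via the JavaScript redirect above.
if (isset($_GET['from']) && $_GET['from'] === 'userclick') {
    $itemId = isset($_GET['id']) ? (int) $_GET['id'] : 0;
    record_click($itemId); // hypothetical helper that increments your counter
}
?>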

karim79
I assumed most/all robots don't follow JavaScript? The main drawback of this technique is the intrusive JavaScript and the content being inaccessible to users with JavaScript disabled. But this can easily be fixed with more accessible code, and then you'd have two tracking systems: visits WITH JavaScript and visits WITHOUT. Visits with JavaScript aren't robots; visits without can then be filtered by user-agent...
Peter
@Peter - no, they can't follow JavaScript, and it's intrusive and bad for SEO. I just felt the need to point out this technique, as I've known developers to use it to hide copious numbers of links, to prevent Google from flagging their pages as 'spamdexes'.
karim79
@Peter - ...and I don't do that myself :)
karim79