I am doing my own visitor tracking with custom features that neither Google Analytics nor any other service can provide. I was calling the tracking function near the end of my script, but our clients quickly ran into thousands of page requests from bots (I assume Google), and my table filled up with around 1,000,000 useless and misleading records in the space of a month.

The method most people use is to put JavaScript at the bottom of the page; bots generally don't execute JavaScript, so this is an easy fix -- but I am looking for a PHP solution.

The last thing I tried was PHP's get_browser (http://us2.php.net/manual/en/function.get-browser.php), checking its crawler property. This didn't work.
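
For reference, a check of this kind might look roughly like the sketch below. It assumes the `browscap` directive in php.ini points at a current browscap.ini file (get_browser does not work without it); `isCrawlerEntry` is a hypothetical helper name, and `trackVisitor` stands in for whatever the real tracking call is.

```php
<?php
// Hypothetical helper: reads the "crawler" flag from a get_browser() result array.
// Depending on the browscap file, the flag may arrive as "1", "true", "" or "false",
// so FILTER_VALIDATE_BOOLEAN is used to normalise it.
function isCrawlerEntry(array $browserInfo): bool
{
    return filter_var($browserInfo['crawler'] ?? false, FILTER_VALIDATE_BOOLEAN);
}

// Inside a web request (requires the browscap directive to be configured):
// $info = get_browser(null, true);   // true = return an array instead of an object
// if (is_array($info) && !isCrawlerEntry($info)) {
//     trackVisitor();                // hypothetical tracking function
// }
```

The result is only as good as the browscap.ini file behind it, so crawlers missing from that file are reported as ordinary browsers.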

I have looked at this post: http://stackoverflow.com/questions/450835/how-do-you-stop-scripters-from-slamming-your-website-hundreds-of-times-a-second

But the main solution to that was doing something similar to SO where it brings up a CAPTCHA. My point is not to stop the bots -- I want the pages crawled. I simply don't want to send my visitor tracking data when they are there.

I have switched to a JavaScript solution for now, performing an AJAX request, because our users were getting irritated by the inaccurate statistics.

So, is there a reliable way to do this in PHP?

+1  A: 

I've never used that function before - interesting.

Now, all the major search engines will declare themselves with a distinct User-Agent header, which is where I assume this function is getting most of its information from - it's probably matching the User-Agent value against a lookup table, and it could be that newer indexers are not being identified correctly.

You could write your own list, and test the $_SERVER['HTTP_USER_AGENT'] superglobal against that - but you'd have to monitor for updates.
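
A minimal sketch of that approach, assuming a hand-maintained list of substrings: the entries in `BOT_SIGNATURES` are illustrative and far from complete, `isKnownBot` and `trackVisitor` are hypothetical names, and treating a missing User-Agent as a bot is a design choice rather than a rule.

```php
<?php
// Illustrative, incomplete list of lowercase substrings found in crawler User-Agents.
// This is the part that has to be monitored and updated over time.
const BOT_SIGNATURES = ['googlebot', 'bingbot', 'slurp', 'duckduckbot', 'baiduspider', 'yandexbot'];

// Returns true when the User-Agent matches a known crawler signature.
function isKnownBot(?string $userAgent): bool
{
    if ($userAgent === null || $userAgent === '') {
        return true; // assumption: real browsers always send a User-Agent
    }
    foreach (BOT_SIGNATURES as $signature) {
        // stripos() does a case-insensitive substring search
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Only record the visit when the client does not look like a known crawler.
if (!isKnownBot($_SERVER['HTTP_USER_AGENT'] ?? null)) {
    // trackVisitor(); // hypothetical tracking function
}
```

Substring matching keeps the list short, at the cost of an occasional false positive if a genuine browser string happens to contain one of the signatures.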

It also won't stop bad or malicious indexers, since they would tend to disguise themselves as a normal browser (just like any other header from the client, User-Agent is not to be trusted).

HorusKol
Making this the accepted solution as it seems to be the best way and definitely the best answer, though I don't consider it completely reliable. And yes, the index I used was supposed to be recent, but oh well.
Kerry
A: 

There is an API (with PHP examples) at www.atlbl.com that you can use to distinguish normal users from web spiders and other bad webbots.