I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.

+6  A: 

The easiest way is to check whether their user agent includes 'bot' or 'spider'. Most do.
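
A minimal PHP sketch of that check; the substrings beyond 'bot' and 'spider' (e.g. 'wget', 'curl') are illustrative additions, not a definitive list:

```php
<?php
// Rough human/bot check based only on the User-Agent string.
// The extra substrings beyond 'bot' and 'spider' are illustrative assumptions.
function looks_like_bot($userAgent)
{
    $needles = array('bot', 'spider', 'crawl', 'slurp', 'wget', 'curl');
    $ua = strtolower($userAgent);

    foreach ($needles as $needle) {
        if (strpos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: only count the view when the check fails.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looks_like_bot($ua)) {
    // log_page_view(); // hypothetical stats-logging function
}
```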

Yacoby
Hmm. Could it be that easy? But then, there are user agents like wget or getleft that would be nice to recognize as well. Still - +1
Pekka
The legitimate ones do. The bad ones (e.g., email harvesters) will just hijack a useragent string from a web browser.
Bob Kaufman
And the ones that don't probably don't want you to know they are bots anyway.
Svish
Sorry I can't accept two answers!
Pekka
A: 

Sorry, I misunderstood. You may try another option I have set up on my site: create a non-linked web page with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to build your bot list dynamically.
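
A rough PHP sketch of such a honeypot page, assuming a writable log file next to the script (file and log names are made up for the example):

```php
<?php
// honeypot.php - never linked from any visible page, so almost every
// request that reaches it comes from a bot.

$logFile   = __DIR__ . '/bot-user-agents.log';
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
$ip        = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'unknown';

// One line per visit; this file becomes the dynamically built bot list.
$line = date('c') . "\t" . $ip . "\t" . $userAgent . "\n";
file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);

// Serve an empty page so there is nothing worth indexing.
header('Content-Type: text/html; charset=utf-8');
echo '<!DOCTYPE html><html><head><title></title></head><body></body></html>';
```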

Original answer follows (getting negative ratings!)

The only reliable way to tell bots from humans is a [CAPTCHA][1]. You can use [reCAPTCHA][2] if it suits you.

[1]: http://en.wikipedia.org/wiki/Captcha
[2]: http://recaptcha.net/

Ast Derek
See my clarification in the question above.
Pekka
Sorry, I misunderstood. You may try another option I have set up on my site: create a non-linked web page with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to build your bot list dynamically.
Ast Derek
Nice idea, have not heard of that before! :)
Pekka
You could call it a honeypot: http://www.slightlyshadyseo.com/index.php/dynamic-crawler-identification-101-trapping-the-bots/
Frank Farmer
I called it HoneyPot www.magentaderek.com/guestbook/
Ast Derek
+6  A: 

To start with, if your software is going to be JavaScript-based, the majority of bots will be automatically stripped out, as bots generally don't have JavaScript.

Nevertheless, the straight answer to your question is to follow a bot list and add the listed user-agents to your filtering list.

Take a look at this bot list.

This user-agent list is also pretty good. Just strip out all the B's and you're set.
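
If you follow one of those lists, the filtering itself is simple. A sketch, assuming the list has been exported to a local text file with one user-agent substring per line (the file name is an assumption):

```php
<?php
// Filtering against an externally maintained bot list.
function load_bot_patterns($file)
{
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $lines === false ? array() : array_map('strtolower', $lines);
}

function is_listed_bot($userAgent, array $patterns)
{
    $ua = strtolower($userAgent);
    foreach ($patterns as $pattern) {
        if ($pattern !== '' && strpos($ua, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

$patterns  = load_bot_patterns(__DIR__ . '/bot-user-agents.txt');
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (!is_listed_bot($userAgent, $patterns)) {
    // count this request in the stats
}
```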

Hope it helps!

Frankie
Depending on the market you are aiming at, neither do a lot of users. A lot of Firefox users tend to use NoScript.
Yacoby
The bot lists look good. Maybe a combined JS / botlist solution, with a frequent list update, is the way to go. Cheers!
Pekka
NoScript also means no StackOverflow, no Gmail, Reader, Maps, Facebook, YouTube and so on... I use NoScript all the time to check my own sites for spiders and bots, but nowadays it doesn't make much sense to use NoScript. Just my opinion.
Frankie
I wonder what made you edit this old answer.
Col. Shrapnel
@Col. Shrapnel: lol he just added a comma.
Cam
@Col. It's just like Jeff puts it, always trying to suck a bit less... re-read it yesterday and thought the comma would make it easier to read! :)
Frankie
A: 

Have a 1x1 gif in your pages that you keep track of. If it's loaded, it's likely a browser. If it's not loaded, it's likely a script.
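
A possible PHP implementation of that tracking image, with made-up parameter and log-file names:

```php
<?php
// track.php - referenced from every page as <img src="track.php?page=home" width="1" height="1">.
// Browsers request the image; most scripts that only fetch the HTML do not.

$page      = isset($_GET['page']) ? $_GET['page'] : 'unknown';
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

file_put_contents(
    __DIR__ . '/human-views.log',
    date('c') . "\t" . $page . "\t" . $userAgent . "\n",
    FILE_APPEND | LOCK_EX
);

// A transparent 1x1 GIF, sent with no-cache headers so repeat views are counted.
header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');
header('Expires: 0');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```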

neoneye
That is a clever idea as well. Will think about that, maybe in combination with the others.
Pekka
We do this on each page (with a parameter for the ID of the page's log entry) and use it to establish/log "rendering time".
Kristen
Many bots index images as well as HTML.
RickNZ
+1  A: 

Consider a PHP stats script which is camouflaged as a CSS background image (send the right response headers, at least the content type and cache control, but write an empty image out).

Some bots parse JS, but certainly none of them load CSS images. One pitfall is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population.
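
A sketch of such a script, assuming the GD extension is available for writing out the empty image (file, selector and log names are made up):

```php
<?php
// stats.css.php - referenced from the stylesheet rather than the HTML, e.g.:
//   body { background-image: url('stats.css.php'); }

file_put_contents(
    __DIR__ . '/css-image-hits.log',
    date('c') . "\t" . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '') . "\n",
    FILE_APPEND | LOCK_EX
);

// The headers mentioned above: a correct content type plus cache control,
// so the "image" is re-requested on every visit (see the comments below).
header('Content-Type: image/png');
header('Cache-Control: no-cache, no-store, must-revalidate');
header('Expires: 0');

// Write an empty (fully transparent 1x1) PNG out.
$img = imagecreatetruecolor(1, 1);
imagesavealpha($img, true);
imagefill($img, 0, 0, imagecolorallocatealpha($img, 0, 0, 0, 127));
imagepng($img);
imagedestroy($img);
```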

BalusC
Interesting idea.
Pekka
Might the CSS background image be cached on subsequent visits and not re-requested?
Kristen
@Kristen: not if you add no-cache headers.
BalusC
A good idea... until everyone does this. :)
carl
A: 

Rather than trying to maintain an impossibly long list of spider user agents, we look for things that suggest human behaviour. The principal one is that we split our session count into two figures: the number of single-page sessions and the number of multi-page sessions. We drop a session cookie and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session.

You may have other characteristics that imply a "human" visitor: the referrer is Google, for example (although I believe that the MS search bot masquerades as a standard user agent, referred with a realistic keyword, to check that the site doesn't show different content to that given to their bot, and that behaviour looks a lot like a human!).

Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).

Taking the data from one of our clients, we find that the daily single-session count is all over the place, an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day, we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
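
A rough PHP sketch of the two-cookie logic described above; the cookie names are arbitrary and the actual counting is left as comments:

```php
<?php
// A request counts towards the multi-page-session figure if it carries either
// the session cookie (seen earlier in this visit) or the persistent
// machine-id cookie (seen on an earlier visit).

session_start(); // drops the session cookie on the first page view

$isReturningMachine = isset($_COOKIE['machine_id']);
$isRepeatPageView   = isset($_SESSION['seen_page_before']);

if (!$isReturningMachine) {
    // Persistent "Machine ID" cookie, valid for a year.
    setcookie('machine_id', md5(uniqid('', true)), time() + 365 * 24 * 3600, '/');
}
$_SESSION['seen_page_before'] = true;

if ($isReturningMachine || $isRepeatPageView) {
    // count towards the multi-page-session figure
} else {
    // count towards the single-page-session figure; a second request in the
    // same session will move this visitor into the multi-page bucket
}
```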

Kristen
A: 

I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.

A programmatic solution just won't do: See what happens when PARRY Encounters the DOCTOR

These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and the DOCTOR as a stereotypical psychotherapist.

Here's some more background

MTS
from the question (half a year old, BTW): `To clarify: I'm not looking to block bots`.
Col. Shrapnel
I was just being jokey. I thought people might enjoy PARRY and the DOCTOR. It's pretty hilarious, especially that it was published as an RFC.
MTS
A: 

Record mouse movement and scrolling using JavaScript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.

neoneye