views: 243
answers: 3

I found this question very interesting: Programmatic Bot Detection. I have a very similar question, but I'm not bothered about 'badly behaved bots'.

I am tracking (in addition to Google Analytics) the following per visit:

  • Entry URL
  • Referer
  • UserAgent
  • Adwords (by means of query string)
  • Whether or not the user made a purchase
  • etc.

The problem is that when I calculate any kind of conversion rate, I end up with lots of 'bot' visits that greatly skew my results.

I'd like to ignore as many bot visits as possible, but I want a solution that I don't need to monitor too closely, that won't itself be a performance hog, and that preferably still works if someone has JavaScript disabled.

Are there good published lists of the top 100 bots or so? I did find a list at http://www.user-agents.org/ but that appears to contain hundreds if not thousands of bots. I don't want to check every referer against thousands of links.

Here is the current googlebot UserAgent. How often does it change?

 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
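
For scale, the sort of check I had in mind is just a handful of case-insensitive substring tests against the UserAgent, roughly like the sketch below (the token list is purely illustrative, not a vetted set):

    # Rough sketch of a short blacklist check (token list is illustrative only).
    KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "slurp", "baiduspider", "yandexbot")

    def looks_like_bot(user_agent):
        """Return True if the UserAgent contains any known bot token."""
        ua = (user_agent or "").lower()
        return any(token in ua for token in KNOWN_BOT_TOKENS)

    # The current Googlebot UA quoted above matches:
    print(looks_like_bot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True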
+1  A: 

You could try importing the Robots database from robotstxt.org and using that to filter out requests from those User-Agents. It might not be much different from User-agents.org, but at least the robotstxt.org list is 'owner-submitted' (supposedly).

That site also links to botsvsbrowsers.com, although I don't immediately see a downloadable version of their data.
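
As a minimal sketch of the import idea, assuming the robotstxt.org database can be downloaded as a plain-text dump with one "robot-useragent:" line per robot (check the site for the actual download location and field format), you could build a lookup set like this:

    # Sketch: build a User-Agent blacklist from a robotstxt.org database dump.
    # Assumes a text file with one "robot-useragent: ..." line per robot; check
    # robotstxt.org for the actual download location and field format.
    def load_robot_useragents(path):
        """Collect the user-agent strings declared in the robots database dump."""
        agents = set()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                if line.lower().startswith("robot-useragent:"):
                    ua = line.split(":", 1)[1].strip().lower()
                    if ua:
                        agents.add(ua)
        return agents

    def is_listed_bot(user_agent, agents):
        """True if any listed robot user-agent appears inside the visitor's UA."""
        ua = (user_agent or "").lower()
        return any(listed in ua for listed in agents)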

Also, you said

I don't want to check every referer against thousands of links.

which is fair enough - but if runtime performance is a concern, just 'log' every request and filter them out as a post-process (an overnight batch, or as part of the reporting queries).
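
For example, a nightly batch along these lines keeps the per-request path untouched; the table, column names, and token list below are only placeholders for illustration:

    # Sketch of an overnight batch that flags bot sessions after the fact, so the
    # per-request logging path stays untouched. The sessionvisit table, useragent
    # and is_bot columns, and the token list are assumptions for illustration.
    import sqlite3

    BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

    def flag_bot_sessions(db_path):
        conn = sqlite3.connect(db_path)
        for token in BOT_TOKENS:
            # Case-insensitive substring match, equivalent to LIKE '%token%'.
            conn.execute(
                "UPDATE sessionvisit SET is_bot = 1 WHERE LOWER(useragent) LIKE ?",
                ("%" + token + "%",),
            )
        conn.commit()
        conn.close()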

This point also confuses me a bit:

preferably still work if someone has javascript disabled.

Are you writing your log on the server side as part of every page you serve? JavaScript should not make any difference in that case (although obviously those with JavaScript disabled will not get reported via Google Analytics).

p.s. having mentioned robotstxt.org, it's worth remembering that well-behaved robots will request /robots.txt from your website root. Perhaps you could use that to your advantage by logging possible robot User-Agents that you might want to exclude (although I wouldn't automatically exclude a UA on that basis alone, in case a regular web user types /robots.txt into their browser and your code ends up ignoring real people). I don't think that would cause too much maintenance overhead over time.
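
One way to act on that, sketched as generic WSGI middleware (the logger setup and how it is wired into your app are placeholders):

    # Sketch: record the User-Agent of anything requesting /robots.txt so you can
    # review (not automatically exclude) those UAs later. Generic WSGI middleware;
    # where it logs to and how it wraps your app are up to you.
    import logging

    robot_log = logging.getLogger("possible_robots")

    class RobotsTxtWatcher:
        def __init__(self, app):
            self.app = app  # the wrapped WSGI application

        def __call__(self, environ, start_response):
            if environ.get("PATH_INFO") == "/robots.txt":
                robot_log.info("robots.txt requested by UA: %s",
                               environ.get("HTTP_USER_AGENT", "(none)"))
            return self.app(environ, start_response)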

CraigD
A: 

I realized that it's probably easier to do the exact reverse of what I was attempting.

i.e.

select count(*) as count, useragent from sessionvisit 
where useragent not like '%firefox%' 
and useragent not like '%chrome%'
and useragent not like '%safari%'
and useragent not like '%msie%'
and useragent not like '%gecko%'
and useragent not like '%opera%'
group by useragent order by count desc

What I'm actually trying to do is get an accurate conversion rate, and it seems to make more sense to include good browsers rather than exclude bots (good or bad).

In addition, if I ever find a 'session' where a 'robot' has made a purchase, it probably means there is a new browser (think Chrome). Currently none of my 'robots' have made a purchase!
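
Sketched outside SQL, the same whitelist idea applied to the conversion rate itself (the purchased column and the use of sqlite3 are assumptions; the browser tokens mirror the query above):

    # Sketch: conversion rate counting only sessions whose UserAgent matches a
    # known browser token. The sessionvisit table, the purchased column, and the
    # use of sqlite3 are assumptions; the tokens mirror the query above.
    import sqlite3

    BROWSER_TOKENS = ("firefox", "chrome", "safari", "msie", "gecko", "opera")

    def whitelist_conversion_rate(db_path):
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT useragent, purchased FROM sessionvisit").fetchall()
        conn.close()
        good = [bought for ua, bought in rows
                if any(tok in (ua or "").lower() for tok in BROWSER_TOKENS)]
        return (sum(1 for bought in good if bought) / len(good)) if good else 0.0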

Simon_Weaver
A: 

There is a bot/webcrawler detection API available at www.atlbl.com