I'm trying to construct a regexp that evaluates to true for the User-Agent strings of browsers navigated by humans, but false for bots. Needless to say, the matching will not be exact, but if it gets things right in, say, 90% of cases, that is more than good enough.

My approach so far is to target the User-Agent strings of the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically, I want the regexp NOT to match if the User-Agent belongs to a bot (Googlebot, msnbot, etc.).

Currently I'm using the following regexp, which appears to achieve the desired precision:

^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$
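
For concreteness, here is a minimal sketch of how I apply it (Python purely for illustration; the function name is my own):

import re

# The regexp above, compiled once; matching is case-sensitive and anchored
# at the start of the raw User-Agent header value.
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

def looks_like_human_browser(user_agent):
    return bool(BROWSER_RE.match(user_agent or ""))

# e.g. True for a Chrome UA ("Mozilla/5.0 ... (KHTML, like Gecko) Chrome/...")
# and False for "Googlebot/2.1 (+http://www.google.com/bot.html)"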

I've observed a small number of false negatives, which are mostly mobile browsers. The exceptions all match:

(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)
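
If it matters, folding those exceptions in would look roughly like this (again Python, just a sketch; the mobile pattern is searched anywhere in the string rather than anchored):

import re

BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")
MOBILE_RE = re.compile(r"(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)")

def looks_like_human_browser(user_agent):
    ua = user_agent or ""
    # Accept the desktop pattern, or any of the mobile UAs it currently misses.
    return bool(BROWSER_RE.match(ua)) or bool(MOBILE_RE.search(ua))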

My question is: Given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives for the given regexp?

Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are outside the scope of this question.

A: 

You could construct a blacklist by checking which user agents access robots.txt.
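
For example, something along these lines could pull the offending user agents out of an access log (a sketch only; the combined log format and the helper name are assumptions):

import re
from collections import Counter

# Assumes an Apache/nginx "combined" log: ... "GET /robots.txt HTTP/1.1" 200 1234 "referer" "user-agent"
LOG_LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

def robots_txt_user_agents(log_path):
    """Count User-Agents that requested robots.txt; candidates for a blacklist."""
    bots = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LOG_LINE_RE.search(line)
            if m and m.group("path").startswith("/robots.txt"):
                bots[m.group("ua")] += 1
    return bots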

Sjoerd
+1  A: 

Many crawlers don’t send an Accept-Language header, while AFAIK all browsers do. You could combine this information with your regex to get more accurate results.
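
Roughly like this (a sketch in Python; the regex is just the one from your question, and how you combine the two signals is up to you):

import re

BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

def probably_human(user_agent, accept_language):
    # Most crawlers omit Accept-Language, so require it on top of a
    # browser-looking User-Agent.
    return bool(accept_language) and bool(BROWSER_RE.match(user_agent or ""))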

toscho