




This is a general question about writing web apps.

I have an application that counts page views of articles, as well as a URL shortener script that I've installed for a client of mine. The problem is that whenever bots hit the site, they inflate the page views.

Does anyone have an idea on how to go about eliminating bot views from the view count of these applications?

+2  A: 

Check the User-Agent header. Its value lets you distinguish bots from regular browsers/users.

For example,

Google bot:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


A regular browser (Safari on Mac OS X):

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; lv-lv) AppleWebKit/531.22.7 (KHTML, like Gecko) Version/4.0.5 Safari/531.22.7
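
A minimal sketch of that check in Python, assuming a simple substring match is good enough for your traffic. The token list and the function name `looks_like_bot` are made up for illustration; the generic "bot" token is deliberately broad and could misclassify the odd real browser, so adjust the list to what you see in your own logs.

```python
# Hypothetical token list (an assumption, not an official registry);
# extend it with the crawlers that show up in your own access logs.
BOT_TOKENS = ("googlebot", "bingbot", "slurp", "crawler", "spider", "bot")


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)


# The two User-Agent strings quoted in this answer.
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
safari = (
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; lv-lv) "
    "AppleWebKit/531.22.7 (KHTML, like Gecko) Version/4.0.5 Safari/531.22.7"
)

print(looks_like_bot(googlebot))
print(looks_like_bot(safari))
```

The match is case-insensitive because crawlers are not consistent about capitalization in their User-Agent strings.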
+4  A: 

There are a few ways to determine whether your articles are being viewed by an actual user or by a search engine bot. Probably the best way is to check the User-Agent header sent by the browser (or bot). The User-Agent header is a request field that identifies the client application used to access the resource. For example, Internet Explorer might send something like Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US). Google's bot might send something like Googlebot/2.1 (+http://www.google.com/bot.html). It is possible to send a fake User-Agent header, but I can't see the average site user or a major company like Google doing that. If the header is blank or matches a common User-Agent string associated with a commercial bot, the request is most likely from a bot.
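
Here is a rough sketch of how the view counter could apply that rule, assuming the request headers arrive as a plain dict keyed by "User-Agent". The names `record_view`, `view_counts` and `KNOWN_BOT_TOKENS` are hypothetical, and the in-memory counter only stands in for whatever store the real application uses.

```python
from collections import Counter

# Assumed sample of commercial bot tokens; not an exhaustive list.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "slurp")

# In-memory stand-in for the real page-view store (an assumption).
view_counts = Counter()


def record_view(article_id: str, headers: dict) -> bool:
    """Count the view unless the User-Agent is blank or matches a known bot.

    Returns True if the view was counted, False if it was skipped.
    """
    user_agent = (headers.get("User-Agent") or "").strip().lower()
    if not user_agent:
        return False  # blank or missing header: treat as a bot
    if any(token in user_agent for token in KNOWN_BOT_TOKENS):
        return False  # matches a known commercial bot
    view_counts[article_id] += 1
    return True
```

Calling `record_view("article-42", {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"})` leaves the count untouched, while a normal browser string increments it. Most web frameworks expose headers through a case-insensitive mapping, so the exact lookup will differ from this plain dict.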

While you're at it, you may want to make sure you have an up-to-date robots.txt file. It's a simple text file that tells automated bots which content they are not allowed to retrieve for indexing. It only works for bots that choose to respect it.
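
As an illustration, the snippet below parses a small sample robots.txt with Python's standard `urllib.robotparser` and asks which paths a crawler may fetch. The `/go/` and `/stats/` paths are invented for the example (say, the shortener's redirect URLs and a stats page); substitute the paths your own site uses.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; the paths are assumptions made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /go/
Disallow: /stats/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a crawler identifying as "Googlebot" may fetch each path.
print(parser.can_fetch("Googlebot", "/go/abc123"))
print(parser.can_fetch("Googlebot", "/articles/how-to"))
```

Keeping crawlers away from the redirect URLs reduces bot hits on the shortener, but it is no substitute for the User-Agent check, since badly behaved bots ignore robots.txt entirely.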

Here are a few resources that may be helpful:
