When a user clicks a link to download a file on my website, the link goes to a PHP script that increments a download counter for that file and then header()-redirects them to the actual file (roughly as sketched below). I suspect, however, that bots are following the download link, so the number of downloads is inaccurate.

  • How do I let bots know that they shouldn't follow the link?
  • Is there a way to detect most bots?
  • Is there a better way to count the number of downloads a file gets?
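
Roughly, the script does something like this (a sketch only; the database table, column names, and /files/ path are placeholders):

    <?php
    // Increment the counter for the requested file, then redirect to it.
    mysql_connect('localhost', 'user', 'password');
    mysql_select_db('mysite');
    $file = basename($_GET['file']);  // e.g. download.php?file=example.zip
    mysql_query("UPDATE downloads SET hits = hits + 1 WHERE filename = '" .
                mysql_real_escape_string($file) . "'");
    header('Location: /files/' . $file);
    exit;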
+10  A: 

robots.txt: http://www.robotstxt.org/robotstxt.html

Not all bots respect it, but most do. If you really want to prevent access via bots, make the link to it a POST instead of a GET. Bots will not follow POST URLs. (I.e., use a small form that POSTs back to the site and takes you to the URL in question.)
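
A minimal sketch of both ideas, assuming the counter script lives at /download.php (a placeholder; adjust to your actual path):

    # robots.txt -- ask well-behaved crawlers to skip the counter script
    User-agent: *
    Disallow: /download.php

    <!-- replace the plain GET link with a small form that POSTs instead -->
    <form action="/download.php" method="post">
      <input type="hidden" name="file" value="example.zip">
      <input type="submit" value="Download example.zip">
    </form>

The script would then read the filename from $_POST instead of $_GET.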

Godeke
+4  A: 

I would think Godeke's robots.txt answer would be sufficient. If you absolutely cannot have the bots inflating your counter, then I would recommend using the robots file in conjunction with not incrementing the count for some common robot user agents.

Neither way is perfect, but the mixture of the two is probably a little stricter. If it were me, I would probably just stick to the robots file, though, since it is easy and probably the most effective solution.
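
A rough sketch of that check, assuming the existing counter logic sits in a hypothetical increment_counter() function and using a deliberately small, incomplete list of bot signatures:

    <?php
    // Returns true if the user agent matches a known crawler signature.
    function is_probably_bot() {
        $agent = isset($_SERVER['HTTP_USER_AGENT'])
               ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
        $bots = array('googlebot', 'slurp', 'msnbot', 'spider', 'crawler');
        foreach ($bots as $bot) {
            if (strpos($agent, $bot) !== false) {
                return true;
            }
        }
        return false;
    }

    $file = basename($_GET['file']);
    if (!is_probably_bot()) {
        increment_counter($file);  // hypothetical: your existing counter code
    }
    header('Location: /files/' . $file);
    exit;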

gpojd
+2  A: 

Godeke is right: robots.txt is the first thing to do to keep bots from downloading.

Regarding the counting, this is really a web analytics problem. Are you not keeping your www access logs and running them through an analytics program like Webalizer or AWStats (or fancy alternatives like Webtrends or Urchin)? To me that's the way to go for collecting this sort of info, because it's easy and there's no PHP, redirect or other performance hit when the user's downloading the file. You're just using the Apache logs that you're keeping anyway. (And grep -c will give you the quick 'n' dirty count on a particular file or wildcard pattern.)
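
For example, assuming a standard Apache log location and a download path of /files/example.zip (both placeholders):

    # quick 'n' dirty count of requests for one file
    grep -c 'GET /files/example.zip' /var/log/apache2/access.log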

You can configure your stats software to ignore hits by bots, or specific user agents and other criteria (and if you change your criteria later on, you just reprocess the old log data). Of course, this does require you have all your old logs, so if you've been tossing them with something like logrotate you'll have to start out without any historical data.

joelhardi
+1  A: 

You can also detect malicious bots, which won't respect robots.txt, using Bad Behavior: http://www.bad-behavior.ioerror.us/.

phjr