I want to write a hit counter script to keep track of hits on images on a website and the originating IPs. Impressions are upwards of hundreds of thousands per day, so the counters will be incremented many times a second.

I'm looking for a simple, self-hosted method (php, python scripts, etc.). I was thinking of using MySQL to keep track of this, but I'm guessing there's a more efficient way. What are good methods of keeping counters?

A: 

Well, if you happen to go the PHP route you could use an SQLite database. However, MySQL is a perfectly reasonable way to store that info, and (at least in the setups I've seen) it's usually how this is done.

If you didn't want to store IP addresses or any other info, a simple number in a text file could work.
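
Even the text-file version needs locking once hits arrive concurrently. A minimal sketch in Python (Unix-only because of `fcntl`; the file name is made up, and the counter file must already exist, e.g. created with `echo 0 > counter.txt`):

    import fcntl

    # Read-increment-write under an exclusive lock; without it, two
    # concurrent hits can read the same value and one increment is lost.
    with open("counter.txt", "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        count = int(f.read() or "0")
        f.seek(0)
        f.write(str(count + 1))
        f.truncate()
        # the lock is released when the file is closed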

Cory
I'm a huge fan of SQLite, and simple text files, but both are unsuitable for this task. Assume more than one server, more than one image, more than one concurrent hit, etc...
Dustin
+5  A: 

A fascinating subject. Incrementing a counter, simple as it may be, just has to be a transaction... meaning, it can lock out the whole DB for longer than makes sense!-) It can easily be the bottleneck for the whole system.

If you need rigorously exact counts but don't need them to be instantly up-to-date, my favorite approach is to append the countable information to a log (switching logs as often as needed for data freshness purposes). Once a log is closed (with thousands of countable events in it), a script can read it and update all that's needed in a single transaction -- maybe not intuitive, but much faster than thousands of single locks.
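
For instance, a rough sketch of the process-a-closed-log step in Python (the log format, file names, and the `sqlite3` target are all illustrative; the point is that thousands of events become one transaction):

    import sqlite3
    from collections import Counter

    # Each line of the closed log: "<image_path> <client_ip>"
    counts = Counter()
    with open("hits.log.closed") as log:
        for line in log:
            path, ip = line.split()
            counts[(path, ip)] += 1

    # Assumes: CREATE TABLE hits (path TEXT, ip TEXT, n INTEGER,
    #                             UNIQUE (path, ip))
    conn = sqlite3.connect("stats.db")
    with conn:  # a single transaction for the whole batch
        conn.executemany(
            "INSERT INTO hits (path, ip, n) VALUES (?, ?, ?) "
            "ON CONFLICT (path, ip) DO UPDATE SET n = n + excluded.n",
            [(p, i, n) for (p, i), n in counts.items()],
        )

(The ON CONFLICT ... DO UPDATE form needs SQLite 3.24 or newer.)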

Then there are extremely fast counters that are only statistically accurate -- but since you don't say that such imprecision is acceptable, I'm not going to explain them in more depth.

Alex Martelli
Most webservers provide this log anyway. See my answer.
middus
@middus, sure, but my point is that (even if what you care about isn't logged, or you're on a host that doesn't give you logs access) you can still count via dedicated logs of your own (it can also be much faster to process such logs than general ones, as you can tailor them to counting needs; you can update the DB reasonably often by closing counting-logs much more often than general ones; etc).
Alex Martelli
Yeah, of course you're right. However, in this specific case, when you're dealing with images, you would have to pipe all the traffic through a script of some sort, which is a rather big overhead. (At least I can't see a better option.)
middus
@middus, better options might include ensuring that images are served by a separate server process (not hard, by arranging directories appropriately -- they normally need to be served on a separate _domain_ anyway to avoid useless cookie overhead) and configuring that separate server's logging for the purpose. But this is getting into more of a serverfault subthread;-).
Alex Martelli
+2  A: 

There are two really easy ways:

  1. Parse it out of your web logs in batch.
  2. Run the hits through beanstalkd or gearmand and have a worker do the hard stuff in a controlled way.

Option 1 works with off-the-shelf tools. Option 2 requires just a bit of programming, but gives you something closer to realtime updates without falling over when traffic spikes (as the direct-to-MySQL approach would); see the sketch below.
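
A sketch of what option 2 might look like with beanstalkd and the `beanstalkc` Python client (the payload format and the `record_hit` helper are made up for illustration):

    import beanstalkc

    # Producer -- runs in the request path; put() returns almost
    # immediately, so serving the image is not slowed down.
    queue = beanstalkc.Connection(host="127.0.0.1", port=11300)
    queue.put("/img/logo.png 203.0.113.7")

    # Worker -- a separate long-running process doing the slow part.
    while True:
        job = queue.reserve()        # blocks until a hit arrives
        path, ip = job.body.split()
        record_hit(path, ip)         # hypothetical: your actual DB update
        job.delete()

Because a single worker drains the queue serially, the database sees a steady, controlled write rate no matter how bursty the front-end traffic is.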

Dustin
A: 

Not sure if it's up your alley, but AppEngine is a pretty nice platform to build on. Some sample code you can use to build a counter using their DataStore and transactions is described here: http://code.google.com/appengine/docs/python/datastore/transactions.html.

jgeewax
We tried to store lots of counters in app engine, and it turns out to be expensive in CPU time. That is, it costs $$.
dfrankow
+1  A: 

If accuracy is important, you can do it slightly slower with MySQL... create a HEAP / MEMORY table to store your counter values. These are in-memory tables that are blazingly fast. You can write the data into a normal table at intervals.
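
A sketch of that setup (table and column names are made up; shown with the `mysql-connector-python` driver, but any MySQL client works):

    import mysql.connector

    conn = mysql.connector.connect(user="stats", database="stats")
    cur = conn.cursor()

    # In-memory staging table: very fast, but contents vanish on restart.
    cur.execute("""CREATE TABLE IF NOT EXISTS hits_mem (
                       path VARCHAR(255) PRIMARY KEY,
                       n    INT NOT NULL
                   ) ENGINE=MEMORY""")

    # Per hit: one cheap in-memory upsert.
    cur.execute("INSERT INTO hits_mem (path, n) VALUES (%s, 1) "
                "ON DUPLICATE KEY UPDATE n = n + 1", ("/img/logo.png",))
    conn.commit()

    # At intervals (cron): fold the staged counts into the durable table,
    # then reset. Hits landing between the two statements are lost, which
    # is usually acceptable for a hit counter.
    cur.execute("INSERT INTO hits (path, n) SELECT path, n FROM hits_mem "
                "ON DUPLICATE KEY UPDATE hits.n = hits.n + VALUES(n)")
    cur.execute("TRUNCATE TABLE hits_mem")
    conn.commit()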

Based on the App Engine ideas, you could use memcache as a temporary store for your counter. Incrementing a memcache counter is faster than using the MySQL heap tables (I think). Once every five or ten seconds, you could read the memcache counter and write that number into your DB.
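
The memcache variant, sketched with the `python-memcached` client (the key name and the `flush_to_db` helper are illustrative):

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])
    key = "hits:/img/logo.png"

    # Per hit: add() creates the counter if it is missing; incr() is atomic.
    mc.add(key, 0)
    mc.incr(key)

    # Every five or ten seconds: drain the counter into the database.
    n = int(mc.get(key) or 0)
    if n:
        mc.decr(key, n)          # subtract exactly what was read, so
                                 # concurrent increments are not lost
        flush_to_db(key, n)      # hypothetical: your MySQL write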

Sudhir Jonathan
+2  A: 

You could take your webserver's access log (Apache: access.log) and evaluate it periodically (cron job), in case you do not need to have the data at hand at the exact moment someone visits your site.

Usually, the access.log is generated anyway and contains the requested resource as well as the time, date, and the user's IP. This way you do not have to route all traffic through a PHP script. Lean, mean counting machine.
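
A sketch of such a cron job in Python, assuming Apache's common log format (the regex keeps just the fields needed here: client IP, request path, and status code; the log path is illustrative):

    import re
    from collections import Counter

    # Common Log Format: host ident user [time] "method path proto" status bytes
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3})')

    counts = Counter()
    with open("/var/log/apache2/access.log") as log:
        for line in log:
            m = LINE.match(line)
            if m:
                ip, path, status = m.groups()
                if status == "200" and path.endswith((".png", ".jpg", ".gif")):
                    counts[(path, ip)] += 1

    for (path, ip), n in counts.most_common(10):
        print(n, path, ip)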

middus
A: 

You can use Redis. It's a very fast key-value store with support for atomic increments. And if the need arises, the count data can easily be split between multiple servers.
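
Splitting across servers can start out as simple as hashing each key to pick a client. A naive sketch with the `redis-py` library (addresses are made up; a real deployment would use consistent hashing or Redis Cluster so keys don't move when a server is added):

    import zlib
    import redis

    # One client per Redis server.
    shards = [redis.Redis(host="10.0.0.1"), redis.Redis(host="10.0.0.2")]

    def incr_counter(key):
        # A stable hash maps a given key to the same shard every time.
        shard = shards[zlib.crc32(key.encode()) % len(shards)]
        return shard.incr(key)   # atomic on that server

    incr_counter("hits:/img/logo.png")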

Igor Ivanov
A: 

I've done something very similar, on a similar scale (multiple servers, hundreds of domains, several thousand hits per hour) and log file analysis was definitely the way to go. (It also checked hit rates, weighted them by file type, and blacklisted IP addresses at the firewall if they were making too many requests; its intended purpose was to auto-block bad bots, not to just be a counter, but counting was an essential piece of it.)

No performance impact on the web server process itself, since it's not doing any additional work there, and you could easily publish periodically-updated hit counts by injecting them into the site's database every minute/5 minutes/100 hits/whatever without having to lock the relevant row/table/database (depending on the locking mechanism in use) on every hit.

Dave Sherohman
This sounds awesome, any tips on how to begin creating this? Can it be done with PHP?
swt83
@swt83: I used Perl, but my understanding is that PHP has been extended to run in non-web environments, so it should be possible. For a simple counter, it's basically just a matter of setting up a cron job to run every so often, check the access log for new hits (record the end-of-file position each time so that the next run can pick up from where the last one left off instead of re-processing the whole log), and process/summarize them however you like. If you want it real-time, write it as a `tail -f` type monitor that reads from the file as it's written (just watch out for log rotation).
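A bare-bones version of that cron job in Python (paths and the `process` helper are made up; the saved offset is what lets each run resume where the last one stopped):

    import os

    LOG = "/var/log/apache2/access.log"
    OFFSET_FILE = "/var/tmp/counter.offset"

    # Where did the previous run stop?
    try:
        offset = int(open(OFFSET_FILE).read())
    except (OSError, ValueError):
        offset = 0

    # If the log shrank, it was rotated: start from the top.
    if os.path.getsize(LOG) < offset:
        offset = 0

    with open(LOG, "rb") as log:
        log.seek(offset)
        for raw in log:
            process(raw.decode("utf-8", "replace"))  # hypothetical: count the hit
        offset = log.tell()

    with open(OFFSET_FILE, "w") as f:
        f.write(str(offset))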
Dave Sherohman
+1  A: 

Without a doubt, Redis is perfect for this problem. It takes about a minute to install and set up, supports atomic increments, is incredibly fast, has client libs for Python and PHP (and many other languages), and is durable (snapshots, journal, replication).

Store each counter to its own key. Then simply

    INCR key
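
In Python with the `redis-py` client, that (plus the question's per-IP requirement) might look like the following sketch; the key names are made up:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def count_hit(path, ip):
        pipe = r.pipeline()                   # both commands, one round trip
        pipe.incr("hits:" + path)             # total hits for this image
        pipe.hincrby("ips:" + path, ip, 1)    # hits per originating IP
        pipe.execute()

    count_hit("/img/logo.png", "203.0.113.7")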
z8000