I have a site with millions of URLs. Each time a URL is clicked, a database row corresponding to that URL is updated with the timestamp of that click. Using additional columns where needed, but without inserting a distinct row for every click, I would like to estimate the number of clicks per hour each URL receives. Some ideas include storing a handful of timestamps aligned to the most recent second, minute, 15-minute, and hour boundaries (though it's fuzzy to me how that actually gets what we want), or the nastier solution of serializing a "log" of time deltas into the row itself.

While a naive approach suggests measuring the time between the current click and the last one to determine a rate, that would only produce a useful estimate if the link is clicked at a very consistent rate. In reality the link could receive a flurry of clicks in one minute and nothing at all for the next 20.

The reason I don't want to log each click distinctly is just so that the database isn't weighed down with thousands of additional INSERT statements per hour (and the corresponding DELETEs of data more than an hour old), and so that I don't have to fire up an additional storage system (Tokyo Tyrant, grepping Apache logs, etc.) to log these clicks.

+3  A: 

Have you tried another approach, like an external stats service? Maybe Google Analytics? It could give you the info you're looking for without any extra load on your servers.

Juparave
I suppose I'd have to see if they have an API (naturally we are on them already, their script slowing our site down like mud).
zzzeek
Actually, yeah, Google's API would provide this pretty directly.
zzzeek
Although I'm being told they lag too much to get a decent number for the past hour.
zzzeek
And... I'd have to use Chartbeat instead: http://blog.chartbeat.com/2009/04/22/41/
zzzeek
+1  A: 

Is there any reason why you've disregarded processing the Apache access logs? They have the benefit of being timestamped and created automatically by the server, and they are fairly lightweight. A fairly simple Perl or awk script can then keep a running summary of the logs for simple parsing.

Chris J
The complexity there is that they are spread among multiple web servers and are also on log rotation, so the scripts would have to look at the current and potentially previous log files, and also merge across several servers. It's also just another moving part to worry about.
zzzeek
+4  A: 

How about storing a counter in memcached, keyed by the URL, and a last_counter_reset_time in the DB?

Memcached has a lightweight atomic incr operation. Call that on each request. Periodically reset the counter, updating the last_counter_reset_time.

I'm no memcached veteran, but I imagine there are ways of being fairly sure that the counters for all your URLs stay cached. There's no persistence, so you may lose a counter at any time, but occasional data loss of that kind might be acceptable.
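
A minimal sketch of the idea in Python, assuming the python-memcached client and a hypothetical last_counter_reset_time kept in your existing per-URL row (a sketch, not a drop-in implementation):

    # Rough sketch only: assumes the python-memcached client and that the
    # per-URL DB row holds last_counter_reset_time (a Unix timestamp).
    import time
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def record_click(url_id):
        key = 'clicks:%s' % url_id
        mc.add(key, 0)      # no-op if the key already exists
        mc.incr(key)        # lightweight atomic increment

    def estimated_rate_per_hour(url_id, last_counter_reset_time):
        count = mc.get('clicks:%s' % url_id) or 0
        elapsed = time.time() - last_counter_reset_time
        return count * 3600.0 / elapsed if elapsed > 0 else 0.0

    def reset_counter(url_id):
        # Persist the current estimate to the DB row first, then zero the
        # counter and record the new reset time.
        mc.set('clicks:%s' % url_id, 0)
        return time.time()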

Gunnlaugur Briem
Oh, “or alternatively that I don't have to fire up an additional storage system” — so you might have already excluded memcached.
Gunnlaugur Briem
Well, I mentioned Tokyo Tyrant since it can be used very much like a "persistent memcached". It provides a "memcached compatibility" API among others. (continued...)
zzzeek
So I am already storing a single "stats" row per URL in the DB. Suppose the link gets a few hundred hits in an hour. We can see that we are getting 300 req/hour based on last_counter_reset_time being nearly an hour old and the counter being 300. Then we reset last_counter_reset_time to the current time and the counter back to zero. The next request coming in now sees only a handful of requests, and the timer was recently reset, so our estimate would be pretty inaccurate compared to before the counter was reset. Do we also store the last "good" estimate?
zzzeek
Yes, that's the way I imagined it. For that matter you could store a history of request rates over past request-count periods; an insert per request would be terrible but an insert per request-count period is peanuts. The rate estimate then comes out of the latest row for that URL in that request-rate table (or perhaps an average of the last few periods, if you do the reset-and-store at small intervals).
Gunnlaugur Briem
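
A rough sketch of that per-period history, using SQLite for illustration (table and column names are made up; the real schema would live in the existing DB):

    # Sketch of one INSERT per URL per counting period (not per click).
    # Table/column names here are hypothetical.
    import sqlite3
    import time

    conn = sqlite3.connect('clicks.db')
    conn.execute("""CREATE TABLE IF NOT EXISTS url_click_rates (
                        url_id INTEGER,
                        period_start REAL,
                        period_end REAL,
                        click_count INTEGER)""")

    def store_period(url_id, period_start, click_count):
        conn.execute("INSERT INTO url_click_rates VALUES (?, ?, ?, ?)",
                     (url_id, period_start, time.time(), click_count))
        conn.commit()

    def recent_rate_per_hour(url_id, periods=3):
        # Average the last few periods to smooth out bursts.
        rows = conn.execute(
            "SELECT period_start, period_end, click_count "
            "FROM url_click_rates WHERE url_id = ? "
            "ORDER BY period_end DESC LIMIT ?", (url_id, periods)).fetchall()
        if not rows:
            return 0.0
        clicks = sum(r[2] for r in rows)
        seconds = sum(r[1] - r[0] for r in rows)
        return clicks * 3600.0 / seconds if seconds > 0 else 0.0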
Actually, yeah, keeping a histogram of these rates is part of the project. I have a script running over some log files right now to simulate the general idea with a SQLite database, and the results look interesting. I'm going to look at our Google Analytics history to see how closely what I'm seeing here matches up.
zzzeek
You can always use memcached with the -M switch to disable the automatic removal. Then you have to take care of the removal and checking yourself.
jonasl
A: 

First of all, why keep timestamps at all? You could keep exact counts by having one record in the database for each URL, and just incrementing a count each time it is clicked.

If even that is too much load, I would think the next most obvious answer is statistical sampling. Pick a time slice, say ten minutes. For each ten minute slice, pick one URL. Count the number of clicks for that URL. Assume the rate for that ten minutes is consistent and multiply by a constant to get the estimated rate for whatever desired time period. Then for the next ten minute slice pick a different URL. Etc.

Realistically you could probably count more than one URL at a time without overburdening the server, so you could pick some convenient number of URLs, ten or a hundred or whatever your system can handle.

You would also want to consider time of day. If most of your users are in, say, California, then a URL that is sampled at 4:00 pm Pacific time would likely get a much higher number of hits than if it were sampled at 4:00 am. So you'd want to cycle through URLs in a way that ensures that when you come back to a given URL, it is at a different time of day than when you first sampled it. If your users are evenly spread over the entire world this wouldn't be an issue, but that sounds unlikely.
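
A rough sketch of the rotation idea (slice length, sample size, and the in-memory counter are all illustrative assumptions):

    # Sketch of rotating statistical sampling; nothing here is prescriptive.
    import time
    from collections import defaultdict

    SLICE_SECONDS = 600    # ten-minute slices
    URLS_PER_SLICE = 100   # however many URLs the system can afford to count

    counts = defaultdict(int)   # reset (and persisted) at each slice boundary

    def current_slice():
        return int(time.time() // SLICE_SECONDS)

    def is_sampled(url_id, total_urls):
        # The offset advances each slice, so a given URL is revisited
        # at different times of day.
        offset = (current_slice() * URLS_PER_SLICE) % total_urls
        return (url_id - offset) % total_urls < URLS_PER_SLICE

    def record_click(url_id, total_urls):
        if is_sampled(url_id, total_urls):
            counts[url_id] += 1

    def estimated_hourly_rate(url_id):
        # Assume the sampled slice is representative and scale to an hour.
        return counts.get(url_id, 0) * (3600.0 / SLICE_SECONDS)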

Jay
We already store the number of clicks. If I give you URL x and it has 3500 clicks, how many clicks has it received in the past hour? Those clicks could have been minutes or months ago, or both.
zzzeek
Also, the "sampling" stuff seems to assume a log of all clicks. That's easy to do, but I'm trying to avoid storing/parsing logs.
zzzeek
No, I mean just count the clicks received within a given time frame, not all clicks ever, and certainly don't keep a log. You just say: okay, starting now, start counting clicks for URL x. Pick a few URLs to count in any given time frame and don't count the others. If you have the horsepower to count them all, then just keep a count of all. If you care about time frames, keep them in buckets: count how many today, and when you hit midnight, save the counts and reset. Unless you really need to know the exact time of every click -- which I gather you don't -- a log is way more work than necessary.
Jay
A: 

This may not be a practical solution, but since you asked for a "clever" way, here is some academic research on a question that is not exactly your problem but could probably be adapted. Some papers in the "Cited by" list might be even closer.

Jouni K. Seppänen
A: 

If you want exact counts, Redis is ideal for this task. It's roughly comparable in speed to memcached, but offers persistence. The persistence is based on forking and writing sequentially to disk, so it avoids the high I/O load of keeping this sort of information in your database.

If you want a very simple approach: just discard samples in an unbiased way (i.e. log_request(foo) if rand(1) < 0.1, to sample 10% of traffic). You'll lose any signal on URLs accessed less often than the rate you're subsampling at, but if you're mostly interested in the highly accessed URLs this can be very simple and efficient.

There are more complicated variations on the above scheme where you update a counter with a probability that lessens as the count grows (and then weight counters appropriately via the probability function when reading them), which is a sort of bastardized form of importance sampling. These are almost as simple and preserve counts on the tail of the distribution better.
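
A minimal sketch of the unbiased subsampling in Python (the 10% rate and the increment_counter callable are just placeholders for whatever cheap counting is already in place):

    # Count only a random 10% of requests and reweight when reading.
    import random

    SAMPLE_RATE = 0.1   # illustrative

    def record_click(url_id, increment_counter):
        # increment_counter is whatever cheap counter you already have
        # (DB column, memcached incr, ...).
        if random.random() < SAMPLE_RATE:
            increment_counter(url_id)

    def estimated_clicks(sampled_count):
        # Divide by the sampling probability to recover the true count.
        return sampled_count / SAMPLE_RATE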

Edit:

Ah, sorry, I see now from the comments that you're after rates for some period of time. The approach I've used for this is basically the same as the sampling/counter idea, just with individual counters stored per time bracket (i.e. hourly). For keeping long-term archives, have additional summary tables for larger time periods (daily, weekly) that a batch job populates from the fine-grained (hourly) table, letting you delete old data from the fine-grained table.

RRDTool is a more general implementation of this idea, and several OSS monitoring solutions use it.
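
A rough sketch of the hourly-bucket scheme with a daily rollup, again using SQLite for illustration (all names are hypothetical; the upsert syntax needs SQLite 3.24+):

    # One counter row per URL per hour, plus a batch rollup into daily rows.
    import sqlite3
    import time

    conn = sqlite3.connect('clicks.db')
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS hourly_counts (
            url_id INTEGER, hour_start INTEGER, clicks INTEGER,
            PRIMARY KEY (url_id, hour_start));
        CREATE TABLE IF NOT EXISTS daily_counts (
            url_id INTEGER, day_start INTEGER, clicks INTEGER,
            PRIMARY KEY (url_id, day_start));
    """)

    def record_click(url_id):
        # Later clicks in the same hour just bump the existing row.
        hour = int(time.time() // 3600) * 3600
        conn.execute("""
            INSERT INTO hourly_counts (url_id, hour_start, clicks)
            VALUES (?, ?, 1)
            ON CONFLICT(url_id, hour_start) DO UPDATE SET clicks = clicks + 1
        """, (url_id, hour))
        conn.commit()

    def rollup_and_prune(older_than_hours=48):
        # Batch job, assumed to run once per completed day: summarize old
        # hourly rows into daily rows, then delete the fine-grained rows.
        cutoff = int(time.time()) - older_than_hours * 3600
        conn.execute("""
            INSERT INTO daily_counts (url_id, day_start, clicks)
            SELECT url_id, (hour_start / 86400) * 86400, SUM(clicks)
            FROM hourly_counts WHERE hour_start < ?
            GROUP BY url_id, hour_start / 86400
        """, (cutoff,))
        conn.execute("DELETE FROM hourly_counts WHERE hour_start < ?", (cutoff,))
        conn.commit()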

Jason Watkins
What I have now is: the number of hits in the current time bracket, the time that bracket started, and the hits/time stored from the last bracket that was more than N minutes old. The "weighting" you refer to is something I might need to figure out. Suppose we sampled 300 hits over the past hour. Now we've reset our window, 2 minutes have gone by, and we've already gotten 50 hits in those 2 minutes. How can we estimate hits/hour, taking into account the previous hour's greater accuracy and the current 2 minutes' greater timeliness?
zzzeek