I have a web-facing, anonymously accessible blog directory and blogs, and I would like to track the number of views each blog post receives.

I want to keep this as simple as possible; accuracy need only be an approximation. This is not for analytics (we have Google for that), and I don't want to do any log analysis to pull out the stats, since running background tasks in this environment is tricky and I want the numbers to be as fresh as possible.

My current solution is as follows:

  1. A web control that simply records a view in a table for each GET.
  2. Excludes a list of known web crawlers using a regex against the UserAgent string (roughly as in the sketch after this list).
  3. Provides for the exclusion of certain IP addresses (known spammers).
  4. Provides for locking down some posts (when the spammers come for them).
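
In rough pseudocode, those checks come down to something like this (an illustrative Python sketch; the bot patterns, IP addresses and post IDs are placeholders, not the real lists):

```python
import re

# Placeholder values for illustration only.
CRAWLER_PATTERN = re.compile(
    r"googlebot|bingbot|slurp|baiduspider|crawler|spider|bot",
    re.IGNORECASE,
)
BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}  # known spammer addresses
LOCKED_POST_IDS = {42}                          # posts currently locked down

def should_count_view(post_id, user_agent, ip_address):
    """Return True if this GET should be recorded as a view."""
    if post_id in LOCKED_POST_IDS:
        return False
    if user_agent and CRAWLER_PATTERN.search(user_agent):
        return False
    if ip_address in BLOCKED_IPS:
        return False
    return True
```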

This actually seems to do a pretty good job, but a couple of things annoy me. The spammers still hit some posts, thereby skewing the views, and I still have to manually monitor the views and update my list of "bad" IP addresses.

Does anyone have some better suggestions for me? Anyone know how the views on StackOverflow questions are tracked?

+1  A: 

It sounds like your current solution is actually quite good.

We implemented one where the server code that delivered the view content also updated a database table storing the URL (actually a special ID code for the URL, since the URL could change over time) and the view count.

This was actually for a system with user-written posts that others could comment on but it applies equally to the situation where you're the only user creating the posts (if I understand your description correctly).
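
Conceptually the storage is just one table keyed on the post's ID code, bumped with an upsert on every counted view. A minimal sketch, assuming SQLite and made-up table/column names:

```python
import sqlite3

conn = sqlite3.connect("views.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS post_views (
        post_code  TEXT PRIMARY KEY,          -- stable ID code, not the URL itself
        view_count INTEGER NOT NULL DEFAULT 0
    )
""")

def record_view(post_code):
    # Insert the row on the first view, otherwise bump the counter.
    conn.execute("""
        INSERT INTO post_views (post_code, view_count) VALUES (?, 1)
        ON CONFLICT(post_code) DO UPDATE SET view_count = view_count + 1
    """, (post_code,))
    conn.commit()
```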

We had to do the following to minimise (not eliminate, unfortunately) skew.

  • For logged-in users, each user could only add one view point to a post. EVER. NO exceptions.
  • For anonymous users, each IP address could only add one view point to a post each month. This was slightly less reliable, as IP addresses could be 'shared' (NAT and so on) from our point of view, which is why we relaxed the "EVER" requirement to once a month for anonymous users.
  • The posts themselves were limited to having one view point added per time period. The period started low (say, 10 seconds) and gradually increased (to, say, 5 minutes), so new posts were allowed to accrue views faster due to their novelty. This took care of most spam-bots, since we found that they tend to attack long after the post has been created. (These view-counting rules are sketched in the first snippet after this list.)
  • Removal of a spam comment on a post, or a failed attempt to bypass CAPTCHA (see below), automatically added that IP to the blacklist and reduced the view count for that post.
  • If a blacklisted IP hadn't tried to leave a comment in N days (configurable), it was removed from the blacklist. This rule and the previous one minimised the manual intervention in maintaining the blacklist; we only had to monitor responses for spam content. (The second snippet after this list sketches this housekeeping.)
  • CAPTCHA. This solved a lot of our spam problems, especially since we didn't just rely on OCR-type challenges (like "what's this word? -> 'optionally'"); we actually asked questions (like "what's 2 multiplied by half of 8?") that break the dumb character-recognition bots. It won't beat the hordes of cheap-labour CAPTCHA breakers (unless their maths is really bad :-), but the improvement over having no CAPTCHA was impressive.
  • Logged-in users weren't subject to CAPTCHA, but posting spam got the account immediately deleted, the IP blacklisted, and their view subtracted from the post.
  • I'm ashamed to admit we didn't actually discount the web crawlers (I hope the client isn't reading this :-). To be honest, they're probably only adding a minimal number of view points each month due to our IP address rule (unless they're swarming us with multiple IP addresses).
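
To make the view-counting rules concrete, here is a rough Python sketch of the kind of checks involved. The object and field names (post.counted_user_ids, post.last_view_by_ip and so on) are invented for illustration and the thresholds are just examples; it only shows the shape of the logic, not our actual implementation.

```python
import datetime

def rate_limit_window(post_created_at, now):
    """Minimum gap between counted views on one post: small for new posts,
    growing to a few minutes as the post ages (thresholds are illustrative)."""
    age = now - post_created_at
    if age < datetime.timedelta(days=1):
        return datetime.timedelta(seconds=10)
    if age < datetime.timedelta(days=7):
        return datetime.timedelta(minutes=1)
    return datetime.timedelta(minutes=5)

def may_count_view(post, viewer, now):
    # Logged-in users: one view point per post, ever.
    if viewer.user_id is not None:
        return viewer.user_id not in post.counted_user_ids
    # Anonymous users: one view point per IP address per calendar month.
    last_seen = post.last_view_by_ip.get(viewer.ip_address)
    if last_seen is not None and (last_seen.year, last_seen.month) == (now.year, now.month):
        return False
    # Each post accepts at most one view point per rate-limit window.
    last_counted = post.last_counted_view
    if last_counted is not None and now - last_counted < rate_limit_window(post.created_at, now):
        return False
    return True
```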
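
The blacklist housekeeping might look roughly like this (again, invented names; N is the configurable expiry, and the view-count deduction is left out):

```python
import datetime

BLACKLIST_EXPIRY_DAYS = 30   # the configurable "N days"

# IP address -> time of that IP's most recent comment attempt
blacklist = {}

def add_to_blacklist(ip_address, now):
    # Called when a spam comment is removed or a CAPTCHA attempt fails.
    blacklist[ip_address] = now

def note_comment_attempt(ip_address, now):
    # Any further comment attempt from a blacklisted IP keeps it on the list.
    if ip_address in blacklist:
        blacklist[ip_address] = now

def purge_quiet_ips(now):
    # An IP that hasn't tried to comment in N days drops off the blacklist.
    cutoff = now - datetime.timedelta(days=BLACKLIST_EXPIRY_DAYS)
    for ip, last_attempt in list(blacklist.items()):
        if last_attempt < cutoff:
            del blacklist[ip]
```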

So basically, I'm suggesting the following as possible improvements. You should, of course, always monitor how they go to see whether they're working or not.

  • CAPTCHA.
  • Automatic blacklist updates based on user behaviour.
  • Limiting view count increases from identical IP addresses.
  • Limiting view count increases to a certain rate.

No scheme you choose will be perfect (e.g., our one-month rule) but, as long as all posts follow the same rule set, you still get a good comparative value. As you said, accuracy need only be an approximation.

paxdiablo
A: 

Suggestions:

  1. Move the hit count logic from a user control into a base Page class.
  2. Redesign the exclusions list to be dynamically updatable (i.e. store it in a database or even in an XML file).
  3. Record all hits. On a regular interval, have a cron job run through the new hits and determine whether they should be included or excluded (roughly as in the sketch below). If you do the exclusion check for each hit, each user has to wait for the matching logic to take place.
  4. Come up with an algorithm to automatically detect spammers/bots and add them to your blacklist, and/or subscribe to a third-party blacklist.
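
A minimal sketch of point 3, assuming a raw_hits table with a processed flag and a post_views counter table (all of the names here are made up for illustration):

```python
import sqlite3

def process_new_hits(db_path, is_excluded):
    """Run from a cron job: classify the raw hits recorded since the last run
    and roll the accepted ones into the per-post view counts."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, post_id, ip_address, user_agent FROM raw_hits WHERE processed = 0"
    ).fetchall()
    for hit_id, post_id, ip_address, user_agent in rows:
        if not is_excluded(ip_address, user_agent):
            conn.execute(
                "UPDATE post_views SET view_count = view_count + 1 WHERE post_id = ?",
                (post_id,),
            )
        conn.execute("UPDATE raw_hits SET processed = 1 WHERE id = ?", (hit_id,))
    conn.commit()
    conn.close()
```
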
davogones