views: 289

answers: 3

We're interested in logging and computing the number of times an item comes up in search or on a list page. With 50k unique visitors a day, we expect to produce 3-4 million 'impressions' per day. That isn't a terribly high volume, but it's one we'd like to architect well.

We don't need to read this data in real time, but would like to be able to generate daily totals and analyze trends, etc. Similar to a business analytics tool.

We're planning to do this with an Ajax post after the page is rendered; this lets us count results even when those results are served from cache. We can do it in a single post per page, sending a comma-delimited list of ids and their positions on the page.

I am hoping there is some sort of design pattern/gem/blog post about this that would help me avoid the common first-timer mistakes. I also don't have much experience logging or reading logs.

My current strategy: write events to a log file, then run a background job at the end of the day to tally the results and load them back into MySQL.

+1  A: 

Depending on the action required to list items, you might be able to do it in the controller and save yourself a round trip. You can do it with an after_filter to make the addition unobtrusive.

This only works if all the actions that list items you want to log require parameters, because page caching ignores GET requests with parameters.

Assuming you only want to log search data on the search action.

class ItemsController < ApplicationController
  after_filter :log_searches, :only => :search

  def log_searches
    # @items is assumed to be populated by the search action
    @items.each do |item|
      # write to log here
    end
  end

  ...
  # rest of controller remains unchanged
  ...
end

Otherwise you're right on track with the Ajax approach and an onload remote function.

As for processing, you could use a rake task run by a cron job to collect statistics, and possibly update items with a popularity rating.

Either way you will want to read up on Ruby's Logger class. Learning about cron jobs and rake tasks wouldn't hurt either.
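To make that pairing concrete, here is a rough sketch (the task name, body, and schedule are placeholders, not a tested setup):

# lib/tasks/impressions.rake (hypothetical location)
desc "Tally yesterday's impressions"
task :tally_impressions => :environment do
  # parse the day's log file and write totals back to MySQL here
end

# crontab entry: run five minutes after midnight, every day
# 5 0 * * * cd /path/to/app && RAILS_ENV=production rake tally_impressions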

EmFi
This approach is almost always the simplest thing that works (read: best, if you can get away with it). But with page caching, or Rack::Cache (which is awesome btw), or Varnish, or a CDN, or any other cache external to ActionController used to achieve über performance, the after_filters do not fire (it's hardly caching if it has to hit the Rails stack!).
cwninja
Yeah, I haven't done much with caching. I just knew that basic caching only kicks in when serving pages without parameters, and mentioned that's the only place where the after_filter approach would fit.
EmFi
I am hoping to also catch when the user hits 'back'. Browser caching will stop a normal request from reaching the server, but Ajax requests are still sent when the user hits back.
Swards
If that's the case, this is not the right answer.
EmFi
+3  A: 

Ok, I have three approaches for you:

1) Queues

In your Ajax handler, write the simplest method possible (use a Rack middleware or Rails Metal) to push the query params onto a queue. Then poll the queue and gather the messages.

Queue pushes from a Rack middleware are blindingly fast. We use this on a very high-traffic site for logging similar data.

An example Rack middleware is below (extracted from our app; it can handle a request in under 2 ms or so):

class TrackingMiddleware
  CACHE_BUSTER = {
    "Cache-Control" => "no-cache, no-store, max-age=0, must-revalidate",
    "Pragma"        => "no-cache",
    "Expires"       => "Fri, 29 Aug 1997 02:14:00 EST"
  }

  IMAGE_RESPONSE_HEADERS = CACHE_BUSTER.merge("Content-Type" => "image/gif").freeze
  IMAGE_RESPONSE_BODY = [File.read(Rails.root + "public/images/tracker.gif")].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"] =~ %r{^/track\.gif}
      # Push the timestamped query params onto the queue and answer with a
      # cache-busted 1x1 gif, without ever touching the Rails stack.
      request = Rack::Request.new(env)
      YOUR_QUEUE.push([Time.now, request.GET.symbolize_keys])
      # Rack responses are [status, headers, body]
      [200, IMAGE_RESPONSE_HEADERS, IMAGE_RESPONSE_BODY]
    else
      @app.call(env)
    end
  end
end

For the queue I'd recommend Starling; I've had nothing but good times with it.

On the parsing end, I would use the super-poller toolkit, but then I would say that: I wrote it.
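To illustrate, here is a minimal sketch of what YOUR_QUEUE might look like backed by Starling. Starling speaks the memcached protocol, so the memcache-client gem works as the client; the host, port, queue name, and StarlingQueue wrapper are all assumptions for the example:

require 'memcache'

STARLING = MemCache.new("localhost:22122") # Starling's default port
QUEUE = "impressions" # hypothetical queue name

# Tiny wrapper so the middleware can call YOUR_QUEUE.push(message).
class StarlingQueue
  def push(message)
    STARLING.set(QUEUE, message) # set enqueues on Starling
  end
end
YOUR_QUEUE = StarlingQueue.new

# Poller side: get dequeues one message, returning nil when the queue is empty.
loop do
  if (message = STARLING.get(QUEUE))
    timestamp, params = message
    # tally the impression here
  else
    sleep 1
  end
end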

2) Logs

Pass all the params along as query params on a request for a static file (/1x1.gif?foo=1&bar=2&baz=3). This will not hit the Rails stack and will be blindingly fast; the request still lands in your web server's access log, params and all.

When you need the data, just parse the log files!

This is the best-scaling home-brew approach.
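For example, a minimal sketch of the parsing side, assuming an Apache/nginx-style access log and an ids=1,2,3 query format (the log path and param names here are assumptions):

require 'cgi'

counts = Hash.new(0)
File.foreach("/var/log/nginx/access.log") do |line|
  # Request lines look like: "GET /1x1.gif?ids=1,2,3 HTTP/1.1"
  next unless line =~ %r{"GET /1x1\.gif\?([^ "]+)}
  params = CGI.parse($1) # => {"ids" => ["1,2,3"]}
  # tally the item ids, assuming they arrive comma-delimited under ids=
  params["ids"].first.to_s.split(',').each { |id| counts[id] += 1 }
end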

3) Google Analytics

Why handle the load when Google will do it for you? You would be surprised at how good Google Analytics is; before you home-brew anything, check it out!

This will scale infinitely, because Google buys servers faster than you do.


I could rant on this for ages, but I have to go now. Hope this helps!

cwninja
We want to report the results to the content owners. We avoided GA in this case only because we wanted to manipulate the data locally.
Swards
Fair enough, similar with us. Still, if the data is simple enough, Rugalytics (http://github.com/robmckinnon/rugalytics) may be of use.
cwninja
A: 

This is what I ultimately did. It was enough for our needs for now, and with some simple benchmarking I feel OK about it. We'll watch how it performs in production before we expose the results to our customers.

The components:

class EventsController < ApplicationController
  def create
    # A new Logger per request keeps the file name current, so the log file
    # rolls over at midnight (see the benchmarking note below).
    logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{Date.today}.log")
    logger.info "#{DateTime.now.strftime} #{params[:ids]}" unless params[:ids].blank?
    render :nothing => true
  end
end

This is called via an Ajax call in the site layout...

<% javascript_tag do %>
  var list = '';
  $$('div.item').each(function(item) { list += item.id + ','; });
  <%= remote_function(:url => { :controller => :events, :action => :create}, :with => "'ids=' + list" ) %>
<% end %>

Then I made a rake task to import these rows of comma-delimited ids into the db. It runs the following day:

desc "Calculate impressions"
task :count_impressions => :environment do
  date = ENV['DATE'] || (Date.today - 1).to_s # defaults to yesterday (yyyy-mm-dd)
  file = File.new("log/impressions/#{date}.log", "r")
  item_impressions = {}
  while (line = file.gets)
    ids_string = line.split(' ')[1]
    next unless ids_string
    ids = ids_string.split(',')
    ids.each {|i| item_impressions[i] ||= 0; item_impressions[i] += 1 }
  end
  item_impressions.keys.each do |id|
    ActiveRecord::Base.connection.execute "insert into item_stats(item_id, impression_count, collected_on) values('#{id}',#{item_impressions[id]},'#{date}')", 'Insert Item Stats'
  end

  file.close
end
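Run as rake count_impressions it processes yesterday's log; pass DATE=yyyy-mm-dd to reprocess a specific day.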

One thing to note: the logger variable is declared in the controller action, not in environment.rb as you would normally do with a logger. I benchmarked this: 10,000 writes took about 20 seconds, averaging about 2 milliseconds per write. With the logger set up once in environment.rb, it took about 14 seconds. We made this trade-off so we could determine the file name dynamically; it's an easy way to switch files at midnight.
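If the per-request Logger.new ever became a bottleneck, one possible middle ground (just a sketch; the ImpressionLog module name is invented here) would be to cache one Logger per day and rebuild it when the date changes:

module ImpressionLog
  # Reuse a single Logger within a day; rebuilding it when Date.today changes
  # keeps the midnight rollover without paying Logger.new on every request.
  def self.logger
    today = Date.today
    if @date != today
      @date = today
      @logger = Logger.new("#{RAILS_ROOT}/log/impressions/#{today}.log")
    end
    @logger
  end
end

The controller action would then call ImpressionLog.logger.info(...) instead of building its own Logger. (Not thread-safe as written, but fine under the usual one-request-per-process Rails deployment.)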

Our main concern at this point: we have no idea how many different items will be counted per day, i.e. we don't know how long the tail is. This determines how many rows are added to the db each day. We expect we'll need to limit how far back we keep daily reports, and will roll up results even further at that point.

Swards