I am planning on using delayed_job to run some background analytics. In my initial test I saw a tremendous amount of memory usage, so I basically created a very simple task that runs every 2 minutes just to observe how much memory is being used.

The task is very simple and the analytics_eligible? method always returns false, given where the data is now, so basically none of the heavy-hitting code is being called. I have around 200 Posts in my sample data in development. Post has_one analytics_facet.

Regardless of the internal logic/business here, the only thing this task is doing is calling the analytics_eligible? method 200 times every 2 minutes. In a matter of 4 hours my physical memory usage is at 110MB and virtual memory at 200MB. Just for doing something this simple! I can't even begin to imagine how much memory this will eat if it's doing real analytics on 10,000 Posts with real production data! Granted, it may not run every 2 minutes, more like every 30, but I still don't think it will fly.

This is running Ruby 1.8.7 and Rails 2.3.5 on Ubuntu 10.x 64-bit. My laptop has 4GB of memory and a dual-core CPU.

Is rails really this bad or am I doing something wrong?

Delayed::Worker.logger.info('RAM USAGE Job Start: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)

Post.not_expired.each do |p|
  if p.analytics_eligible?
    # this method is never called
    Post.find_for_analytics_update(p.id).update_analytics
  end
end

Delayed::Worker.logger.info('RAM USAGE Job End: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)

Delayed::Job.enqueue PeriodicAnalyticsJob.new(), 0, 2.minutes.from_now

Post Model

def analytics_eligible?
  vf = self.analytics_facet
  if self.total_ratings > 0 && vf.nil?
    return true
  elsif !vf.nil? && vf.last_update_tv > 0
    ratio = self.total_ratings / vf.last_update_tv
    if (ratio - 1) >= Constants::FACET_UPDATE_ELIGIBILITY_DELTA
      return true
    end
  end
  return false
end
A: 

It is a fact that Ruby consumes (and leaks) memory. I don't know if you can do much about it, but I recommend that you at least take a look at Ruby Enterprise Edition (REE).

REE is an open-source fork of MRI which promises "33% less memory usage", among other good things. I have used REE with Passenger in production for almost two years now and I'm very pleased.

Petrus Repo
Well, I have liked certain things about RoR so far, but if it's this bad, it's really disappointing. I am trying REE now, thanks!
badnaam
REE's promise of "33% less memory usage" is due to process forking after the Rails framework itself has been loaded. In a single process, it won't have a significant effect.
Chris Heald
+1  A: 

If you are experiencing memory issues, one solution is to use another background-processing library, such as Resque. It is the background processor used by GitHub.

Thanks to Resque's parent / child architecture, jobs that use too much memory release that memory upon completion. No unwanted growth

How?

On certain platforms, when a Resque worker reserves a job it immediately forks a child process. The child processes the job then exits. When the child has exited successfully, the worker reserves another job and repeats the process.

You can find more technical details in the README.
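
The pattern itself is easy to sketch in Ruby (this is only an illustration of the idea, not Resque's actual source; reserve_next_job is a made-up placeholder):

loop do
  job = reserve_next_job           # placeholder: pull the next job off the queue
  break unless job

  if child_pid = Process.fork
    Process.wait(child_pid)        # parent: wait for the child to finish
  else
    job.perform                    # child: do the work...
    exit!                          # ...then exit, handing all its memory back to the OS
  end
end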

Vlad Zloteanu
Thanks. On what platforms does this parent/child architecture work?
badnaam
I know it works on Linux and OS X. It probably does not work on Windows, since fork isn't available there.
wuputah
+5  A: 

ActiveRecord is fairly memory-hungry - be very careful when doing selects, and be mindful that Ruby implicitly returns the last expression of a method or block as its value, which potentially means you're passing back an array of records that gets stored somewhere and thus is never eligible for GC.
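
A contrived example of how that bites you (the method name is hypothetical):

def warm_up_analytics
  Post.not_expired.each do |p|
    p.analytics_eligible?
  end
  # The each call returns its receiver -- the full array of loaded Post
  # records -- and, being the last expression, it is also this method's
  # return value, so a caller that stores the result pins every record in memory.
end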

Additionally, when you call "Post.not_expired.each", you're loading all your not_expired posts into RAM. A better solution is find_in_batches, which specifically only loads X records into RAM at a time.

Fixing it could be something as simple as:

def do_analytics
  # find_in_batches loads only 100 posts at a time instead of the whole
  # not_expired set, so each batch can be garbage collected as we go
  Post.not_expired.find_in_batches(:batch_size => 100) do |batch|
    batch.each do |post|
      if post.analytics_eligible?
        # this method is never called
        Post.find_for_analytics_update(post.id).update_analytics
      end
    end
  end
  # force a GC sweep once the batches are done
  GC.start
end

do_analytics

A few things are happening here. First, the whole thing is scoped in a method to prevent variable collisions and to avoid holding onto references from the block iterators. Next, find_in_batches retrieves batch_size records from the DB at a time, and as long as you aren't building references to them, they become eligible for garbage collection after each iteration runs, which keeps total memory usage down. Finally, we call GC.start at the end of the method; this forces the GC to start a sweep (which you wouldn't want to do in a realtime app, but since this is a background job, it's okay if it takes an extra 300ms to run). It also has the very distinct benefit of returning nil, which means the result of the method is nil, so we can't accidentally hang on to AR instances returned from the finder.

Using something like this should ensure that you don't end up with leaked AR objects, and should vastly improve both performance and memory usage. You'll want to make sure you aren't leaking elsewhere in your app (class variables, globals, and class references are the worst offenders), but I suspect that this'll solve your problem.
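
As an illustration of the "worst offenders" mentioned above (a made-up example):

class Post < ActiveRecord::Base
  # A class-level cache like this lives for the entire life of the worker
  # process, so every record pushed into it is pinned in memory forever.
  @@recently_analyzed = []

  def self.remember(post)
    @@recently_analyzed << post   # these Post objects can never be GC'd
  end
end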

All that said, in my opinion this is a cron problem (periodic recurring work) rather than a DJ problem. You can have a one-shot analytics parser that runs your analytics every X minutes via script/runner, invoked by cron, which very neatly cleans up any potential memory leaks or misuse per run (since the whole process terminates at the end).
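
A sketch of that cron-driven approach (the schedule, paths, and file name are assumptions):

# lib/analytics_runner.rb -- the same batched loop as above, but run as a
# one-shot process from cron via script/runner, e.g.:
#
#   */30 * * * * cd /path/to/myapp && script/runner -e production lib/analytics_runner.rb
#
# The process exits when the run finishes, so any memory it used goes
# straight back to the OS instead of accumulating in a long-lived worker.
Post.not_expired.find_in_batches(:batch_size => 100) do |batch|
  batch.each do |post|
    Post.find_for_analytics_update(post.id).update_analytics if post.analytics_eligible?
  end
end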

Chris Heald
The only thing I would add to this excellent answer is a note that any Rails process will consume quite a lot of memory - your 110mb is not uncommon. This isn't indicative of a memory leak in your code, or how much processing you've done. Processing 1000 records or 10M records will use the same amount of memory if you've done things properly (the way Chris has explained).
wuputah
+2  A: 

Loading data in batches and using the garbage collector aggressively as Chris Heald has suggested is going to give you some really big gains, but another area people often overlook is what frameworks they're loading in.

Loading a default Rails stack will give you ActionController, ActionMailer, ActiveRecord and ActiveResource all together. If you're building a web application you may not be using all of these, but you're probably using most.

When you're building a background job, you can avoid loading things you don't need by creating a custom environment for that:

# config/environments/production_bg.rb

config.frameworks -= [ :action_controller, :active_resource, :action_mailer ]

# (Also include config directives from production.rb that apply)

Each of these frameworks will just be sitting around waiting for an email that will never be sent, or a controller that will never be called. There's simply no point in loading them. Adjust your database.yml file, set your background job to run in the production_bg environment, and you'll have a much cleaner slate to start with.

Another thing you can do is use ActiveRecord directly without loading Rails at all. That might be all you need for this particular operation. I've also found that using a lightweight ORM like Sequel makes your background job very lean if you're doing mostly SQL calls to reorganize records or delete old data. If you need access to your models and their methods, though, you will need to use ActiveRecord. Sometimes it's worth re-implementing simple logic in pure SQL for reasons of performance and efficiency.
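
A minimal sketch of that standalone approach (the paths and the database.yml key are assumptions about a typical Rails 2.3 layout; run it from the app root):

# analytics_standalone.rb -- connects with ActiveRecord but never loads Rails
require 'rubygems'
require 'active_record'
require 'yaml'

ActiveRecord::Base.establish_connection(YAML.load_file('config/database.yml')['production'])

# Load only the model files this job needs; they must not depend on Rails itself.
require 'app/models/analytics_facet'
require 'app/models/post'

Post.not_expired.find_in_batches(:batch_size => 100) do |batch|
  batch.each { |post| post.update_analytics if post.analytics_eligible? }
end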

When measuring memory usage, the only number to be concerned with is "real" memory. The virtual amount includes shared libraries, and their cost is spread amongst every process using them even though it is counted in full for each one.
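
If you want to log just that "real" number, the resident set size is easy to pull on Linux or OS X (a small sketch, in the same spirit as the pmap logging in the question):

# Resident set size (real memory) of the current process, reported in KB by ps
rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
Delayed::Worker.logger.info("RSS: #{rss_kb / 1024} MB")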

In the end, if running something important takes 100MB of memory but you can get it down to 10MB with three weeks of work, I don't see why you'd bother. 90MB of memory costs at most about $60/year on a managed provider, which is usually far less expensive than your time.

Ruby on Rails embraces the philosophy of being more concerned with your productivity and your time than with memory usage. If you want to trim it back and put it on a diet, you can, but it will take a bit of effort.

tadman
Good points! Thank you very much!
badnaam