Good day all,

We are doing a data migration from one system to a Rails application. Some of the tables we are working with are very large and moving them over 1 record at a time using ActiveRecord takes far too long. Therefore we resorted to copying the table over in SQL and validating after the fact.

The one-by-one validation check is still slow, but the speed increase from the SQL copy more than makes up for it. However, that hasn't quenched our thirst to see if we can get the validation check to happen more quickly. We attempted to split the table into chunks and pass each chunk to a Thread, but it actually ran slower.

So the question is: given a large table that we are currently iterating row by row to do the validation, like so

Model.find_each do |m|
  logger.info "M #{m.id} is not valid" unless m.valid?
end

Anyone have any recommendations on how to speed this up?

Thanks

peer

EDIT: I should say we're not asking about this code specifically. We are looking for recommendations on how we can run this concurrently, giving each process a chunk of the data, without needing a machine per process

+2  A: 

find_each uses find_in_batches under the hood, which fetches 1000 rows at a time by default. You could try playing with the batch_size option. The way you have it above seems pretty optimal; it's fetching from the database in batches and iterating over each one, which you need to do. I would monitor your RAM to see if the batch size is optimal, and you could also try using Ruby 1.9.1 to speed things up if you're currently on 1.8.*.
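
For example, something like this (the batch size of 5000 is just a number to experiment with, not a recommendation):

# Same loop as above, but fetching 5000 rows per query instead of the default 1000.
Model.find_each(:batch_size => 5000) do |m|
  logger.info "M #{m.id} is not valid" unless m.valid?
end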

http://api.rubyonrails.org/classes/ActiveRecord/Batches/ClassMethods.html#M001846

zgchurch
A: 

I like zgchurch's response as a starting point.

What I would add is that threading is definitely not going to help here, especially because Ruby uses green threads (at least in 1.8.x), so there is no opportunity to utilize multiple processors anyway. Even if that weren't the case, it's very likely that this operation is IO-heavy enough that you would get IO contention eating into any multi-core benefits.

Now if you really want to speed this up, you should take a look at the actual validations and figure out a more efficient way to achieve them. Just loading all the rows and instantiating ActiveRecord objects will tend to dominate the runtime in most validation situations. You may be spending 90-99.99% of your time just loading and unloading the data from memory.

In these types of situations I tend to go towards raw SQL. You can do things like validating foreign key integrity tens of thousands of times faster than with ActiveRecord validation callbacks. Of course the viability of this approach depends on the actual ins and outs of your validations. Even if you need something a little richer than SQL to define validity, you could still probably get a 10-100x speed increase just by loading the minimal data with a thinner SQL interface and examining the data directly. If that's the case, Perl or Python might be a better choice for raw performance.
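
For example, a foreign-key check can be done in a single set-based query. Here's a sketch against a hypothetical schema (the orders/customers tables and customer_id column are made up for illustration):

# Hypothetical schema: flag orders whose customer_id matches no customer row.
# One set-based query replaces per-row validation callbacks.
orphan_ids = ActiveRecord::Base.connection.select_values(<<-SQL)
  SELECT o.id
  FROM orders o
  LEFT JOIN customers c ON c.id = o.customer_id
  WHERE o.customer_id IS NOT NULL
    AND c.id IS NULL
SQL
orphan_ids.each { |id| logger.info "Order #{id} has a dangling customer_id" }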

dasil003
Good points. I have been reluctant to try to duplicate the validations in SQL, but you are probably right that it would provide the best performance.
Peer Allan