In my Ruby on Rails application I need to execute 50 background jobs in parallel. Each job creates a TCP connection to a different server, fetches some data and updates an ActiveRecord object.

I know of different solutions to perform this task, but none of them run the jobs in parallel. For example, delayed_job (DJ) would be a great solution if only it could execute all jobs in parallel.

Any ideas? Thanks.

+5  A: 

It is actually possible to run multiple delayed_job workers.

From http://github.com/collectiveidea/delayed_job:

# Runs two workers in separate processes.
$ RAILS_ENV=production script/delayed_job -n 2 start
$ RAILS_ENV=production script/delayed_job stop

So, in theory, you could just execute:

$ RAILS_ENV=production script/delayed_job -n 50 start

This will spawn 50 processes; whether that is advisable depends on the resources of the system you're running this on.


An alternative would be to use threads: simply spawn a new thread for each of your jobs.

One thing to bear in mind with this method is that ActiveRecord is not thread-safe by default. You can make it thread-safe using the following setting:

ActiveRecord::Base.allow_concurrency = true
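
For illustration, a minimal sketch of the threaded approach (the servers collection, the fetch_data_from helper and the Widget model are hypothetical stand-ins for your own TCP fetch and ActiveRecord update):

threads = servers.map do |server|           # one thread per remote server
  Thread.new do
    data   = fetch_data_from(server)        # assumed helper doing the TCP work
    record = Widget.find_by_server(server)  # assumed ActiveRecord model/finder
    record.update_attribute(:data, data)
  end
end
threads.each(&:join)                        # wait for all the jobs to finish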
Olly
You can also run delayed_job workers on multiple machines. I doubt you'll get much benefit from running more workers on a single machine than you have CPU cores, but you could spread the load by running workers on several boxes. If you need to run 50 simultaneously, I think you're going to need to distribute the work.
Luke Francl
I'll get a benefit from running multiple workers on a single machine because most of the workers will be blocked on IO.
fjyaniez
+1  A: 

Some thoughts...

  • Just because you need to read 50 sites and naturally want some parallel work does not mean that you need 50 processes or threads. You need to balance the speedup against the overhead. How about having 10 or 20 processes that each read a few sites? (See the fork sketch after this list.)

  • Depending on which Ruby you are using, be careful about green threads: you may not get the parallelism you want.

  • You might want to structure it like a reverse, client-side inetd, and use connect_nonblock and IO.select to get the parallel connections you want by making all the servers respond in parallel. You don't really need parallel processing of the results; you just need to get in line at all the servers in parallel, because that is where the latency really is.
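
As a sketch of the first point, plain fork can split the 50 sites across, say, 10 worker processes (the sites list and the process_site helper are hypothetical placeholders; database connection handling after fork is omitted):

sites.each_slice(5) do |batch|                # 50 sites / 5 per process = 10 processes
  fork do
    batch.each { |site| process_site(site) }  # assumed per-site fetch and update
  end
end
Process.waitall                               # wait for all the child processes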

For the connect_nonblock/IO.select approach, something like this from the socket library... extend it for multiple outstanding connections, as sketched after the example below...

require 'socket'
include Socket::Constants
socket = Socket.new(AF_INET, SOCK_STREAM, 0)
sockaddr = Socket.sockaddr_in(80, 'www.google.com')
begin
  socket.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
  # The connect is in progress; wait until the socket becomes writable.
  IO.select(nil, [socket])
  begin
    socket.connect_nonblock(sockaddr) # raises EISCONN once the connection is up
  rescue Errno::EISCONN
  end
end
socket.write("GET / HTTP/1.0\r\n\r\n")
# Here you could insert IO.select. You may not need multiple threads OR multiple
# processes with this technique, but if you do, insert them here.
results = socket.read
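
As a rough sketch of the multiple-outstanding-connections extension (the hosts list is a placeholder and error handling is deliberately minimal):

require 'socket'
include Socket::Constants

hosts   = ['server1.example.com', 'server2.example.com']  # your 50 servers
pending = {}                                               # socket => sockaddr

# Start every connect without blocking.
hosts.each do |host|
  sock     = Socket.new(AF_INET, SOCK_STREAM, 0)
  sockaddr = Socket.sockaddr_in(80, host)
  begin
    sock.connect_nonblock(sockaddr)
  rescue Errno::EINPROGRESS
  end
  pending[sock] = sockaddr
end

# As each socket becomes writable (i.e. connected), send its request.
until pending.empty?
  _, writable = IO.select(nil, pending.keys)
  writable.each do |sock|
    begin
      sock.connect_nonblock(pending[sock]) # raises EISCONN once connected
    rescue Errno::EISCONN
    end
    sock.write("GET / HTTP/1.0\r\n\r\n")
    pending.delete(sock)
  end
end
# ...then read the responses, ideally multiplexed with IO.select as well.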
DigitalRoss
IO.select could be useful in this case; I'll give it a try. Thank you.
fjyaniez
A: 

Since you're working with Rails, I would advise you to use delayed_job for this rather than splitting off into threads or forks. The reason is that dealing with timeouts and the like while the browser is waiting can be a real pain. There are two approaches you can take with DJ.

The first is to spawn 50+ workers. Depending on your environment this may be a pretty memory-heavy solution, but it works great. Then, when you need to run your job, just make sure you create 50 unique jobs (see the sketch below). If there is too much memory bloat and you want to do things this way, make a separate environment that is stripped down specifically for your workers.
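
A rough sketch of that first approach (the FetchJob struct, the Server model and the fetch helper are hypothetical; delayed_job only requires that the job object respond to perform):

# Hypothetical job: fetch data for one server and update its record.
class FetchJob < Struct.new(:server_id)
  def perform
    server = Server.find(server_id)                      # assumed ActiveRecord model
    server.update_attribute(:data, fetch(server.host))   # assumed TCP fetch helper
  end
end

# Enqueue one job per server; the 50 workers pick them up in parallel.
Server.all.each { |server| Delayed::Job.enqueue(FetchJob.new(server.id)) }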

The second way is to create a single job that uses Curl::Multi to run your 50 concurrent TCP requests. You can find out more about it here: http://curl-multi.rubyforge.org/. That way, you could have one background worker running all of your TCP requests in parallel.
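
For illustration only, here is a minimal sketch of that single-process idea using the curb gem's Curl::Multi (an assumption on my part: curb is a different libcurl binding than the curl-multi gem linked above, so check the linked docs for that gem's exact API; the URLs are placeholders):

require 'curb'

urls = ['http://server1.example.com/', 'http://server2.example.com/']  # your 50 endpoints

multi = Curl::Multi.new
urls.each do |url|
  easy = Curl::Easy.new(url)
  easy.on_complete { |curl| puts "#{curl.url}: #{curl.body_str.size} bytes" }  # handle the response here
  multi.add(easy)
end
multi.perform  # runs all the transfers concurrently in this one process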

PatrickTulskie