I have been looking into Ruby's parallel/asynchronous processing capabilities and have read many articles and blog posts. I looked through EventMachine, Fibers, Revactor, Reia, and so on. Unfortunately, I wasn't able to find a simple, effective (and non-IO-blocking) solution for this very simple use case:

File.open('somelogfile.txt') do |file|
  while line = file.gets      # (R) Read from IO
    line = process_line(line) # (P) Process the line
    write_to_db(line)         # (W) Write the output to some IO (DB or file)
  end
end

As you can see, my little script performs three operations: read (R), process (P) and write (W). Let's assume - for simplicity - that each operation takes exactly 1 unit of time (e.g. 10 ms). For a 5-line file, the current code would therefore do something like this:

Time:       123456789012345 (15 units in total)
Operations: RPWRPWRPWRPWRPW

But, I would like it to do something like this:

Time:       1234567 (7 units in total)
Operations: RRRRR
             PPPPP
              WWWWW

Obviously, I could run three processes (reader, processor and writer), pass read lines from the reader into the processor queue, and then pass processed lines into the writer queue (all coordinated via e.g. RabbitMQ). But the use case is so simple that it just doesn't feel right.
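For reference, here is roughly what a hand-rolled, in-process version of that pipeline might look like using Ruby's built-in Thread and SizedQueue (process_line and write_to_db are the placeholders from above; note that on MRI the GIL still limits true CPU parallelism):

require 'thread'  # Queue / SizedQueue

lines_q     = SizedQueue.new(100)  # raw lines waiting to be processed
processed_q = SizedQueue.new(100)  # processed lines waiting to be written

reader = Thread.new do
  File.open('somelogfile.txt') do |file|
    while line = file.gets
      lines_q << line                  # (R)
    end
  end
  lines_q << :eof
end

processor = Thread.new do
  while (line = lines_q.pop) != :eof
    processed_q << process_line(line)  # (P)
  end
  processed_q << :eof
end

writer = Thread.new do
  while (line = processed_q.pop) != :eof
    write_to_db(line)                  # (W)
  end
end

[reader, processor, writer].each(&:join)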

Any clues on how this could be done (without switching from Ruby to Erlang, Clojure or Scala)?

A: 

Check out peach (http://peach.rubyforge.org/). Doing a parallel "each" couldn't be simpler. However, as the documentation says, you'll need to run under JRuby in order to use the JVM's native threading.
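If I remember the gem's API correctly, peach mixes peach/pmap into Array, so your processing step could look roughly like this (process_line and write_to_db being the placeholders from your question):

require 'peach'

lines = File.readlines('somelogfile.txt')
# pmap runs the block across a pool of threads (real parallelism under JRuby)
processed = lines.pmap { |line| process_line(line) }
processed.each { |line| write_to_db(line) }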

See Jorg Mittag's response to this SO question for a lot of detail on the multithreading capabilities of the various Ruby interpreters.

Mark Thomas
Hmm, peach isn't really what I am looking for. I don't want to run the RPW in parallel, I want to detach the 3 tasks from each other and run them asynchronously. Jorg Mittag's response gives a great introduction. I am well aware of the options on offer, but none of them seems to have an answer for my problem.
Dim
+1  A: 

If you need it to be truly parallel (from a single process), I believe you'll have to use JRuby to get true native threads and no GIL.

You could use something like DRb to distribute the processing across multiple processes / cores, but for your use case this is a bit much. Instead, you could try having multiple processes communicate using pipes:

$ cat somelogfile.txt | ruby ./proc-process | ruby ./proc-store

In this scenario each piece is its own process; the processes can run in parallel and communicate using STDIN / STDOUT. This is probably the easiest (and quickest) approach to your problem.

# proc-process
while line = $stdin.gets do
  # do cpu intensive stuff here
  $stdout.puts "data to be stored in DB"
  $stdout.flush # important: push each line downstream immediately instead of buffering
end

# proc-store
while line = $stdin.gets do
  write_to_db(line)
end
JEH
I thought that Ruby 1.9's GIL allows you to do CPU stuff in one thread while another thread does I/O - that is, it only prohibits two threads doing CPU stuff.
Andrew Grimm
Are you talking about Fibers? My limited understanding of Fibers is that, instead of threads that each get a share of CPU time, your code explicitly hands control off to the Fiber, which can handle the blocking IO operation and immediately return to the calling code. This reduces the amount of time you spend waiting, but I don't think it will allow you to span more than one CPU per process. I think the GIL means only one thread of execution can run at any point in time. http://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/
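To illustrate, a rough sketch of the handoff I mean (with puts standing in for real processing):

reader = Fiber.new do
  File.foreach('somelogfile.txt') do |line|
    Fiber.yield line   # hand each line back to the caller, then pause here
  end
  nil                  # block return value; resume returns nil and the loop ends
end

while line = reader.resume
  puts line.upcase     # "processing" happens in the caller; everything runs on one thread
end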
JEH
Using pipes is a good way to split the problem into 3 separate processes, but it is not asynchronous. It is in fact a "Ruby workaround", and therefore quite difficult to implement within the scope of a bigger application. The "problem" I have outlined above is a simple example of IO-driven processing. I am trying to understand what Ruby is capable of in this area and what it might be lacking.
Dim