views:

389

answers:

4

I'd like to slice and dice large datafiles, up to a gig, in a fairly quick and efficient manner. If I use something like Unix's cut, it's extremely fast, even in a Cygwin environment.

I've tried developing and benchmarking various Ruby scripts to process these files, and always end up with glacial results.

What would you do in Ruby to make this not so dog slow?

+2  A: 

This question reminds me of Tim Bray's Wide Finder project. The fastest way he could read an Apache logfile using Ruby and figure out which articles have been fetched the most was with this script:

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0..9].each do |key|
  puts "#{counts[key]}: #{key}"
end

It took this code 7½ seconds of CPU time (13½ seconds elapsed) to process a million and change records, a quarter-gig or so, on last year's 1.67 GHz PowerBook.

Mark Cidade
+1  A: 

I'm guessing that your Ruby implementations read the entire file before processing it. Unix's cut works by reading one byte at a time and immediately writing to the output stream. There is of course some buffering involved, but no more than a few KB.

My suggestion: try doing the processing in-place with as little paging or backtracking as possible.
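
For example, here's a minimal sketch of line-at-a-time streaming in Ruby; the filename and field layout are made up for illustration:

```ruby
# Tiny stand-in for the real gigabyte file; name and layout are hypothetical.
File.write("bigfile.csv", "a,b,c\nd,e,f\n")

first_fields = []
# IO.foreach streams line by line and never loads the whole file into memory.
IO.foreach("bigfile.csv") do |line|
  first_fields << line.split(",", 2)[0]  # keep only the first field
end

puts first_fields.inspect
```

Because each line is discarded as soon as it's processed, memory use stays flat no matter how large the input file is.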

Daniel Spiewak
A: 

I doubt the problem is that Ruby is reading the whole file into memory. Look at the memory and disk usage while running the command to verify.

I'd guess the main reason is that cut is written in C and does only one thing, so it has probably been compiled down to the metal. It's probably not doing much more than making system calls.

The Ruby version, however, is doing many things at once, and calling a method in Ruby is much slower than a C function call.
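
As a rough illustration of that overhead, here's a sketch using Benchmark from the standard library, comparing one call into C against a Ruby-level method call per character (exact numbers will vary by machine):

```ruby
require "benchmark"

line = "a,b,c,d,e," * 100  # 1,000-character sample line

# One call into C (String#count) per iteration...
builtin = Benchmark.realtime { 1_000.times { line.count(",") } }

# ...versus a Ruby block invocation for every character.
per_char = Benchmark.realtime do
  1_000.times do
    n = 0
    line.each_char { |ch| n += 1 if ch == "," }
  end
end

puts "String#count: #{builtin}s, each_char loop: #{per_char}s"
```

The per-character loop does the same work but pays the Ruby dispatch cost a thousand times per line, which adds up fast on a gig of input.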

Remember, old age and treachery beat youth and skill in Unix: http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/

Rory
+1  A: 

Why not combine them: use cut to do what it does best, and Ruby to provide the glue/value-add on top of cut's results? You can run shell commands by wrapping them in backticks, like this:

`cut -f1 somefile > foo.fil`  # -f1 is just an example field selection
# process each line of the output from cut
f = File.new("foo.fil")
f.each { |line|
  # ... work on each extracted line here
}
f.close
MikeJ
rather than writing to a temp file, you might do: pipe = IO.popen("cut ..."); pipe.each_line { |line| ... }
glenn jackman
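
Fleshing out that comment into a self-contained sketch (the sample data and the cut options are assumptions, not from the original answer):

```ruby
require "tempfile"

# Hypothetical sample data standing in for the large datafile.
sample = Tempfile.new("data")
sample.write("alpha,beta,gamma\none,two,three\n")
sample.close

fields = []
# IO.popen streams cut's stdout directly -- no intermediate temp file needed.
IO.popen(["cut", "-d", ",", "-f", "1,3", sample.path]) do |pipe|
  pipe.each_line { |line| fields << line.chomp }
end

puts fields.inspect
```

Passing an array to IO.popen avoids shell interpolation of the arguments, and the block form closes the pipe automatically when processing finishes.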