Hi,

Disclaimer: I'm not a programmer, never was, never learned algorithms, CS, etc. Just have to work with it.

My question is: I need to split a huge (over 4 GB) CSV file into smaller ones based on the first field (I then process them with `win32ole`). In awk it's rather easy:

awk -F ',' '{myfile=$1 ; print $0 >> (myfile".csv")}' KNAGYFILE.csv

But with ruby I did:

open('hugefile').each { |hline|
  accno = hline[0,12]                        # first field: a fixed-width 12-character key
  nline = hline[13,10000].gsub(/;/, ",")     # rest of the line, with ";" replaced by ","
  accfile = File.open("#{accno}.csv", "a")   # opens (and below closes) the file for every line
  accfile.puts nline
  accfile.close
}

Then I recognized that this is resource-inefficient: it opens and closes a file for every single line. I'm sure there's a better way to do it; could you explain how?

UPDATE: I just forgot to mention that the file is sorted on the first column. E.g. if this is hugefile:

012345678901,1,1,1,1,1,1
012345678901,1,2,1,1,1,1
012345678901,1,1,A,1,1,1
012345678901,1,1,1,1,A,A
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1

Then I need two new files, named 012345678901.csv and A12345678901.csv.

+1  A: 

This should get around the multi-open-write-close issue, although it might run into problems if the number of files becomes large; I can't say, as I've never had hundreds of files open for writing!

The `Hash.new` line is the important one: for each new key encountered, it opens a new file and stores it against that key in the hash. The last line closes all the files that were opened.

# One lazily-opened output file per key: the default block runs only the
# first time a given key is looked up.
files = Hash.new { |h, k| h[k] = File.open("#{k}.csv", 'w+') }
open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)   # close everything we opened
Mike Woodhouse
Interesting solution. I'll benchmark it at work as well. (And yes, I need to open some 2000 files.)
Zsolt Botykai
Yikes! 2000 files? I just thought: is the input sorted? If so, glenn's solution is probably optimal.
Mike Woodhouse
Mike's solution is more the "Ruby way". If keeping 2000 file handles open simultaneously causes performance problems, you can extend the `files` Hash to close some of them when it grows too large (see the sketch below). It's all unnecessary if the input is sorted, though.
samuil
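
A minimal sketch of what samuil suggests, not from the thread itself: `MAX_OPEN` is a made-up tuning constant, eviction is simple oldest-first (Ruby 1.9+ hashes iterate in insertion order), and evicted files are reopened in append mode so rows written before a close are kept.

MAX_OPEN = 256  # hypothetical cap on simultaneously open handles

files = Hash.new do |h, k|
  if h.size >= MAX_OPEN
    oldest_key, oldest_file = h.first   # hashes iterate in insertion order (Ruby 1.9+)
    oldest_file.close
    h.delete(oldest_key)
  end
  h[k] = File.open("#{k}.csv", "a")     # append mode: a reopened file keeps its earlier rows
end

open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)

This is FIFO rather than true LRU eviction, but if the input is even roughly grouped by key, reopens stay rare.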
+2  A: 

Your awk solution will have to open the file just as many times, so I would think you'd get the same resource usage.

You can keep the file open until $1 changes:

prev = nil
accfile = nil   # declared outside the block so it survives between lines
File.foreach('hugefile') do |hline|
  accno = hline[0,12]
  nline = hline[13,10000].gsub(/;/, ",")
  if prev != accno
    accfile.close if accfile        # close the previous key's file, if any
    accfile = File.open("#{accno}.csv", "a")
    prev = accno
  end
  accfile.puts nline
end
accfile.close if accfile            # close the last file
glenn jackman
Does awk reopen (and close) the file on every line it processes? Anyway, thanks for your answer; I'll test it at work, but yeah, it looks like what I was looking for.
Zsolt Botykai
No, I meant that when you `print $0 >> somefile`, that will open, append to, and close the file.
glenn jackman
I see, but my awk solution does it for every line IMO.
Zsolt Botykai
I was just reading the nawk man page (for a different problem) and found that output redirection only opens the file if it's not already open, and keeps it open until explicitly closed. So, +1 for awk.
glenn jackman
Glenn was the benchmark winner.
Zsolt Botykai