Hi,

Disclaimer: I'm not a programmer, never was, never learned algorithms, CS, etc. Just have to work with it.

My question is: I need to split a huge (over 4 GB) CSV file into smaller ones based on the first field (I then process them with `win32ole`). In awk it's rather easy:

awk -F ',' '{myfile=$1 ; print $0 >> (myfile".csv")}' KNAGYFILE.csv

But with ruby I did:

open('hugefile').each { |hline|
  accno = hline[0,12]                        # first field: a fixed-width 12-character key
  nline = hline[13,10000].gsub(/;/, ",")     # rest of the line, with ";" replaced by ","
  accfile = File.open("#{accno}.csv", "a")   # opens (and below closes) the file for every line
  accfile.puts nline
  accfile.close
}

Then I recognized that this is resource-inefficient: it opens and closes a file for every single line. I'm sure there's a better way to do it; could you explain how?

UPDATE: I just forgot to mention that the file is sorted on the first column. E.g. if this is hugefile:

012345678901,1,1,1,1,1,1
012345678901,1,2,1,1,1,1
012345678901,1,1,A,1,1,1
012345678901,1,1,1,1,A,A
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1

Then I need two new files, named 012345678901.csv and A12345678901.csv.

+1  A: 

This should get around the multi-open-write-close issue, although it might run into problems if the number of files becomes large; I can't say, as I've never had hundreds of files open for writing!

The `Hash.new` line is the important one: for each new key encountered, it opens a new file and stores it against that key in the hash. The last line closes all the files that were opened.

# One lazily-opened output file per key: the default block runs only the
# first time a given key is looked up.
files = Hash.new { |h, k| h[k] = File.open("#{k}.csv", 'w+') }
open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)   # close everything we opened
Mike Woodhouse
Interesting solution. I'll benchmark it at work as well. (And yes, I need to open some 2000 files.)
Zsolt Botykai
Yikes! 2000 files? I just thought: is the input sorted? If so, glenn's solution is probably optimal.
Mike Woodhouse
Mike's solution is more the "Ruby way". If keeping 2000 file handles open simultaneously causes performance problems, you can extend the `files` Hash to close some of them when it grows too large (see the sketch below). It's all unnecessary if the input is sorted, though.
samuil
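
A minimal sketch of what samuil suggests, not from the thread itself: `MAX_OPEN` is a made-up tuning constant, eviction is simple oldest-first (Ruby 1.9+ hashes iterate in insertion order), and evicted files are reopened in append mode so rows written before a close are kept.

MAX_OPEN = 256  # hypothetical cap on simultaneously open handles

files = Hash.new do |h, k|
  if h.size >= MAX_OPEN
    oldest_key, oldest_file = h.first   # hashes iterate in insertion order (Ruby 1.9+)
    oldest_file.close
    h.delete(oldest_key)
  end
  h[k] = File.open("#{k}.csv", "a")     # append mode: a reopened file keeps its earlier rows
end

open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)

This is FIFO rather than true LRU eviction, but if the input is even roughly grouped by key, reopens stay rare.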
+2  A: 

Your awk solution will have to open the file just as many times, so I would think you'd get the same resource usage.

You can keep the file open until $1 changes:

prev = nil
accfile = nil   # declared outside the block so it survives between lines
File.foreach('hugefile') do |hline|
  accno = hline[0,12]
  nline = hline[13,10000].gsub(/;/, ",")
  if prev != accno
    accfile.close if accfile        # close the previous key's file, if any
    accfile = File.open("#{accno}.csv", "a")
    prev = accno
  end
  accfile.puts nline
end
accfile.close if accfile            # close the last file
glenn jackman
Does awk reopen (and close) the file on every line it processes? Anyway, thanks for your answer; I'll test it at work, but yeah, it looks like what I was looking for.
Zsolt Botykai
No, I meant that when you `print $0 >> somefile`, that will open, append to, and close the file.
glenn jackman
I see, but my awk solution does it for every line IMO.
Zsolt Botykai
I was just reading the nawk man page (for a different problem) and found that output redirection only opens the file if it's not already open, and keeps it open until explicitly closed. So, +1 for awk.
glenn jackman
Glenn was the benchmark winner.
Zsolt Botykai