tags:
views: 104
answers: 5

I have a CSV file whose first row contains the variable names and whose remaining rows contain the data. What's a good way in Python to break it up into one file per variable? Is such a solution going to be robust, e.g. if the input file is 100GB in size? I am trying a divide-and-conquer strategy but am new to Python. Thanks in advance for your help!

The input file looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 files (or however many variables there are), var1.csv, var2.csv and var3.csv, so that the files resemble:

File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay
...

+1  A: 

Open n output files and one input file, and read the input a line at a time. Chop each line up and write the n pieces to the n files. You only ever hold one line in memory at a time (and I presume a single line is not 100GB?).
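
A minimal sketch of that approach, assuming a simple CSV with no quoted or embedded commas (input.csv is just a placeholder name):

with open('input.csv') as inf:
  # first line holds the variable names
  names = inf.readline().rstrip('\r\n').split(',')
  outs = [open(name + '.csv', 'w') for name in names]
  # write each header into its own file
  for out, name in zip(outs, names):
    out.write(name + '\n')
  # stream the rest of the input one line at a time
  for line in inf:
    for out, field in zip(outs, line.rstrip('\r\n').split(',')):
      out.write(field + '\n')
  for out in outs:
    out.close()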

John Smith
+2  A: 

As long as the number of columns isn't absurdly huge (larger than the number of files you can have open at once on your platform), the number of rows, and thus the total size, is no big deal (as long, of course, as you have ample free space on disk ;-) since you'll be processing just one row at a time. I suggest the following code:

import csv

def splitit(inputfilename):
  with open(inputfilename, 'rb') as inf:
    inrd = csv.reader(inf)
    # first row: the variable names
    names = next(inrd)
    # one output file and one csv writer per column
    outfiles = [open(n + '.csv', 'wb') for n in names]
    ouwr = [csv.writer(w) for w in outfiles]
    for w, n in zip(ouwr, names):
      w.writerow([n])
    # stream the data rows, dispatching each field to its column's file
    for row in inrd:
      for w, r in zip(ouwr, row):
        w.writerow([r])
    for o in outfiles: o.close()
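
For example, with the sample above saved as, say, data.csv, calling splitit('data.csv') should produce var1.csv, var2.csv and var3.csv in the form the question asks for. (The file modes above are Python 2 style; on Python 3 you'd open the files in text mode with newline='' instead.)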
Alex Martelli
Minor nitpicks: I suppose you meant w.writerow instead of ouwr.writerow (w being the csv.writer instance in the list ouwr). Also, the nested loop's zip(ouwr, r) should be zip(ouwr, row), and "for o in outfile" should be "for o in outfiles". With those changes, it works.
bhangm
I had to play around with the code a little bit to get it working. Let me test it on a really large dataset, then I will give you the tick you deserve! Thanks
xiaodai
@bhangm, thanks for spotting the issues -- gonna edit to fix them, and, +1!-)
Alex Martelli
+1  A: 

If Python is not a must:

awk -F"," 'NR==1{for(i=1;i<=NF;i++)a[i]=$i}NR>1{for(i=1;i<=NF;i++){print $i>a[i]".txt"}}' file
ghostdog74
Would awk be faster than Python?
xiaodai
Yes, most of the time.
ghostdog74
I am a complete noob. What's a good awk implementation on Windows?
xiaodai
Go here: gnuwin32.sourceforge.net/packages.html and look for gawk. There are other *nix tools there as well, especially coreutils.
ghostdog74
A: 

Try this tool, which lets you query and transform CSV files with SQL:

http://ondra.zizka.cz/stranky/programovani/ruzne/querying-transforming-csv-using-sql.texy

crunch input.csv output.csv "SELECT AVG(duration) AS durAvg FROM (SELECT * FROM indata ORDER BY duration LIMIT 2 OFFSET 6)"
Ondra Žižka
+1  A: 

If your file is 100GB, then disk I/O will be your bottleneck. Consider using the gzip module for both reading (a pre-compressed file) and writing, to speed things up drastically.
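
For example, a rough Python 3 sketch of that idea, built on the csv-based answer above (the input.csv.gz and per-column .csv.gz names are just placeholders, and whether it actually helps depends on how compressible the data is versus the extra CPU cost):

import csv
import gzip

def splitit_gz(inputfilename):   # e.g. 'input.csv.gz'
  # read the compressed input in text mode so csv.reader can consume it
  with gzip.open(inputfilename, 'rt', newline='') as inf:
    inrd = csv.reader(inf)
    names = next(inrd)
    # one compressed output file and writer per column
    outfiles = [gzip.open(n + '.csv.gz', 'wt', newline='') for n in names]
    writers = [csv.writer(o) for o in outfiles]
    for w, n in zip(writers, names):
      w.writerow([n])
    # stream the data rows, one field per column file
    for row in inrd:
      for w, field in zip(writers, row):
        w.writerow([field])
    for o in outfiles:
      o.close()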

Ztyx
Thanks for that. Useful! I am doing it on an SSD and it's still slow. Might give gzip a crack sometime.
xiaodai