tags:
views: 104
answers: 5

I have a CSV file whose first row contains the variable names and whose remaining rows contain the data. What's a good way in Python to break it up into one file per variable? Is such a solution going to be robust, e.g. if the input file is 100GB in size? I am trying a divide-and-conquer strategy but am new to Python. Thanks in advance for your help!

The input file looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 files (or however many variables there are), var1.csv, var2.csv and var3.csv, so that the files resemble:

File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay
...

+1  A: 

Open n output files and one input file, and read the input a line at a time. Chop each line up and write the n pieces to the n files. You only ever hold one line in memory at a time (and I presume a single line is not 100GB?).
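
A minimal sketch of that approach, assuming a simple CSV with no quoted or embedded commas (input.csv is just a placeholder name):

with open('input.csv') as inf:
  # first line holds the variable names
  names = inf.readline().rstrip('\r\n').split(',')
  outs = [open(name + '.csv', 'w') for name in names]
  # write each header into its own file
  for out, name in zip(outs, names):
    out.write(name + '\n')
  # stream the rest of the input one line at a time
  for line in inf:
    for out, field in zip(outs, line.rstrip('\r\n').split(',')):
      out.write(field + '\n')
  for out in outs:
    out.close()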

John Smith
+2  A: 

As long as the number of columns isn't absurdly huge (larger than the number of files you can have open at once on your platform), the number of rows, and thus the total size, is no big deal (as long, of course, as you have ample free space on disk ;-) since you'll be processing just one row at a time. I suggest the following code:

import csv

def splitit(inputfilename):
  with open(inputfilename, 'rb') as inf:
    inrd = csv.reader(inf)
    # first row: the variable names
    names = next(inrd)
    # one output file and one csv writer per column
    outfiles = [open(n + '.csv', 'wb') for n in names]
    ouwr = [csv.writer(w) for w in outfiles]
    for w, n in zip(ouwr, names):
      w.writerow([n])
    # stream the data rows, dispatching each field to its column's file
    for row in inrd:
      for w, r in zip(ouwr, row):
        w.writerow([r])
    for o in outfiles: o.close()
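
For example, with the sample above saved as, say, data.csv, calling splitit('data.csv') should produce var1.csv, var2.csv and var3.csv in the form the question asks for. (The file modes above are Python 2 style; on Python 3 you'd open the files in text mode with newline='' instead.)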
Alex Martelli
Minor nitpicks: I suppose you meant w.writerow instead of ouwr.writerow (w being the csv.writer instance in the list ouwr). Also, the nested loop's zip(ouwr, r) should be zip(ouwr, row), and "for o in outfile" should be "for o in outfiles". With those changes, it works.
bhangm
I had to play around with the code a little bit to get it working. Let me test it on a really large dataset, then I will give you the tick you deserve! Thanks
xiaodai
@bhangm, thanks for spotting the issues -- gonna edit to fix them, and, +1!-)
Alex Martelli
+1  A: 

If Python is not a must:

awk -F"," 'NR==1{for(i=1;i<=NF;i++)a[i]=$i}NR>1{for(i=1;i<=NF;i++){print $i>a[i]".txt"}}' file
ghostdog74
Would awk be faster than Python?
xiaodai
Yes, most of the time.
ghostdog74
I am a complete noob. What's a good awk implementation on Windows?
xiaodai
Go here: gnuwin32.sourceforge.net/packages.html and look for gawk. There are other *nix tools there as well, especially coreutils.
ghostdog74
A: 

Try this tool, which lets you query and transform CSV files with SQL:

http://ondra.zizka.cz/stranky/programovani/ruzne/querying-transforming-csv-using-sql.texy

crunch input.csv output.csv "SELECT AVG(duration) AS durAvg FROM (SELECT * FROM indata ORDER BY duration LIMIT 2 OFFSET 6)"
Ondra Žižka
+1  A: 

If your file is 100GB, then disk I/O will be your bottleneck. Consider using the gzip module for both reading (a pre-compressed file) and writing, to speed things up drastically.
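
For example, a rough Python 3 sketch of that idea, built on the csv-based answer above (the input.csv.gz and per-column .csv.gz names are just placeholders, and whether it actually helps depends on how compressible the data is versus the extra CPU cost):

import csv
import gzip

def splitit_gz(inputfilename):   # e.g. 'input.csv.gz'
  # read the compressed input in text mode so csv.reader can consume it
  with gzip.open(inputfilename, 'rt', newline='') as inf:
    inrd = csv.reader(inf)
    names = next(inrd)
    # one compressed output file and writer per column
    outfiles = [gzip.open(n + '.csv.gz', 'wt', newline='') for n in names]
    writers = [csv.writer(o) for o in outfiles]
    for w, n in zip(writers, names):
      w.writerow([n])
    # stream the data rows, one field per column file
    for row in inrd:
      for w, field in zip(writers, row):
        w.writerow([field])
    for o in outfiles:
      o.close()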

Ztyx
Thanks for that. Useful! I am doing it on an SSD and it's still slow. Might give gzip a crack sometime.
xiaodai