tags:

views: 63

answers: 2

Hi,

I have a big bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 or so words, this works just fine. I use the DictWriter as follows:

self.csv_out = csv.DictWriter(open(self.loc+'.csv','w'), quoting=csv.QUOTE_ALL, fieldnames=fields)

where fields is a list of words (i.e. the keys in the dictionary that I pass to csv_out.writerow).

However, this seems to be scaling horribly: as the number of words increases, the time required to write a row grows far faster than linearly. The _dict_to_list method in the csv module seems to be the instigator of my troubles.

I'm not entirely sure how to even begin to optimize here. Are there any faster CSV routines I could use?

+1  A: 

The obvious optimisation is to use a csv.writer instead of a DictWriter, passing in iterables for each row instead of dictionaries. Does that not help?
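Something along these lines, as a rough sketch (the names fields and rows and the output filename are placeholders; the counts are assumed to live in per-row dicts as in the question):

import csv

# Illustrative data: `fields` is the fixed column order, `rows` the count dicts.
fields = ['alpha', 'beta', 'gamma']
rows = [{'alpha': 3, 'gamma': 1}, {'beta': 7}]

with open('counts.csv', 'w') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(fields)  # header row
    for row in rows:
        # Build the value list yourself; no per-row fieldname validation.
        writer.writerow([row.get(word, 0) for word in fields])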

When you say "the number of words", do you mean the number of columns in the CSV? Because I've never seen a CSV that needs thousands of columns! Maybe you have transposed your data and are writing columns instead of rows? Each row should represent one datum, with its fields defined by the columns. If you really do need that sort of size, maybe a database is a better choice?
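If a row-per-word layout is acceptable, a rough sketch of that shape (the name counts and the output filename are illustrative) could be:

import csv

# Illustrative data: `counts` maps each word to its count.
counts = {'alpha': 3, 'beta': 7, 'gamma': 1}

with open('counts_long.csv', 'w') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['word', 'count'])    # two fixed columns
    for word, count in counts.items():
        writer.writerow([word, count])    # one row per word

That way the file grows in rows rather than in columns, regardless of vocabulary size.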

katrielalex
Well - that's what I was doing initially, and it makes the code much messier than the way it is at the moment - that really would be the last resort. *Update:* it's a bag-of-words analysis destined for R - I need to create a column for every single word it comes across...
Malang
@Malang: "I need to create a column for every single word it comes across". If this is your requirement, why does adding columns (as well as rows) bother you? Clearly it's **O**(n*m) and it won't scale well. What's your question?
S.Lott
well - it's made it nearly intractable computationally, and I was hoping there were some avenues for optimization...
Malang
@Malang: "nearly intractable computationally, and I was hoping there were some avenues for optimization"?? What? If it's intractable, it's intractable. You need to fundamentally change the algorithm. You can't optimize **O**(n*m) processing into **O**(n) processing.
S.Lott
ok - got it. Thank you.
Malang
+1  A: 

OK, this is by no means the answer, but I looked up the source code for the csv module and noticed that there is a very expensive membership check in it (lines 136-141 in Python 2.6): for every row, each key of the row dict is searched linearly in the fieldnames list, so with thousands of columns each writerow call does roughly columns-squared work.

if self.extrasaction == "raise":
    wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    if wrong_fields:
        raise ValueError("dict contains fields not in fieldnames: " +
                         ", ".join(wrong_fields))
return [rowdict.get(key, self.restval) for key in self.fieldnames]

So a quick workaround seems to be to pass extrasaction="ignore" when creating the writer, which skips that check entirely. This seems to speed things up very substantially.
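For reference, a minimal sketch of that construction (the word list and filename are placeholders; quoting and fieldnames match the question's setup):

import csv

# `fields` stands in for the real word list from the question.
fields = ['alpha', 'beta', 'gamma']

# Same DictWriter setup as in the question, plus extrasaction="ignore" so the
# per-row "dict contains fields not in fieldnames" scan is skipped.
csv_out = csv.DictWriter(open('counts.csv', 'w'),
                         quoting=csv.QUOTE_ALL,
                         fieldnames=fields,
                         extrasaction='ignore')
csv_out.writerow({'alpha': 3, 'gamma': 1})  # missing keys fall back to restval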

Not a perfect solution, and perhaps somewhat obvious, but I'm posting it in case it's helpful to somebody else.

Malang