tags:

views: 63

answers: 2

Hi,

I have a big bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 or so words, this works just fine. I use the DictWriter as follows:

self.csv_out = csv.DictWriter(open(self.loc+'.csv','w'), quoting=csv.QUOTE_ALL, fieldnames=fields)

where fields is a list of words (i.e. the keys in the dictionary that I pass to csv_out.writerow).

However, this seems to be scaling horribly: as the number of words increases, the time required to write a row grows far faster than linearly. The _dict_to_list method in the csv module seems to be the instigator of my troubles.

I'm not entirely sure how to even begin to optimize here. Are there any faster CSV routines I could use?

+1  A: 

The obvious optimisation is to use a csv.writer instead of a DictWriter, passing in iterables for each row instead of dictionaries. Does that not help?
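Something along these lines, as a rough sketch (the names fields and rows and the output filename are placeholders; the counts are assumed to live in per-row dicts as in the question):

import csv

# Illustrative data: `fields` is the fixed column order, `rows` the count dicts.
fields = ['alpha', 'beta', 'gamma']
rows = [{'alpha': 3, 'gamma': 1}, {'beta': 7}]

with open('counts.csv', 'w') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(fields)  # header row
    for row in rows:
        # Build the value list yourself; no per-row fieldname validation.
        writer.writerow([row.get(word, 0) for word in fields])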

When you say "the number of words", do you mean the number of columns in the CSV? Because I've never seen a CSV that needs thousands of columns! Maybe you have transposed your data and are writing columns instead of rows? Each row should represent one datum, with its fields defined by the columns. If you really do need that sort of size, maybe a database is a better choice?
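If a row-per-word layout is acceptable, a rough sketch of that shape (the name counts and the output filename are illustrative) could be:

import csv

# Illustrative data: `counts` maps each word to its count.
counts = {'alpha': 3, 'beta': 7, 'gamma': 1}

with open('counts_long.csv', 'w') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['word', 'count'])    # two fixed columns
    for word, count in counts.items():
        writer.writerow([word, count])    # one row per word

That way the file grows in rows rather than in columns, regardless of vocabulary size.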

katrielalex
Well - that's what I was doing initially, and it makes the code much messier than the way it is at the moment - that really would be the last resort. *Update:* it's a bag-of-words analysis destined for R - I need to create a column for every single word it comes across...
Malang
@Malang: "I need to create a column for every single word it comes across". If this is your requirement, why does adding columns (as well as rows) bother you? Clearly it's **O**(n*m) and it won't scale well. What's your question?
S.Lott
well - it's made it nearly intractable computationally, and I was hoping there were some avenues for optimization...
Malang
@Malang: "nearly intractable computationally, and I was hoping there were some avenues for optimization"?? What? If it's intractable, it's intractable. You need to fundamentally change the algorithm. You can't optimize **O**(n*m) processing into **O**(n) processing.
S.Lott
ok - got it. Thank you.
Malang
+1  A: 

OK, this is by no means the answer, but I looked up the source code for the csv module and noticed that there is a very expensive membership check in it (lines 136-141 in Python 2.6): for every row, each key of the row dict is searched linearly in the fieldnames list, so with thousands of columns each writerow call does roughly columns-squared work.

if self.extrasaction == "raise":
    wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    if wrong_fields:
        raise ValueError("dict contains fields not in fieldnames: " +
                         ", ".join(wrong_fields))
return [rowdict.get(key, self.restval) for key in self.fieldnames]

So a quick workaround seems to be to pass extrasaction="ignore" when creating the writer, which skips that check entirely. This seems to speed things up very substantially.
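For reference, a minimal sketch of that construction (the word list and filename are placeholders; quoting and fieldnames match the question's setup):

import csv

# `fields` stands in for the real word list from the question.
fields = ['alpha', 'beta', 'gamma']

# Same DictWriter setup as in the question, plus extrasaction="ignore" so the
# per-row "dict contains fields not in fieldnames" scan is skipped.
csv_out = csv.DictWriter(open('counts.csv', 'w'),
                         quoting=csv.QUOTE_ALL,
                         fieldnames=fields,
                         extrasaction='ignore')
csv_out.writerow({'alpha': 3, 'gamma': 1})  # missing keys fall back to restval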

Not a perfect solution, and perhaps somewhat obvious, but I'm posting it in case it's helpful to somebody else.

Malang