The following function parses a CSV file into a list of dictionaries, one per row, with the values keyed by the file's header fields (the header is assumed to be the first line).

This function is very slow, taking ~6 seconds on a relatively small file (fewer than 30,000 lines).

How can I speed it up?

def csv2dictlist_raw(filename, delimiter='\t'):
    f = open(filename)
    header_line = f.readline().strip()
    header_fields = header_line.split(delimiter)
    dictlist = []
    # convert each data line to a dictionary keyed by the header fields
    for line in f:
        values = map(tryEval, line.strip().split(delimiter))
        dictline = dict(zip(header_fields, values))
        dictlist.append(dictline)
    return (dictlist, header_fields)

In response to comments:

I know there's a csv module and I can use it like this:

data = csv.DictReader(my_csvfile, delimiter=delimiter)

This is much faster. However, the problem is that it doesn't automatically cast values that are obviously floats or integers to numeric types; it leaves everything as strings. How can I fix this?
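For example, with a made-up two-column file, every value comes back as a string:

import csv
import StringIO

# toy illustration: every value DictReader yields is a str
my_csvfile = StringIO.StringIO("name\tscore\nfoo\t5.34\n")
for row in csv.DictReader(my_csvfile, delimiter='\t'):
    print row  # e.g. {'name': 'foo', 'score': '5.34'} -- '5.34' is a str, not a float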

Using the "Sniffer" class does not work for me. When I try it on my files, I get the error:

File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/csv.py", line 180, in sniff
    raise Error, "Could not determine delimiter"
Error: Could not determine delimiter

How can I make DictReader parse the fields into their types when it's obvious?

Thanks.

+1  A: 

I see several issues with your code:

  • Why do you need dicts? The keys are stored in every dict instance, which blows up memory consumption.

  • Do you really need to hold all the rows in memory, or would it be an option to use yield?

  • Trying to convert each value takes time and, in my opinion, makes no sense. If a column contains the values "abc" and "123", the latter should probably stay a string too. The type of a column should be fixed, and you should make the conversion explicit.

  • Even if you want to keep your conversion logic: use the csv module and convert values afterwards (see the sketch below).
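A minimal sketch of the last two points, not from the original answer (the column names and types are hypothetical placeholders): read with the csv module, convert each column explicitly, and yield rows instead of building a list:

import csv

# hypothetical column-to-type mapping; adjust to your file
converters = {'count': int, 'score': float}

def read_typed(filename, delimiter='\t'):
    with open(filename) as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            for field, convert in converters.iteritems():
                row[field] = convert(row[field])
            yield row  # yielding keeps memory usage flat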

Achim
I agree it should not all be loaded into memory - that's fine, and csv.DictReader doesn't do that. But I do think DictReader should recognize that when every entry in a field looks like 5.34 or 0.05, the field should be a float, not a string; in my files it does not do that (Sniffer is able to detect the delimiter / line terminator, but not the types). How can I make it infer the obvious types?
Also: I don't want to convert each row independently -- I agree each column's type should be fixed, but I cannot get DictReader to infer that a column holds floats even when every value in that column in the file is a float.
+2  A: 
import ast
import csv

# find the field types by inspecting the first data row
for row in csv.DictReader(my_csvfile, delimiter=delimiter):
    break
else:  # the loop body never ran, i.e. there are no data rows
    assert 0, "no rows to process"

cast = {}
for k, v in row.iteritems():
    # try progressively more general conversions on the sample value
    for f in (int, float, ast.literal_eval):
        try:
            f(v)
            cast[k] = f
            break
        except (ValueError, SyntaxError):
            pass
    else:  # no suitable conversion: fall back to a decoded string
        # `encoding` is the file's character encoding, defined elsewhere
        cast[k] = lambda x: x.decode(encoding)

# rewind and re-read, applying the per-column casts
my_csvfile.seek(0)

data = [dict((k.decode(encoding), cast[k](v)) for k, v in row.iteritems())
        for row in csv.DictReader(my_csvfile, delimiter=delimiter)]
J.F. Sebastian
Thanks very much. Another question on this: how can I get DictWriter to serialize back out the data I read in from a CSV file? Do I have to write a custom method for that?
@user248237: `writer.writerows(data)` serializes a list of dicts. If your data contains non-ASCII strings then you need a custom writer similar to `UnicodeWriter` from http://docs.python.org/library/csv.html#examples
J.F. Sebastian
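For reference, a minimal sketch of that round trip, not from the thread: it assumes `data` is the list of dicts built above, `header_fields` holds the column order, `delimiter` is as before, and all values are ASCII:

import csv

with open('out.tsv', 'wb') as out:  # binary mode for the Python 2 csv module
    writer = csv.DictWriter(out, fieldnames=header_fields, delimiter=delimiter)
    # DictWriter has no writeheader() before Python 2.7,
    # so write the header row by hand
    writer.writerow(dict(zip(header_fields, header_fields)))
    writer.writerows(data)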