The following function parses a CSV file into a list of dictionaries, one per row, with the values keyed by the file's header fields (the header is assumed to be the first line).

This function is very slow, taking ~6 seconds on a relatively small file (fewer than 30,000 lines).

How can I speed it up?

def csv2dictlist_raw(filename, delimiter='\t'):
    f = open(filename)
    header_line = f.readline().strip()
    header_fields = header_line.split(delimiter)
    dictlist = []
    # convert each data line to a dictionary keyed by the header fields
    for line in f:
        values = map(tryEval, line.strip().split(delimiter))
        dictline = dict(zip(header_fields, values))
        dictlist.append(dictline)
    return (dictlist, header_fields)

In response to comments:

I know there's a csv module and I can use it like this:

data = csv.DictReader(my_csvfile, delimiter=delimiter)

This is much faster. However, the problem is that it doesn't automatically cast values that are obviously floats or integers to numeric types; it leaves everything as strings. How can I fix this?
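For example, with a made-up two-column file, every value comes back as a string:

import csv
import StringIO

# toy illustration: every value DictReader yields is a str
my_csvfile = StringIO.StringIO("name\tscore\nfoo\t5.34\n")
for row in csv.DictReader(my_csvfile, delimiter='\t'):
    print row  # e.g. {'name': 'foo', 'score': '5.34'} -- '5.34' is a str, not a float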

Using the "Sniffer" class does not work for me. When I try it on my files, I get the error:

File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/csv.py", line 180, in sniff
    raise Error, "Could not determine delimiter"
Error: Could not determine delimiter

How can I make DictReader parse the fields into their types when it's obvious?

Thanks.

+1  A: 

I see several issues with your code:

  • Why do you need dicts? The keys are stored in every dict instance, which blows up memory consumption.

  • Do you really need to hold all the rows in memory, or would it be an option to use yield?

  • Trying to convert each value takes time and, in my opinion, makes no sense. If a column contains the values "abc" and "123", the latter should probably stay a string too. The type of a column should be fixed, and you should make the conversion explicit.

  • Even if you want to keep your conversion logic: use the csv module and convert values afterwards (see the sketch below).
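A minimal sketch of the last two points, not from the original answer (the column names and types are hypothetical placeholders): read with the csv module, convert each column explicitly, and yield rows instead of building a list:

import csv

# hypothetical column-to-type mapping; adjust to your file
converters = {'count': int, 'score': float}

def read_typed(filename, delimiter='\t'):
    with open(filename) as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            for field, convert in converters.iteritems():
                row[field] = convert(row[field])
            yield row  # yielding keeps memory usage flat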

Achim
I agree it should not all be loaded into memory - that's fine, and csv.DictReader doesn't do that. But I do think DictReader should recognize that when every entry in a field looks like 5.34 or 0.05, the field should be a float, not a string; in my files it does not do that (Sniffer is able to detect the delimiter / line terminator, but not the types). How can I make it infer the obvious types?
Also: I don't want to convert each row independently -- I agree each column's type should be fixed, but I cannot get DictReader to infer that a column holds floats even when every value in that column in the file is a float.
+2  A: 
import ast
import csv

# find the field types by inspecting the first data row
for row in csv.DictReader(my_csvfile, delimiter=delimiter):
    break
else:  # the loop body never ran, i.e. there are no data rows
    assert 0, "no rows to process"

cast = {}
for k, v in row.iteritems():
    # try progressively more general conversions on the sample value
    for f in (int, float, ast.literal_eval):
        try:
            f(v)
            cast[k] = f
            break
        except (ValueError, SyntaxError):
            pass
    else:  # no suitable conversion: fall back to a decoded string
        # `encoding` is the file's character encoding, defined elsewhere
        cast[k] = lambda x: x.decode(encoding)

# rewind and re-read, applying the per-column casts
my_csvfile.seek(0)

data = [dict((k.decode(encoding), cast[k](v)) for k, v in row.iteritems())
        for row in csv.DictReader(my_csvfile, delimiter=delimiter)]
J.F. Sebastian
Thanks very much. Another question on this: how can I get DictWriter to serialize back out the data I read in from a CSV file? Do I have to write a custom method for that?
@user248237: `writer.writerows(data)` serializes a list of dicts. If your data contains non-ASCII strings then you need a custom writer similar to `UnicodeWriter` from http://docs.python.org/library/csv.html#examples
J.F. Sebastian
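For reference, a minimal sketch of that round trip, not from the thread: it assumes `data` is the list of dicts built above, `header_fields` holds the column order, `delimiter` is as before, and all values are ASCII:

import csv

with open('out.tsv', 'wb') as out:  # binary mode for the Python 2 csv module
    writer = csv.DictWriter(out, fieldnames=header_fields, delimiter=delimiter)
    # DictWriter has no writeheader() before Python 2.7,
    # so write the header row by hand
    writer.writerow(dict(zip(header_fields, header_fields)))
    writer.writerows(data)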