I am writing a Python utility that needs to parse a large, regularly updated CSV file I don't control. The utility must run on a server with only Python 2.4 available. The CSV file does not quote field values at all, but the Python 2.4 version of the csv library does not seem to give me any way to turn off quoting; it only lets me set the quote character (dialect.quotechar = '"' or whatever). If I try setting the quote character to None or the empty string, I get an error.

I can sort of work around this by setting dialect.quotechar to some "rare" character, but this is brittle, as there is no ASCII character I can absolutely guarantee will not show up in field values (except the delimiter, but if I set dialect.quotechar = dialect.delimiter, things go predictably haywire).
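One way to make the "rare character" workaround a little less brittle is to check the data before committing to a character. The following is only a sketch: pick_unused_char and its candidate list are hypothetical names invented for illustration, and it assumes the whole file (or a representative chunk of it) fits in memory as a string.

```python
def pick_unused_char(text, candidates='\x07\x08\x01\x02'):
    """Return the first candidate character that never occurs in text.

    Hypothetical helper: the candidate list is an arbitrary choice of
    non-printing ASCII characters, not anything the csv module defines.
    Returns None when every candidate appears in the data.
    """
    for ch in candidates:
        if ch not in text:
            return ch
    return None


sample = '1,2,3,4,"5\n1,2,3,4,5\n'
quotechar = pick_unused_char(sample)
```

If the helper returns None, no candidate is safe for that input and the caller has to fall back to something else. The check is also only as good as the portion of the file it has seen.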

In Python 2.5 and later, if I set dialect.quoting to csv.QUOTE_NONE, the CSV reader respects that and does not interpret any character as a quote character. Is there any way to duplicate this behavior in Python 2.4?

UPDATE: Thanks Triptych and Mark Roddy for helping to narrow the problem down. Here's a simplest-case demonstration:

>>> import csv
>>> import StringIO
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data))
>>> for i in reader: print i
... 
[]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string

The problem only occurs when there's a single double-quote character in the final column of a row. Unfortunately, this situation exists in my dataset. I've accepted Tanj's solution: manually assign a nonprinting character ("\x07" or BEL) as the quotechar. This is hacky, but it works, and I haven't yet seen another solution that does. Here's a demo of the solution in action:

>>> import csv
>>> import StringIO
>>> class MyDialect(csv.Dialect):
...     quotechar = '\x07'
...     delimiter = ','
...     lineterminator = '\n'
...     doublequote = False
...     skipinitialspace = False
...     quoting = csv.QUOTE_NONE
...     escapechar = '\\'
... 
>>> dialect = MyDialect()
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data), dialect=dialect)
>>> for i in reader: print i
... 
[]
['1', '2', '3', '4', '"5']
['1', '2', '3', '4', '5']

In Python 2.5+ setting quoting to csv.QUOTE_NONE would be sufficient, and the value of quotechar would then be irrelevant. (I'm actually getting my initial dialect via a csv.Sniffer and then overriding the quotechar value, not by subclassing csv.Dialect, but I don't want that to be a distraction from the real issue; the above two sessions demonstrate that Sniffer isn't the problem.)
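For comparison, here is a minimal sketch of the Python 2.5+ behaviour described above, where csv.QUOTE_NONE alone is enough. It assumes the same two sample rows as the sessions above (without the leading blank line), and the StringIO import fallback is only there so the snippet also runs on Python 3.

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

# Same rows as in the sessions above: a lone double quote in the last column.
data = '1,2,3,4,"5\n1,2,3,4,5\n'

# With QUOTE_NONE the reader treats '"' as ordinary field data,
# so the value of quotechar never comes into play.
reader = csv.reader(StringIO(data), quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row)
```

On a version where QUOTE_NONE is honoured, the first row keeps the literal '"5' in its last field, just as in the BEL workaround.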

+4  A: 

I don't know if Python would like or allow it, but could you use a non-printable ASCII code such as BEL or BS (backspace)? I would expect these to be extremely rare.

Tanj
Wow, good idea. Setting dialect.quotechar = '\x07' (BEL) seems to do the trick. Can't imagine how they'd get that into their CSV data.
Carl Meyer
Haha -- nice hack. :-)
cdleary
Nice, hacktastic.
Kiv
+2  A: 

I tried a few examples using Python 2.4.3, and it seemed to be smart enough to detect that the fields were unquoted.

I know you've already accepted a (slightly hacky) answer, but have you tried just leaving the reader.dialect.quotechar value alone? What happens if you do?

Any chance we could get example input?

Triptych
I'm still interested in a less hacky approach, if there is one. I can get some sample input uploaded soon. The dialect I'm using is generated by a csv.Sniffer object (I need to be as robust as possible against format changes). If I leave quotechar alone it seems to default to double-quote '"'.
Carl Meyer
A: 

+1 for Triptych

Confirmation that csv.reader automatically handles CSV files without quotes:

>>> import StringIO
>>> import csv
>>> data="""
... 1,2,3,4,5
... 1,2,3,4,5
... 1,2,3,4,5
... """
>>> reader=csv.reader(StringIO.StringIO(data))
>>> for i in reader:
...     print i
... 
[]
['1', '2', '3', '4', '5']
['1', '2', '3', '4', '5']
['1', '2', '3', '4', '5']
Mark Roddy
This isn't really a relevant test: no matter what quotechar is set to, quoting is optional, so the reader handles unquoted fields just fine. The problem arises when the quotechar appears in the data, and apparently only when it appears in the final column. Thanks for pushing me to narrow it down.
Carl Meyer