A trivial CSV line can be split using a string split function. But some lines contain quoted fields with embedded commas, e.g.

"good,morning", 100, 300, "1998,5,3"

thus directly using string split would not solve the problem.
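
For illustration, here is a minimal sketch of what a plain split does to the sample line above (the variable names are only for this example):

```python
line = '"good,morning", 100, 300, "1998,5,3"'

# A plain split cuts at every comma, including the ones inside the quotes.
parts = line.split(",")

print(len(parts))  # 7, although the line has only 4 fields
print(parts)       # ['"good', 'morning"', ' 100', ' 300', ' "1998', '5', '3"']
```

The two quoted fields come back in pieces, with the quote characters still attached.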

My solution is to first split the line on , and then recombine the pieces that have a " at the beginning or end of the string.

What's the best practice for this problem?

I am interested if there's a Python or F# code snippet for this.

EDIT: I am more interested in the implementation detail, rather than using a library.

+11  A: 

There's a csv module in Python, which handles this.

Edit: This task falls into the "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you).

The state machine for this task would probably only need two states:

  • Initial state, where every character except comma and newline is read as part of the field (exception: leading and trailing spaces), a comma is the field separator, and a newline is the record separator. When it encounters an opening quote, it goes into the
  • read-quoted-field state, where every character (including comma and newline) except a quote is treated as part of the field; a quote not followed by a quote ends the quoted field (back to the initial state), and a quote followed by a quote is treated as a single quote (an escaped quote).

By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".
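
The two-state machine described above could be sketched roughly as follows. This is only an illustration under a few assumptions, not how the csv module is implemented: the function name parse_csv and its sep parameter are made up for this example, only the double quote is treated as a quote character, and a quote in the middle of an unquoted field is accepted leniently rather than reported as an error.

```python
def parse_csv(text, sep=","):
    """Parse CSV text into a list of records, each a list of field strings."""
    records, row, buf = [], [], []
    in_quotes = False   # False: initial state; True: read-quoted-field state
    quoted = False      # True once the current field has had a quoted part

    def end_field():
        nonlocal buf, quoted
        value = "".join(buf)
        # Unquoted fields lose leading/trailing spaces; quoted ones are kept as read.
        row.append(value if quoted else value.strip())
        buf, quoted = [], False

    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch != '"':
                buf.append(ch)              # commas and newlines included
            elif text[i + 1:i + 2] == '"':
                buf.append('"')             # doubled quote: one escaped quote
                i += 1
            else:
                in_quotes = False           # closing quote, back to initial state
        elif ch == '"':
            if not "".join(buf).strip():
                buf = []                    # drop spaces before the opening quote
            in_quotes = quoted = True
        elif ch == sep:
            end_field()
        elif ch == "\n":
            end_field()
            records.append(row)
            row = []
        elif not (quoted and ch.isspace()):
            buf.append(ch)                  # spaces after a closing quote are skipped
        i += 1

    if buf or row or quoted:                # last record without a trailing newline
        end_field()
        records.append(row)
    return records


print(parse_csv('"good,morning", 100, 300, "1998,5,3"'))
# [['good,morning', '100', '300', '1998,5,3']]
print(parse_csv('"Field1"",""Field2"'))
# [['Field1","Field2']]
```

The second call is one of the inputs that breaks the split-and-recombine approach: the doubled quotes are escaped quotes, so the whole line is a single field.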

Rafał Dowgird
Like with most parsing problems, it's a more sustainable practice to use a library if one exists. If the OP is really interested in the implementation, I'm sure the Python library is open source.
Benjamin Oakes
As we say in the Python community: "Use the source, Luke". It's totally open and already installed with Python. Just read it.
S.Lott
+3  A: 

From Python's csv module:

Reading a normal CSV file (Python 3 syntax):

import csv
# newline="" lets the csv module handle line endings itself
with open("some.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

Reading a file with an alternate format:

import csv
with open("passwd", newline="") as f:
    for row in csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE):
        print(row)
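
As a small sketch of how this applies to the line from the question (assuming Python 3), the reader can also be fed an in-memory string through io.StringIO. The sample line has a space after each comma, so skipinitialspace=True is passed to make the reader ignore whitespace right after the delimiter and still recognise the opening quote of the last field:

```python
import csv
import io

line = '"good,morning", 100, 300, "1998,5,3"'

# skipinitialspace=True ignores whitespace immediately following each delimiter
reader = csv.reader(io.StringIO(line), skipinitialspace=True)
fields = next(reader)

print(fields)  # ['good,morning', '100', '300', '1998,5,3']
```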

There are some nice usage examples in LinuxJournal.com.

If you're interested in the details, read "split string at commas respecting quotes when string not in csv format", which shows some nice regexes related to this problem, or simply read the csv module source.

Adam Matan
+1  A: 

Chapter 4 of The Practice of Programming gave both C and C++ implementations of the CSV parser.

E.T
+1  A: 

The generic implementation detail would be something like this (untested)

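SEPARATOR = ","

def csvline2fields(line):
    """Split one CSV line into fields (no support for escaped quotes)."""
    fields = []
    line = line.strip()
    while line:
        if line[0] in ("'", '"'):
            # Find the matching closing quote, starting after the opening one:
            end = line.find(line[0], 1)
            if end == -1:
                end = len(line)  # unterminated quote: take the rest of the line
            fields.append(line[1:end])
            # Find the beginning of the next field:
            next_sep = line.find(SEPARATOR, end)
        else:
            # Find the next separator:
            next_sep = line.find(SEPARATOR)
            end = next_sep if next_sep != -1 else len(line)
            fields.append(line[:end].strip())
        if next_sep == -1:
            break  # that was the last field
        line = line[next_sep + 1:].strip()
    return fields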
Lennart Regebro
Actually, the recommendation to look at the CSV module in the Python open source is way better. Silly me.
Lennart Regebro