tags:

views:

592

answers:

6

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:

    data1, data2  ,"data3'''",  'data4""',,,data5,

but this one is malformed:

    data1, data2, da"ta3", 'data4',

-- quotation marks can only be prepended or trailed by spaces.

Such malformed rows should be recognized - best would be to somehow mark malformed value within row, but if regex doesn't match the whole row then it's also acceptable.

I'm trying to write regex able to parse this, using either match() of findall(), but every single regex I'm coming with has some problems with edge cases.

So, maybe someone with experience in parsing something similar could help me on this? (Or maybe this is too complex for regex and I should just write a function)

EDIT1:

csv module is not much of use here:

    >>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
    [['2', ' "dat', 'a1"', " 'dat", "a2'", '']]

    >>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
    [['2', 'dat,a1', "'dat", "a2'", '']]

-- unless this can be tuned?

EDIT2: A few language edits - I hope it's more valid English now

EDIT3: Thank you for all answers, I'm now pretty sure that regular expression is not that good idea here as (1) covering all edge cases can be tricky (2) writer output is not regular. Writing that, I've decided to check mentioned pyparsing and either use it, or write custom FSM-like parser.

+4  A: 

Python has a standard library module to read csv files:

import csv

reader = csv.reader(open('file.csv'))

for line in reader:
    print line

For your example input this prints

['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']

EDIT:

you need to add skipinitalspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.

>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]

>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
pwdyson
Yes, this would be much cleaner than regex in this case, and note that the module is quite flexible - you can set your own delimiters, quote characters and line terminators fairly easily. See: http://docs.python.org/library/csv.html.
Max Shawabkeh
@Max S, did I overlook something in the docs, or is there a way to set it to handle two different quote characters as the OP requests?
Peter Hansen
@Peter: No, that's the one missing feature that the asker needs, which is why I posted my more general regex solution. However, it's unlikely in a real data, and `csv` is cleaner.
Max Shawabkeh
@Max S, it seemed to me that whether it's unlikely in real data or not, it is an explicit requirement so anything that doesn't deal with that doesn't really answer his question. Maybe I'm wrong.
Peter Hansen
-1 csv module doesn't handle two different quotechars simultaneously
John Machin
the answer worked for the question as originally asked, but when the question was edited to clarify the requirements I modified my answer to make it work as best it could. But csv doesn't seem to handle two different quotecharacters
pwdyson
+4  A: 

While the csv module is the right answer here, a regex that could do this is quite doable:

import re

r = re.compile(r'''
    \s*                # Any whitespace.
    (                  # Start capturing here.
      [^,"']+?         # Either a series of non-comma non-quote characters.
      |                # OR
      "(?:             # A double-quote followed by a string of characters...
          [^"\\]|\\.   # That are either non-quotes or escaped...
       )*              # ...repeated any number of times.
      "                # Followed by a closing double-quote.
      |                # OR
      '(?:[^'\\]|\\.)*'# Same as above, for single quotes.
    )                  # Done capturing.
    \s*                # Allow arbitrary space before the comma.
    (?:,|$)            # Followed by a comma or the end of a string.
    ''', re.VERBOSE)

line = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""

print r.findall(line)

# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']

EDIT: To validate lines, you can reuse the regex above with small additions:

import re

r_validation = re.compile(r'''
    ^(?:    # Capture from the start.
      # Below is the same regex as above, but condensed.
      # One tiny modification is that it allows empty values
      # The first plus is replaced by an asterisk.
      \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$    # And don't stop until the end.
    ''', re.VERBOSE)

line1 = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""

if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'

if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'

# Prints:
#    Line 1 is valid.
#    Line 2 is INvalid.
Max Shawabkeh
I'm fairly sure you need to switch both `[^"]|\\.` occurrences (and make them like this: `\\.|[^"]`) or change them to `[^"\\]|\\.`. Because `[^"]` will match a backslash, an escaped backslash or double quote will not be properly *parsed*.
Bart Kiers
Right you are. I always forget to put the backslash in the character class. Simply switching them wouldn't work for validating though as the engine can simply backtrack on the backslash.
Max Shawabkeh
True, adding them in the negated character class is the better option. I always notice this because I myself slipped up so many times! :)
Bart Kiers
-1 csv module is NOT the right answer here; it doesn't handle two different quotechars simultaneously
John Machin
That regex look amazing but I'm afraid of going with it as some edge cases might slip through. +1 though.
Tomasz Zielinski
+4  A: 

Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).

In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...

You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.

Peter Hansen
+1 for the link to the related question, plus the nice description of when you feel that reaching for pyparsing is appropriate. Unlike the other question, which had a very rigid format, this question would be more likely to use something like `delimitedList(Word(alphanums) | quotedString)`, since a delimitedList looks for comma delimiters by default.
Paul McGuire
A: 

This probably sounds too simple, but really from the looks of things you are looking for a string that contains either [a-zA-Z0-9]["']+[a-zA-Z0-9], I mean without in depth testing against the data really what you're looking for is a quote or double quote (or any combination) in between letters (you could also add numbers there).

Based on what you were asking, it really doesn't matter that it's a CSV, it matter's that you have data that doesn't conform. Which I believe just doing a search for a letter, then any combination of one or more " or ' and another letter.

Now are you looking to get a "quantity" or just a printout of the line that contains it so you know which ones to go back and fix?

I'm sorry I don't know python regex's but in perl this would look something like this:

# Look for one or more letter/number at least one ' or " or more and at least one    
#  or more letter/number
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Prints the line if the above regex is found
    print $line;

}

Just simply convert that for when you look at a line.

I'm sorry if I misunderstood the question

I hope it helps!

onaclov2000
+1  A: 

It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.

It evidently contains rules like:

If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).

What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.

You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.

How is your data encoded?

John Machin
This is one of those cases when you have to "be liberal in what you accept" - I don't know how writer works (it's closed source web service, probably subject to change), I only have some data examples and a few guidelines that says more or less "you can expect almost anything".
Tomasz Zielinski
"be liberal" etc applies when you have a standard and you have determined that writers stray slightly in a manner that can be tolerated by a reader with no ill effects. You have no standard and no experience with writers. Yes, you can expect almost anything, but have shown no evidence of acting on that expectation. I suggest that you be rather intolerant: set an expectation, program for that, and log differences from expectation. Checking for min/max number of columns in each dataset is always worth doing. Use a high-level tool like pyParsing to minimise time-to-fix. quote/comma? encoding?
John Machin
A: 

If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:

<xml>
  <line>
    <column>data1</column>
    <column>data2  </column>
    <column>data3'''</column>
    <column>data4""</column>
    <column/>
    <column/>
    <column>data5</column>
    <column/>
  </line>
</xml>

Note that Gelatin also has a Python API:

from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')
knipknap