I want to parse incoming CSV-like rows of data. Values are separated by commas (possibly with leading and trailing whitespace around the commas) and can be quoted with either ' or ". For example, this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- an opening quotation mark can only be preceded by spaces, and a closing one can only be followed by spaces.
Such malformed rows should be recognized. Ideally the malformed value would be marked within the row, but it is also acceptable if the regex simply fails to match the whole row.
I'm trying to write a regex that can parse this, using either match() or findall(), but every regex I come up with has problems with edge cases.
So, could someone with experience parsing something similar help me with this? (Or maybe this is too complex for a regex and I should just write a function.)
EDIT1: The csv module is not much use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
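On the tuning question: as far as I can tell, csv.reader accepts skipinitialspace=True, which ignores spaces directly after a delimiter, but a dialect takes only a single one-character quotechar. A small check of what that option changes for the first row above (the names row and parsed are just for illustration):

```python
import csv
from io import StringIO

row = '''2, "dat,a1", 'dat,a2','''

# skipinitialspace=True drops spaces right after each comma, so the
# double-quoted value is now recognized as a quoted field.
parsed = list(csv.reader(StringIO(row), skipinitialspace=True))
print(parsed)
```

The double-quoted value now survives intact, but the single-quoted one is still split at its comma, because only one quote character is active at a time. Rows that mix both quote styles therefore still need something else.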
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now fairly sure that a regular expression is not a good idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. Having said that, I've decided to check out the mentioned pyparsing and either use it or write a custom FSM-like parser.
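For the custom-parser option, a minimal hand-written scanner might look like the sketch below. It is only an illustration under my own assumptions: parse_row and QUOTES are hypothetical names, whitespace around values is stripped, a quoted value cannot contain its own quote character (no escaping), and a malformed value is reported by raising ValueError with its column.

```python
QUOTES = "'\""


def parse_row(line):
    """Split one row into values; raise ValueError on a malformed value."""
    values = []
    i, n = 0, len(line)
    while True:
        # Skip whitespace before the value.
        while i < n and line[i].isspace():
            i += 1
        if i < n and line[i] in QUOTES:
            # Quoted value: runs up to the next occurrence of the same quote.
            quote = line[i]
            end = line.find(quote, i + 1)
            if end == -1:
                raise ValueError(f"unterminated quote at column {i}")
            values.append(line[i + 1:end])
            i = end + 1
            # Only whitespace may follow the closing quote.
            while i < n and line[i].isspace():
                i += 1
            if i < n and line[i] != ",":
                raise ValueError(f"unexpected character at column {i}")
        else:
            # Unquoted value: runs up to the next comma, no quotes allowed.
            start = i
            while i < n and line[i] != ",":
                if line[i] in QUOTES:
                    raise ValueError(f"quote inside value at column {i}")
                i += 1
            values.append(line[start:i].rstrip())
        if i == n:
            return values
        i += 1  # step over the comma


print(parse_row('''data1, data2 ,"data3\'\'\'", 'data4""',,,data5,'''))
```

With this sketch the valid row from the question yields eight values (the trailing comma produces a final empty one), while the malformed row raises ValueError at the stray quote inside da"ta3".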