I'm looking to parse these kinds of strings into lists in Python:

"a,b,c",d,"e,f"        =>  ['a','b','c'] , ['d'] , ['e','f']
"a,b,c",d,e            =>  ['a','b','c'] , ['d'] , ['e']
a,b,"c,d,e,f"          =>  ['a'],['b'],['c','d','e','f']
a,"b,c,d",{x(a,b,c-d)} =>  ['a'],['b','c','d'],[('x',['a'],['b'],['c-d'])]

It nests, so I suspect regular expressions are out. All I can think of is to start counting quotes and brackets to parse it, but that seems horribly inelegant. Or perhaps to first match quotes and replace commas between them with somechar, then split on commas, until all the nesting is done, and finally re-split on somechar.
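
A rough sketch of that second idea, covering only the flat (non-bracketed) cases. The name `split_quoted` and the choice of a NUL byte as "somechar" are assumptions made for illustration, and the input is assumed to contain no escaped quotes:

```python
import re

PLACEHOLDER = "\x00"  # stands in for "somechar"; assumed never to occur in the input

def split_quoted(text):
    # Step 1: inside each quoted run, swap commas for the placeholder
    # and drop the surrounding quotes.
    protected = re.sub(
        r'"([^"]*)"',
        lambda m: m.group(1).replace(",", PLACEHOLDER),
        text,
    )
    # Step 2: split on the remaining (top-level) commas.
    # Step 3: re-split every field on the placeholder.
    return [field.split(PLACEHOLDER) for field in protected.split(",")]

print(split_quoted('"a,b,c",d,"e,f"'))  # [['a', 'b', 'c'], ['d'], ['e', 'f']]
```

This does nothing for the `{x(...)}` form, which is where the nesting problem starts.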

Any thoughts?

A: 

Do you have quotes inside the strings?

If not, just replace the control characters to make it compatible with JSON and use a JSON parser.
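
One possible reading of this, for the flat cases only: quote the bare tokens so the line becomes a valid JSON array, load it, then split each string on its commas. This is a minimal sketch, not a complete solution. `parse_via_json` is a made-up name, and the input is assumed to contain no backslashes, escaped quotes, or brackets:

```python
import json
import re

# A quoted run, or a bare token made of anything but commas and quotes.
TOKEN = re.compile(r'"[^"]*"|[^,"]+')

def parse_via_json(text):
    # Leave quoted runs alone; turn each bare token into a JSON string.
    def quote_bare(match):
        token = match.group(0)
        return token if token.startswith('"') else json.dumps(token)

    as_json = '[' + TOKEN.sub(quote_bare, text) + ']'
    # Each array element is now one field; split it on its inner commas.
    return [field.split(',') for field in json.loads(as_json)]

print(parse_via_json('"a,b,c",d,"e,f"'))  # [['a', 'b', 'c'], ['d'], ['e', 'f']]
```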

Андрей Костенко
A: 

For the first three cases, you can just recursively apply the CSV reader:

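import csv

def expand(st):
    # A field without commas is a leaf and is returned as a bare string.
    if "," not in st:
        return st
    # Parse st as a single CSV row, then expand each column in turn.
    return [expand(col) for col in next(csv.reader([st]))]

print(expand('"a,b,c",d,"e,f"'))
print(expand('"a,b,c",d,e'))
print(expand('a,b,"c,d,e,f"'))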
Boojum
A: 

One method I use in PHP for things like that is to replace the deepest point of a nested expression (in this case, "{x(a,b,c-d)}") with a symbol, like '¶1', then save its parsed value (being [('x',['a'],['b'],['c-d'])]) to the variable $nest1.

You now have the original string 'a,"b,c,d",{x(a,b,c-d)}' looking like 'a,"b,c,d",¶1' which is parsed just like the first three. Then simply search the resultant array for anything that begins with '¶' and replace it with its associated variable.

This method supports as many levels as you want, just keep looping/recursing until all the symbols are gone. For example,

'a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'
'a,"b,c,d",{x(a,b,¶1,c-d)}' and $nest1=[('y',['j'],['k'],['l-m'])]
'a,"b,c,d",¶2' and $nest2=[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],['¶2']
['a'],['b','c','d'],[('x',['a'],['b'],['¶1'],['c-d'])]
['a'],['b','c','d'],[('x',['a'],['b'],[('y',['j'],['k'],['l-m'])],['c-d'])]

For safety, you can even escape any instance of ¶ that might already occur in the string before making the change, then unescape them as the last step, if you think it's necessary.

I don't know Python, so this might not work the same way as PHP. You may need to use an array instead of dynamic variables.
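
In Python, the idea might be sketched roughly as follows, with a dict standing in for the dynamic variables. The names `parse_flat` and `parse_nested` are made up for illustration, and the sketch assumes the input contains no ¶, no escaped quotes, and no braces inside quoted parts:

```python
import re

INNERMOST = re.compile(r'\{(\w+)\(([^{}]*)\)\}')  # a {name(args)} with no braces inside
FIELD = re.compile(r'"([^"]*)"|([^,"]+)')          # a quoted group or a bare token

def parse_flat(text, nests):
    """Parse a brace-free string; placeholder tokens are looked up in nests."""
    result = []
    for quoted, bare in FIELD.findall(text):
        if bare in nests:
            result.append(nests[bare])        # value saved for this placeholder
        elif bare:
            result.append([bare])             # plain token
        else:
            result.append(quoted.split(','))  # quoted group
    return result

def parse_nested(text):
    nests = {}  # placeholder -> parsed value, e.g. '¶1' -> [('y', ['j'], ...)]

    def stash(match):
        # Parse the innermost expression now and leave a placeholder behind.
        key = '¶%d' % (len(nests) + 1)
        name, args = match.group(1), match.group(2)
        nests[key] = [tuple([name] + parse_flat(args, nests))]
        return key

    # Keep replacing innermost expressions until none are left.
    while True:
        text, count = INNERMOST.subn(stash, text)
        if count == 0:
            break
    return parse_flat(text, nests)

print(parse_nested('a,"b,c,d",{x(a,b,{y(j,k,l-m)},c-d)}'))
```

Because each placeholder is resolved when its enclosing expression is parsed, there is no separate pass at the end to swap the symbols back.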

Patrick
A: 

So, here you are, your "honest python parser". Coding for you rather than answering the question, but I will be fine if you put it to use :-)

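QUOTE = '"'
SEP = ',(){}"'  # any of these characters ends a plain token
S_BRACKET = '{'
E_BRACKET = '}'
S_PAREN = '('

def parse_plain(string):
    # Read one bare token up to the next separator, which is consumed too.
    # Returns (characters consumed, token).
    counter = 0
    token = ""
    while counter<len(string):
        if string[counter] in SEP:
            counter += 1
            break
        token += string[counter]
        counter += 1
    return counter, token

def parse_bracket(string):
    # string[0] is '{'. Parse '{name(args...)}' into a one-element list
    # holding the tuple (name, arg1, arg2, ...).
    counter = 1
    fwd, token = parse_plain(string[counter:])
    output = [token]
    counter += fwd
    fwd, token = parse_(string[counter:])
    output += token
    counter += fwd
    output = [tuple(output)]
    return counter, output

def parse_quote(string):
    # string[0] is the opening quote. Collect the comma-separated tokens
    # inside the quotes into a flat list.
    counter = 1
    output = []
    while counter<len(string):
        # If the separator just consumed was the closing quote, skip the
        # character that follows it and stop.
        if counter > 1 and string[counter - 1] == QUOTE:
            counter += 1
            break
        fwd, token = parse_plain(string[counter:])
        output.append(token)
        counter += fwd
    return counter, output

def parse_(string):
    # Dispatch on the current character: bare token, quoted group or
    # bracketed expression. Stops after consuming a closing bracket.
    output = []
    counter = 0
    while counter < len(string):
        if string[counter].isalpha():
            fwd, token = parse_plain(string[counter:])
            token = [token]
        elif string[counter] == QUOTE:
            fwd, token = parse_quote(string[counter:])
        elif string[counter] == S_BRACKET:
            fwd, token = parse_bracket(string[counter:])
        elif string[counter] == E_BRACKET:
            counter += 1
            break
        else:
            # Leftover separators (commas, closing parens) are skipped.
            counter += 1
            continue
        output.append(token)
        counter += fwd
    return counter, output

def parse(string):
    # Public entry point: return only the parsed structure.
    return parse_(string)[1]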

And testing the output:

>>> print(parse('''"a,b,c",d,"e,f"'''))
[['a', 'b', 'c'], ['d'], ['e', 'f']]
>>> print(parse('''"a,b,c",d,e '''))
[['a', 'b', 'c'], ['d'], ['e ']]
>>> print(parse('''a,b,"c,d,e,f"'''))
[['a'], ['b'], ['c', 'd', 'e', 'f']]
>>> print(parse('''a,"b,c,d",{x(a,b,c-d)}'''))
[['a'], ['b', 'c', 'd'], [('x', ['a'], ['b'], ['c-d'])]]
>>> print(parse('''{x(a,{y("b,c,d",e)})},z'''))
[[('x', ['a'], [('y', ['b', 'c', 'd'], ['e'])])], ['z']]
>>>
jsbueno
when in doubt, fall back to basic parsing techniques!
Sean Woods