I am trying to parse a list of items that satisfies the Python regex

r'\A(("[\w\s]+"|\w+)\s+)*\Z'

that is, a space-separated list in which spaces are allowed inside quoted strings. I would like to get a list of the items, that is, the substrings matched by the

r'("[\w\s]+"|\w+)'

part. For example:

>>> parse('foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']

Is there any nice way to do this with Python's re module?

Many things don't quite work. For example

>>> re.match(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
'"bob"'

only returns the last item that group 2 matched. On the other hand,

>>> re.findall(r'("[\w\s]+"|\w+)', 'foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']

does return every item, but it also accepts malformed expressions such as

>>> re.findall(r'("[\w\s]+"|\w+)', 'foo "bar b-&&az" "bob" ')
['foo', 'bar', 'b', 'az', '" "', 'bob']

So is there any way to use the original regex and get all of the items that matched group 2? Something like

>>> re.match_multigroup(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
['foo', '"bar baz"', '"bob"']
>>> re.match_multigroup(r'("[\w\s]+"|\w+)', 'foo "bar b-&&az" "bob" ')
None

Edit: It is important that I preserve the quotes in the output, thus I don't want

>>> re.match_multigroup(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
['foo', 'bar baz', 'bob']

because then I don't know if bob was quoted or not.

+2  A: 

I don't think that regex is the right tool here. Try the csv module:

>>> import csv
>>> s = 'foo "bar baz" "bob" '
>>> for i in csv.reader([s], delimiter=' '):
...     print(i)
...
['foo', 'bar baz', 'bob', '']
SilentGhost
Thanks for the suggestion. However, in my application I need to distinguish between items that are quoted and ones that aren't, whereas this doesn't seem to do that.
sligocki
and what is the purpose of this requirement?
SilentGhost
sligocki
can't you tell which should be quoted from the order?
SilentGhost
No, they could be in any order. For example, 'foo bob ', 'foo "bob" ', '"foo" bob ' and '"foo" "bob" ' are all valid inputs that should be mutually distinguishable.
sligocki
Maybe I'm mistaken, but your input doesn't seem to be regular.
SilentGhost
It's a little weird, but consider it to be an expression which mixes strings and variables. Thus 'foo "bar"' is the variable foo followed by the string "bar", and I need to distinguish that foo is a variable and "bar" is a string.
sligocki
so, there is an order!
SilentGhost
What do you mean? Of course there is an order in the expression, but variables and strings could be in any order. Like I said, 'foo "bar"' and '"foo" bar' are both allowed but distinguishable. The first has the variable foo then the string "bar"; the second has the string "foo" then the variable bar.
sligocki
+1  A: 

Here's a solution that splits on any whitespace that isn't inside a pair of quotation marks:

re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)', target)

The lookahead succeeds only if there's an even number of quotation marks ahead of the just-matched whitespace. If quoted sections in your text can contain escaped quotes, you may need a more complicated regex, depending on how the escaping is done.
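As a minimal sketch of how this split behaves on the example from the question: the names SPLIT_OUTSIDE_QUOTES and split_items are made up for illustration, and the strip() call is an added assumption so that trailing whitespace does not produce an empty final item.

```python
import re

# Split on whitespace that is followed by an even number of quotation
# marks, i.e. whitespace that is not inside a quoted string.
SPLIT_OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

def split_items(target):
    # strip() is an assumption of this sketch, not part of the regex:
    # without it, trailing whitespace yields an empty last item.
    return SPLIT_OUTSIDE_QUOTES.split(target.strip())

print(split_items('foo "bar baz" "bob" '))
# ['foo', '"bar baz"', '"bob"']
```

Note that the split only decides where to cut; it does not check that each resulting piece is a well-formed item.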

Alan Moore
Wow, that seems to work, now if I can just figure out how it works :)
sligocki
Er, actually, there is a problem: this never rejects expressions. For example, if target = 'foo"bar baz" "bob"', it returns ['foo"bar baz"', '"bob"'].
sligocki
I suppose I could just check that each output is in the right format and fail if not, though.
sligocki
Does your input really contain quoted and non-quoted sections run together like that? None of your examples do.
Alan Moore
sligocki
However, like I said, I could easily check each element in the list after your expression runs, and if all of the elements are well-formed, then I think the whole thing must be well-formed. So, no worries.
sligocki
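The split-then-check idea from the comments above could look roughly like the sketch below. The function name split_and_check and the constant names are hypothetical. Unlike the regex in the question, this version does not require trailing whitespace after the last item, which is an assumption of the sketch.

```python
import re

# Whitespace that is not inside a pair of quotation marks.
SPLIT_OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')
# A single well-formed item: a quoted string or a bare word.
ITEM = re.compile(r'"[\w\s]+"|\w+')

def split_and_check(target):
    items = SPLIT_OUTSIDE_QUOTES.split(target.strip())
    if items == ['']:
        # Empty (or all-whitespace) input: no items at all.
        return []
    # Accept the list only if every piece is a well-formed item.
    if all(ITEM.fullmatch(item) for item in items):
        return items
    return None

print(split_and_check('foo "bar baz" "bob" '))  # ['foo', '"bar baz"', '"bob"']
print(split_and_check('foo"bar baz" "bob"'))    # None
```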
A: 

Alright, I ended up deciding to do this in two steps.

First I check that the expression is syntactically valid and second I break it into individual pieces:

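import re

def parse(expr):
    # Validate the whole string first, then extract the individual items.
    # Returns None (implicitly) when expr is not well-formed.
    if re.match(r'\A(("[\w\s]+"|\w+)\s+)*\Z', expr):
        return re.findall(r'("[\w\s]+"|\w+)', expr)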

So:

>>> parse('foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']
>>> parse('foo "bar b-&&az" "bob" ')
>>> parse('foo "bar" ')
['foo', '"bar"']
>>> parse('"foo" bar ')
['"foo"', 'bar']
>>> parse('foo"bar baz" "bob" ')
>>> parse('&&')

I'm about 90% sure that this method works appropriately for all strings, but I would still be interested if anyone has a more general solution, since this seems sort of kludgey to me.
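One possible way to fold validation and extraction into a single pass is to match one item at a time and advance through the string, failing as soon as an item does not match. This is only a sketch under the same grammar as above; parse_single_pass and ITEM_THEN_SPACE are hypothetical names, and the loop relies on the pos argument of a compiled pattern's match() method.

```python
import re

# One item (quoted string or bare word) plus the whitespace that must follow it.
ITEM_THEN_SPACE = re.compile(r'("[\w\s]+"|\w+)\s+')

def parse_single_pass(expr):
    items = []
    pos = 0
    while pos < len(expr):
        # Try to match the next item starting exactly at pos.
        m = ITEM_THEN_SPACE.match(expr, pos)
        if m is None:
            return None  # malformed input: reject the whole expression
        items.append(m.group(1))
        pos = m.end()
    return items

print(parse_single_pass('foo "bar baz" "bob" '))     # ['foo', '"bar baz"', '"bob"']
print(parse_single_pass('foo "bar b-&&az" "bob" '))  # None
```

Because a quoted item cannot contain a quotation mark and a bare word cannot contain whitespace, each item has only one possible extent, so matching greedily from left to right should accept the same strings as the anchored regex.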

Thanks to SilentGhost and Alan Moore for the help. I did not know about Python's csv module or regex lookaheads before, so it might be helpful for me to learn about those.

sligocki