ansaurus

Question

Extracting ALL matches of a nested regular expression in python

Answer 1

+2 A:

I don't think that regex is the right tool here. Try the csv module:

>>> import csv
>>> s = 'foo "bar baz" "bob" '
>>> for i in csv.reader([s], delimiter=' '):
...     print(i)
...
['foo', 'bar baz', 'bob', '']

SilentGhost 2009-10-27 22:50:49

Thanks for the suggestion. However, in my application I need to distinguish between items that are quoted and ones that aren't, whereas this doesn't seem to do that.

sligocki 2009-10-27 23:08:55

And what is the purpose of this requirement?

SilentGhost 2009-10-27 23:12:47

sligocki 2009-10-27 23:15:53

Can't you tell which should be quoted from the order?

SilentGhost 2009-10-27 23:20:10

No, they could be in any order. For example, 'foo bob ', 'foo "bob" ', '"foo" bob ', and '"foo" "bob" ' are all valid inputs that should be mutually distinguishable.

sligocki 2009-10-27 23:25:39

Maybe I'm mistaken, but your input doesn't seem to be regular.

SilentGhost 2009-10-27 23:59:41

It's a little weird, but consider it to be an expression which mixes strings and variables. Thus 'foo "bar"' is the variable foo and the string "bar". I need to distinguish that foo is a variable and "bar" is a string.

sligocki 2009-10-28 00:18:01

So, there is an order!

SilentGhost 2009-10-28 00:23:01

What do you mean? Of course there is an order in the expression, but variables and strings could be in any order. Like I said, 'foo "bar"' and '"foo" bar' are both allowed but distinguishable. The first has the variable foo then the string "bar"; the second has the string "foo" then the variable bar.

sligocki 2009-10-28 00:26:34

Answer 2

+1 A:

Here's a solution that splits on any whitespace that isn't inside a pair of quotation marks:

re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)', target)

The lookahead succeeds only if there's an even number of quotation marks ahead of the just-matched whitespace. If quoted sections in your text can contain escaped quotes, you may need a more complicated regex, depending on how the escaping is done.
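A minimal sketch of how the split behaves, assuming Python 3. The name SPLIT_OUTSIDE_QUOTES and the sample strings are only for illustration; the second sample shows that run-together input is split but not rejected:

```python
import re

# Split on whitespace only when an even number of quotation marks
# remains ahead, i.e. when the whitespace is outside any quoted section.
SPLIT_OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

well_formed = SPLIT_OUTSIDE_QUOTES.split('foo "bar baz" "bob"')
print(well_formed)    # ['foo', '"bar baz"', '"bob"']

# The space inside "bar baz" has an odd number of quotes ahead of it,
# so the lookahead fails there and the quoted section stays intact.
run_together = SPLIT_OUTSIDE_QUOTES.split('foo"bar baz" "bob"')
print(run_together)   # ['foo"bar baz"', '"bob"']
```

Note that trailing whitespace in the target produces an empty final element, so stripping the input first may be worthwhile.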

Alan Moore 2009-10-28 00:10:03

Wow, that seems to work, now if I can just figure out how it works :)

sligocki 2009-10-28 00:16:15

Er, actually, there is a problem: this never rejects expressions. If target = 'foo"bar baz" "bob"', it returns ['foo"bar baz"', '"bob"'].

sligocki 2009-10-28 00:22:43

I suppose I could just check that each output is in the right format and fail if not, though.

sligocki 2009-10-28 00:23:31

Does your input really contain quoted and non-quoted sections run together like that? None of your examples do.

Alan Moore 2009-10-28 00:25:49

sligocki 2009-10-28 00:39:02

However, like I said, I could easily check each element in the list after your expression runs, and if all of the elements are well-formed, then I think the whole thing must be well-formed. So, no worries.

sligocki 2009-10-28 00:40:39
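The split-then-check idea from the comments above could look roughly like this. It is a sketch, not code from the thread: split_and_check and WELL_FORMED are hypothetical names, and the assumed token grammar is "a quoted string or a bare word":

```python
import re

SPLIT_OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')
# Assumed grammar for one piece: a quoted string or a bare word.
WELL_FORMED = re.compile(r'"[^"]*"|\w+')

def split_and_check(target):
    """Return the pieces of target, or None if any piece is malformed."""
    # Strip first so trailing whitespace does not yield an empty piece.
    pieces = SPLIT_OUTSIDE_QUOTES.split(target.strip())
    if all(WELL_FORMED.fullmatch(piece) for piece in pieces):
        return pieces
    return None

print(split_and_check('foo "bar baz" "bob" '))  # ['foo', '"bar baz"', '"bob"']
print(split_and_check('foo"bar baz" "bob"'))    # None
```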

Answer 3

A:

Alright, I ended up deciding to do this in two steps.

First I check that the expression is syntactically valid and second I break it into individual pieces:

import re

def parse(expr):
    # Validate the whole expression first; invalid input falls through to None.
    if re.match(r'\A(("[\w\s]+"|\w+)\s+)*\Z', expr):
        return re.findall(r'("[\w\s]+"|\w+)', expr)

So:

>>> parse('foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']
>>> parse('foo "bar b-&&az" "bob" ')
>>> parse('foo "bar" ')
['foo', '"bar"']
>>> parse('"foo" bar ')
['"foo"', 'bar']
>>> parse('foo"bar baz" "bob" ')
>>> parse('&&')

I'm about 90% sure that this method works appropriately for all strings, but I would still be interested if anyone has a more general solution. This seems sort of kludgey to me.

Thanks SilentGhost and Alan Moore for the help. I did not know about Python's csv module or regex lookaheads before; it might be helpful for me to learn about those.

sligocki 2009-10-30 20:26:46
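One possible single-pass alternative, offered as a sketch rather than a definitive answer: match one token at a time from the current position and fail as soon as nothing matches. It assumes the same grammar as parse above (quoted strings of word characters and whitespace, or bare words), and the names TOKEN and tokenize are hypothetical. Each token is tagged as a variable or a string, so the two kinds stay distinguishable:

```python
import re

# One token: optional leading whitespace, then a quoted string or a bare
# word, which must be followed by whitespace or the end of the input.
TOKEN = re.compile(r'\s*(?:"(?P<string>[\w\s]+)"|(?P<variable>\w+))(?=\s|$)')

def tokenize(expr):
    """Return a list of (kind, text) pairs, or None if expr is malformed."""
    tokens = []
    pos = 0
    end = len(expr.rstrip())  # ignore trailing whitespace
    while pos < end:
        m = TOKEN.match(expr, pos)
        if m is None:
            return None  # nothing valid starts at this position
        kind = m.lastgroup  # 'string' or 'variable'
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

print(tokenize('foo "bar baz" "bob" '))
# [('variable', 'foo'), ('string', 'bar baz'), ('string', 'bob')]
print(tokenize('foo"bar baz" "bob" '))  # None
```

Unlike parse, this version does not require whitespace after the final token, and it returns the string contents without the surrounding quotes.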
