I have this .NET regex:

^(?<prefix>("[^"]*"))\s(?<attrgroup>(\([^\)]*\)))\s(?<suffix>("[^"]*"))$

It properly matches the following strings:

"some prefix" ("attribute 1" "value 1") "some suffix"
"some prefix" ("attribute 1" "value 1" "attribute 2" "value 2") "some suffix"

It fails on...

"some prefix" ("attribute 1" "value (fail) 1") "some suffix"

...due to the right paren after "fail".

How can I modify my regex so that the attrgroup match group will end up containing "("attribute 1" "value (fail) 1")"? I've been looking at it for too long and need some fresh eyes. Thanks!

Edit: attrgroup won't ever contain anything other than pairs of double-quoted strings.

+1  A: 
^(?<prefix>"[^"]*")\s+(?<attrgroup>\(.*\))\s+(?<suffix>"[^"]*")$

fixed it for me.

I removed the extraneous unnamed groups and simplified (down to "any character") the attribute group.

A very worthwhile investment would be JGsoft's RegexBuddy.

Edit: This won't validate the attribute group as pairs of quoted strings, but that should/could be done in a separate regex/validation step.
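
A minimal sketch of that two-step idea, written in Python rather than .NET, so the named groups use Python's (?P<name>...) spelling. The PAIRS pattern and the extract function are hypothetical names introduced for illustration; they are one way to do the separate validation step, not part of the answer itself.

```python
import re

# Python spelling of the liberal pattern above:
# (?P<name>...) instead of .NET's (?<name>...).
LINE = re.compile(
    r'^(?P<prefix>"[^"]*")\s+(?P<attrgroup>\(.*\))\s+(?P<suffix>"[^"]*")$'
)

# Hypothetical second step: the group must hold zero or more
# whitespace-separated pairs of double-quoted strings.
PAIRS = re.compile(r'^\(\s*(?:"[^"]*"\s+"[^"]*"\s*)*\)$')


def extract(line):
    """Return (prefix, attrgroup, suffix), or None if either step fails."""
    m = LINE.match(line)
    if m is None or PAIRS.match(m.group("attrgroup")) is None:
        return None
    return m.group("prefix"), m.group("attrgroup"), m.group("suffix")
```

The first pattern only locates the three parts; the second rejects groups with an odd number of strings or with unquoted content.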

hometoast
A: 

hometoast's solution is a good one, though like any liberal regex it should only be used to extract data from sources you can reasonably trust to be well formed, not for validation.

ICR
A: 

Without addressing the specifics of this regex, I would recommend using a regex tool to help build, test, and validate your regular expressions. For anything non-trivial, or for expressions you may need to maintain or update, these sorts of tools are essential.

Check out...

The Regex Coach - Written in Lisp, a bit older, but I really prefer this one to others.

Rad Software Regex Designer - .NET and more "modern" perhaps. Some may like this one.

theraccoonbear
+2  A: 

My untested guess:

^(?<prefix>("[^"]*"))\s(?<attrgroup>(\(("[^"]*")(\s("[^"]*")*)*\)))\s(?<suffix>("[^"]*"))$

Here I've replaced

[^\)]*

with

("[^"]*")(\s("[^"]*")*)*

I assumed everything within the parentheses is either between double quotes or is whitespace.

If you want to know how I came up with this, read Mastering Regular Expressions.

PS: If I'm correct, this will also validate that the attribute group contains only quoted strings, although it does not check that they come in pairs.
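
A rough Python sketch of the same idea, for anyone who wants to try it outside .NET. It assumes Python's (?P<name>...) syntax for named groups and flattens the nested quantifier into a single repeated group; split_line is a hypothetical helper name, not something from the question.

```python
import re

# Assumption: Python syntax, so named groups are written (?P<name>...).
# The attribute group is one quoted string followed by any number of
# whitespace-separated quoted strings, so a ')' inside quotes is harmless.
PATTERN = re.compile(
    r'^(?P<prefix>"[^"]*")\s'
    r'(?P<attrgroup>\("[^"]*"(?:\s"[^"]*")*\))\s'
    r'(?P<suffix>"[^"]*")$'
)


def split_line(line):
    """Return (prefix, list of quoted items, suffix), or None on no match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Pull the individual quoted strings out of the attribute group.
    items = re.findall(r'"([^"]*)"', m.group("attrgroup"))
    return m.group("prefix"), items, m.group("suffix")
```

Because the closing paren can only match outside a quoted string, the "value (fail) 1" case no longer ends the group early.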

Patrick Huizinga
You are mixing up '/' and '\'. This is probably closer to what you want: ^(?<prefix>("[^"]*"))\s(?<attrgroup>(\((\s+|"[^"]*")*\)))\s(?<suffix>("[^"]*"))$
MizardX
Whoops! Thanks, edited.
Patrick Huizinga
A: 

I suggest using a parser that is able to handle such structures. The regular expression fails, and rightly so, as the language you are trying to parse doesn't look regular, at least from the examples given above. Whenever you need to recognize nesting, regexps will either fail or grow into complicated beasts like the one above. Even if the language is regular, that regular expression looks way too complicated to me. I'd rather use something like this:

def parse_string(string):
    # Expects: "prefix" ("attribute" "value" ...) "suffix"
    # Malformed input raises AssertionError or IndexError.
    index = skip_spaces(string, 0)
    index, prefix = read_prefix(string, index)
    index = skip_spaces(string, index)
    index, attrgroup = read_attrgroup(string, index)
    index = skip_spaces(string, index)
    index, suffix = read_suffix(string, index)
    return prefix, attrgroup, suffix

def read_prefix(string, start_index):
    return read_quoted_string(string, start_index)

def read_attrgroup(string, start_index):
    # Read "attribute" "value" pairs until the closing paren. A paren inside
    # a quoted string is consumed by read_quoted_string, so it cannot end the
    # group early.
    assert string[start_index] == '(', '(!=' + string[start_index]
    index = skip_spaces(string, start_index + 1)
    pairs = []
    while string[index] != ')':
        index, attribute = read_quoted_string(string, index)
        index = skip_spaces(string, index)
        index, value = read_quoted_string(string, index)
        index = skip_spaces(string, index)
        pairs.append((attribute, value))
    return index + 1, pairs

def read_suffix(string, start_index):
    return read_quoted_string(string, start_index)

def read_quoted_string(string, start_index):
    return read_delimited_string(string, start_index, '"', '"')

def read_delimited_string(string, starting_index, start_limiter, end_limiter):
    assert string[starting_index] == start_limiter, (start_limiter
                                                     + "!="
                                                     + string[starting_index])
    # Everything up to the next end_limiter is the content.
    end_index = string.find(end_limiter, starting_index + 1)
    assert end_index != -1, "missing closing " + end_limiter
    return end_index + 1, string[starting_index + 1:end_index]

def skip_spaces(string, index):
    # Stop at the end of the string instead of running past it.
    while index < len(string) and string[index] == " ":
        index += 1
    return index

Yes, this is more code, and yes, by raw number of keystrokes, this took longer. However, at least for me, my solution is far easier to verify. That advantage grows if you remove much of the string-and-index plumbing by moving it into a class that parses such strings in its constructor. Furthermore, it is easy to make the space-skipping implicit, using some magic next-char method that skips characters until a non-space appears, unless it is in a non-skip mode because it is inside a string (this mode can be set in the delimited-string function, for example). This would turn parse_string into:

# Sketch: the string and the current index now live on the parser object.
def parse_string(self):
    prefix = self.read_prefix()
    attrgroup = self.read_attrgroup()
    suffix = self.read_suffix()
    return prefix, attrgroup, suffix
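
One way the class-based variant described above might look, as a sketch only. LineParser, _peek and _expect are hypothetical names chosen for illustration, and the "non-skip mode" is approximated by reading straight to the closing quote instead of going through the space-skipping helper.

```python
class LineParser:
    """Hypothetical class; the names are illustrative, not from the answer."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _peek(self):
        # Skip spaces implicitly, then return the next character ('' at end).
        while self.pos < len(self.text) and self.text[self.pos] == " ":
            self.pos += 1
        return self.text[self.pos:self.pos + 1]

    def _expect(self, char):
        if self._peek() != char:
            raise ValueError("expected %r at index %d" % (char, self.pos))
        self.pos += 1

    def read_quoted_string(self):
        self._expect('"')
        # Inside quotes nothing is skipped: take everything up to the next quote.
        end = self.text.find('"', self.pos)
        if end == -1:
            raise ValueError("unterminated string")
        content = self.text[self.pos:end]
        self.pos = end + 1
        return content

    # Prefix and suffix are plain quoted strings.
    read_prefix = read_suffix = read_quoted_string

    def read_attrgroup(self):
        self._expect("(")
        pairs = []
        while self._peek() == '"':
            name = self.read_quoted_string()
            value = self.read_quoted_string()
            pairs.append((name, value))
        self._expect(")")
        return pairs

    def parse(self):
        prefix = self.read_prefix()
        attrgroup = self.read_attrgroup()
        suffix = self.read_suffix()
        if self._peek() != "":
            raise ValueError("unexpected trailing input at index %d" % self.pos)
        return prefix, attrgroup, suffix
```

Usage would be something like LineParser(line).parse(), which returns the prefix, a list of (attribute, value) pairs, and the suffix, or raises ValueError on malformed input.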

Furthermore, these functions are easier to extend to cover more complicated expressions. Arbitrarily nested attrgroups? A change of one line of code. Nested parens? A bit more work, but no real problem.
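
As a rough illustration of the "nested parens" extension, here is a sketch of a reader that tracks nesting depth and ignores parentheses inside double quotes. The function name read_balanced_paren is hypothetical; it follows the same (next index, content) return convention as the read_* functions above, and it assumes quotes are never escaped.

```python
def read_balanced_paren(string, start_index):
    """Return (index after the matching ')', text between the outer parens)."""
    if string[start_index:start_index + 1] != "(":
        raise ValueError("expected '(' at index %d" % start_index)
    depth = 0
    in_quotes = False
    for index in range(start_index, len(string)):
        char = string[index]
        if char == '"':
            # Assumption: no escaped quotes, so every '"' toggles the mode.
            in_quotes = not in_quotes
        elif in_quotes:
            continue  # parens inside a quoted string do not count
        elif char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                return index + 1, string[start_index + 1:index]
    raise ValueError("unbalanced parentheses")
```

The depth counter is what a plain regular expression cannot express in general, which is the point being made about nesting.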

Now, please flame and downvote me for being some regex-heretic and some parser-advocator. >:)

PS: Yes, that code is untested. Knowing myself, there are probably three typos in there that I did not see.

Tetha