I suggest using a parser that can handle such structures. The regular expression fails, and that is to be expected, because the language you are trying to parse does not look regular, at least judging from the examples given above. Whenever you need to recognize nesting, regexps will either fail or grow into complicated beasts like the one above. Even if the language is regular, that regular expression looks way too complicated to me. I'd rather use something like this:

def parse_string(string):
    # Expected shape: "prefix" ("first" "second") "suffix", spaces allowed between tokens.
    index = skip_spaces(string, 0)
    index, prefix = read_prefix(string, index)
    index = skip_spaces(string, index)
    index, attrgroup = read_attrgroup(string, index)
    index = skip_spaces(string, index)
    index, suffix = read_suffix(string, index)
    return prefix, attrgroup, suffix

def read_prefix(string, start_index):
    return read_quoted_string(string, start_index)

def read_attrgroup(string, start_index):
    # Grab everything between the parens first, then parse the two entries inside it.
    end_index, content = read_paren(string, start_index)
    index = skip_spaces(content, 0)
    index, first_entry = read_quoted_string(content, index)
    index = skip_spaces(content, index)
    index, second_entry = read_quoted_string(content, index)
    return end_index, (first_entry, second_entry)

def read_suffix(string, start_index):
    return read_quoted_string(string, start_index)

def read_paren(string, start_index):
    return read_delimited_string(string, start_index, '(', ')')

def read_quoted_string(string, start_index):
    return read_delimited_string(string, start_index, '"', '"')

def read_delimited_string(string, start_index, start_limiter, end_limiter):
    # Returns (index just past the closing delimiter, text between the delimiters).
    assert start_index < len(string), "unexpected end of input"
    assert string[start_index] == start_limiter, (
        start_limiter + "!=" + string[start_index])
    end_index = string.find(end_limiter, start_index + 1)
    assert end_index != -1, "missing closing " + end_limiter
    return end_index + 1, string[start_index + 1:end_index]

def skip_spaces(string, index):
    # Stop at the end of the string instead of running past it.
    while index < len(string) and string[index] == " ":
        index += 1
    return index

Yes, this is more code, and yes, by raw number of keystrokes, it took longer to write. However, at least for me, this solution is far easier to verify. The advantage grows if you remove most of the string-and-index plumbing by moving it into a class that parses such strings in its constructor. Furthermore, it is easy to make the space-skipping implicit: use some magic next-char method that skips characters until a non-space appears, unless it is in a non-skipping mode because it is inside a string. That mode can be set in the delimited-string function, for example. This would turn parse_string into:

def parse_string(self):
    # Method of the parser class; the string and the current index live on self.
    prefix = self.read_prefix()
    attrgroup = self.read_attrgroup()
    suffix = self.read_suffix()
    return prefix, attrgroup, suffix

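To make the class-based idea concrete, here is a minimal sketch of how it could look. The class name `AttrStringParser`, the `result` attribute, the `expect` helper, and the input shape `"prefix" ("first" "second") "suffix"` are my assumptions for illustration. Unlike the function version, this sketch reads the attrgroup in place rather than extracting the paren content first.

```python
class AttrStringParser:
    """Hypothetical parser class: keeps the string and index as state."""

    def __init__(self, string):
        self.string = string
        self.index = 0
        self.skipping = True  # switched off while inside a delimited string
        self.result = self.parse_string()

    def next_char(self):
        # Skip spaces unless we are inside a delimited string, then
        # return the next character and advance.
        while (self.skipping and self.index < len(self.string)
               and self.string[self.index] == " "):
            self.index += 1
        if self.index >= len(self.string):
            raise ValueError("unexpected end of input")
        char = self.string[self.index]
        self.index += 1
        return char

    def expect(self, expected):
        char = self.next_char()
        if char != expected:
            raise ValueError(f"expected {expected!r}, got {char!r}")

    def read_delimited_string(self, start_limiter, end_limiter):
        self.expect(start_limiter)
        self.skipping = False  # spaces between the delimiters are content
        try:
            content = ""
            char = self.next_char()
            while char != end_limiter:
                content += char
                char = self.next_char()
        finally:
            self.skipping = True
        return content

    def read_quoted_string(self):
        return self.read_delimited_string('"', '"')

    def read_prefix(self):
        return self.read_quoted_string()

    def read_suffix(self):
        return self.read_quoted_string()

    def read_attrgroup(self):
        self.expect("(")
        first_entry = self.read_quoted_string()
        second_entry = self.read_quoted_string()
        self.expect(")")
        return first_entry, second_entry

    def parse_string(self):
        prefix = self.read_prefix()
        attrgroup = self.read_attrgroup()
        suffix = self.read_suffix()
        return prefix, attrgroup, suffix


parsed = AttrStringParser('  "pre"  ( "a b"   "c" ) "suf"').result
```

The only place that knows about spaces is `next_char`, so none of the read methods has to call a skip function explicitly.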
Furthermore, these functions are easier to extend to more complicated expressions. Arbitrarily nested attrgroups? A change of one line of code. Nested parens? A bit more work, but no real problem.
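As a rough illustration of the nested case, here is a sketch of a group reader that calls itself whenever it meets another opening paren. The name `read_group` and the rule that a group holds any number of quoted strings or nested groups are assumptions of mine, not something the question specifies.

```python
def skip_spaces(string, index):
    while index < len(string) and string[index] == " ":
        index += 1
    return index


def read_quoted_string(string, index):
    if index >= len(string) or string[index] != '"':
        raise ValueError(f"expected a double quote at position {index}")
    end = string.find('"', index + 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return end + 1, string[index + 1:end]


def read_group(string, index):
    # A group is "(" followed by any number of entries and a closing ")".
    # An entry is either a quoted string or, recursively, another group.
    if index >= len(string) or string[index] != "(":
        raise ValueError(f"expected '(' at position {index}")
    index = skip_spaces(string, index + 1)
    entries = []
    while index < len(string) and string[index] != ")":
        if string[index] == "(":
            index, entry = read_group(string, index)
        else:
            index, entry = read_quoted_string(string, index)
        entries.append(entry)
        index = skip_spaces(string, index)
    if index >= len(string):
        raise ValueError("missing closing ')'")
    return index + 1, entries


end_index, entries = read_group('("a" ("b" ("c")) "d")', 0)
```

The recursion is exactly the part a regular expression cannot express: each nesting level is just another call to `read_group`, and the call stack keeps track of the depth.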
Now, please flame and downvote me for being some regex-heretic and some parser-advocator. >:)
PS: Yes, that code is untested. Knowing myself, there are 3 typos in there that I did not see.