Edit: I'm really just curious as to how I can get this regex to work. Please don't tell me there are easier ways to do it. That's obvious! :P

I'm writing a regular expression (using Python) to parse lines in a configuration file. Lines could look like this:

someoption1 = some value # some comment
# this line is only a comment
someoption2 = some value with an escaped \# hash
someoption3 = some value with a \# hash # some comment

The idea is that anything after a hash symbol is considered to be a comment, except if the hash is escaped with a backslash.

I'm trying to use a regex to break each line into its individual pieces: leading whitespace, left side of the assignment, right side of the assignment, and comment. For the first line in the example, the breakdown would be:

  • Whitespace: ""
  • Assignment left: "someoption1 ="
  • Assignment right: " some value "
  • Comment: "# some comment"

This is the regex I have so far:

^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$

I'm terrible with regex, so feel free to tear it apart!

Using Python's re.findall(), this returns:

  • 0th index: the whitespace, as it should be
  • 1st index: the left side of the assignment
  • 2nd index: The right side of the assignment, up to the first hash, whether escaped or not (which is incorrect)
  • 5th index: The first hash, whether escaped or not, and anything after it (which is incorrect)

There's probably something fundamental about regular expressions that I'm missing. If somebody can solve this I'll be forever grateful...

A: 

I wouldn't use a regex for this at all, for the same reason I wouldn't try to kill a fly with a thermo-nuclear warhead.

Assuming you're reading a line in at a time, just:

  • if the first character is a #, set comment to the whole line and empty the line.
  • otherwise, find the first occurrence of # not immediately after \, set the comment to that plus the rest of the line and set the line to everything before that.
  • replace all occurrences of \# in the line with #.

That's it, you now have a proper line and a comment section. Use regexes to split up the new line section by all means.

For example:

import re

def fn(line):
    # Split line into non-comment and comment.

    comment = ""
    if line.startswith("#"):
        # Whole line is a comment (startswith is also safe on an empty line).
        comment = line
        line = ""
    else:
        # Find the first '#' that is not immediately preceded by a backslash.
        match = re.search(r"[^\\]#", line)
        if match is not None:
            comment = line[match.start()+1:]
            line = line[:match.start()+1]

    # Split non-comment into key and value.

    match = re.search(r"=", line)
    if match is None:
        key = line
        val = ""
    else:
        key = line[:match.start()]
        val = line[match.start()+1:]
    val = val.replace("\\#", "#")    # unescape any \# left in the value

    return (key.strip(), val.strip(), comment.strip())

print(fn(r"someoption1 = some value # some comment"))
print(fn(r"# this line is only a comment"))
print(fn(r"someoption2 = some value with an escaped \# hash"))
print(fn(r"someoption3 = some value with a \# hash # some comment"))

produces:

('someoption1', 'some value', '# some comment')
('', '', '# this line is only a comment')
('someoption2', 'some value with an escaped # hash', '')
('someoption3', 'some value with a # hash', '# some comment')

If you must use a regex (against my advice), your specific problem lies here:

[^\#]

This (assuming you meant the properly escaped r"[^\\#]") will attempt to match any character other than either \ or #, not the sequence \# as you desire. You can use negative look-behinds to do it but I always say that, once a regular expression becomes unreadable to a moron in a hurry, it's better to revert to procedural code :-)
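To make the difference concrete, here is a small sketch contrasting the character class with a negative look-behind. The sample string is made up for illustration:

```python
import re

sample = r"value with a \# hash # comment"

# A negated character class matches ONE character that is neither \ nor #,
# so it splits the text at the backslash and at every hash.
chunks = re.findall(r"[^\\#]+", sample)

# A negative look-behind only accepts a '#' with no backslash before it.
first_real_hash = re.search(r"(?<!\\)#", sample).start()

print(chunks)
print(sample[first_real_hash:])
```

The class has no notion of "the sequence \#"; only the look-behind version skips the escaped hash and lands on the real comment marker.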


On reflection, a better way to do it is with a multi-level split (so the regex doesn't have to get too hideous by handling missing fields), as follows:

def fn(line):
    line = line.strip()                                     # remove spaces
    first = re.split(r"\s*(?<!\\)#\s*", line, maxsplit=1)   # get non-comment/comment
    if len(first) == 1: first.append("")                    # ensure we have a comment
    first[0] = first[0].replace("\\#", "#")                 # unescape non-comment

    second = re.split(r"\s*=\s*", first[0], maxsplit=1)     # get key and value
    if len(second) == 1: second.append("")                  # ensure we have a value
    second.append(first[1])                                 # create 3-item list
    return second                                           # and return it

This uses a negative look-behind to correctly match the comment separator, then separates the non-comment bit into key and value. Spaces are handled correctly in this one as well, yielding:

['someoption1', 'some value', 'some comment']
['', '', 'this line is only a comment']
['someoption2', 'some value with an escaped # hash', '']
['someoption3', 'some value with a # hash', 'some comment']
paxdiablo
Point taken. I realize it'd be pretty easy to write it another way :P I mostly just want to know why that regex doesn't work. That's why I asked the question!
apeace
+1  A: 

I've left a comment about the purpose of this question, but supposing this question is purely about regular expressions, I'll still give the answer a shot.

Assuming you're dealing with input one line at a time, I would do this in two passes. This means you'll have 2 regular expressions.

  1. Something along the lines of (.*?(?<!\\))#(.*): split at first # not preceded by \ (see documentation on negative lookbehind);
  2. Assignment statement expression parsing.
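A minimal sketch of the two passes, assuming the first expression above is used as-is. The assignment pattern, the function name parse_line and the choice to strip whitespace are illustrative assumptions, not part of the original answer:

```python
import re

# Pass 1: everything up to the first '#' that is not preceded by a backslash.
COMMENT_RE = re.compile(r"(.*?(?<!\\))#(.*)")
# Pass 2 (assumed form): "key = value" with optional surrounding whitespace.
ASSIGN_RE = re.compile(r"\s*(\S+)\s*=\s*(.*)")

def parse_line(line):
    m = COMMENT_RE.match(line)
    if m:
        body, comment = m.group(1), m.group(2).strip()
    else:
        body, comment = line, ""          # no unescaped '#' on this line
    body = body.replace("\\#", "#")       # unescape \# in the remainder
    m = ASSIGN_RE.match(body)
    if m:
        return m.group(1), m.group(2).strip(), comment
    return "", "", comment                # comment-only or blank line

print(parse_line(r"someoption3 = some value with a \# hash # some comment"))
print(parse_line(r"# this line is only a comment"))
```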
André Caron
This seems to be what I was looking for. I'll go look up negative lookbehinds. Thanks for the tip!
apeace
A: 

Try breaking it down into 2 steps:

  1. Escape processing to recognise true comments (first # not preceded by \ (hint: "negative lookbehind")), remove true comments, then replace r"\#" by "#"

  2. Process the comment-free remainder.

BIG HINT: use re.VERBOSE with comments
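As a rough illustration of what re.VERBOSE buys you, here is one possible commented pattern. The group names, the helper parse and the exact sub-patterns are assumptions made for this sketch. Note that in verbose mode a literal # has to be escaped (or placed in a character class), otherwise it starts a pattern comment:

```python
import re

LINE_RE = re.compile(r"""
    ^\s*                        # leading whitespace
    (?:
        (?P<key>[^\s=\#]+)      # option name: no spaces, '=' or '#'
        \s*=\s*                 # the assignment sign
        (?P<value>
            (?:[^\\\#]|\\.)*    # ordinary characters or backslash escapes
        )
    )?
    (?P<comment>\#.*)?          # an unescaped '#' starts the comment
    $
""", re.VERBOSE)

def parse(line):
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Trim the value and turn \# back into a plain hash.
    value = (m.group("value") or "").strip().replace("\\#", "#")
    return m.group("key"), value, m.group("comment")

print(parse(r"someoption3 = some value with a \# hash # some comment"))
print(parse(r"# this line is only a comment"))
```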

John Machin
+1  A: 

Your regular expression isn't matching as you want because of the greedy matching behaviour of regular expressions: each part will match the longest substring such that the rest of the string can still be matched with the remainder of the regular expression.

What this means in the case of one of your lines with an escaped # is:

  • The [^\#]* (there's no need to escape # btw) will match everything before the first hash, including the backslash before it
  • The (\\\#)* won't match anything, as the string at this point starts with a #
  • The final (\#.*) will match the rest of the string
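The three steps above can be observed directly by running the regex from the question on one of the sample lines and inspecting the groups (a quick sketch; the variable names are just for illustration):

```python
import re

# The regex from the question, unchanged.
original = re.compile(r"^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$")

m = original.match(r"someoption2 = some value with an escaped \# hash")
right_side = m.group(3)   # what the greedy [^\#]* part consumed
comment = m.group(6)      # what was left for the final (\#.*)

print(repr(right_side))
print(repr(comment))
```

The right side ends with the backslash and the "comment" starts at the escaped hash, which is exactly the incorrect split described in the question.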

A simple example to emphasise this potentially unintuitive behaviour: in the regular expression (a*)(ab)?(b*), the optional (ab)? will never match anything, because the greedy (a*) has already consumed every leading a.

I believe this regular expression (based on the original one) should work: ^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$
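A quick check of this expression against the four sample lines from the question (group 1 is the whole assignment, group 3 the comment; the list and variable names are illustrative):

```python
import re

fixed = re.compile(r"^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$")

lines = [
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption2 = some value with an escaped \# hash",
    r"someoption3 = some value with a \# hash # some comment",
]

results = []
for line in lines:
    m = fixed.match(line)
    # Keep only the assignment part and the comment part.
    results.append((m.group(1), m.group(3)))
    print(results[-1])
```

The escaped hashes now stay in the assignment part, and a group that did not take part in the match comes back as None.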

dave
Thanks for the info dude!
apeace
Oh, and hashes have to be escaped in regex's in Python. Now you know!
apeace
@apeace Really? The python documentation makes no mention of this, and I seem to be able to use unescaped #s without issue...
dave
@apeace: `#` is a magic character in Python re syntax **only** in re.VERBOSE mode ... which you should be using unless your code is write-only
John Machin
@dave: Thanks! Your regex works perfectly.
apeace
@john-machin: Yep--you're right about `#`. Totally misread the doc on that one.
apeace
+1  A: 

I would use this regular expression in multiline mode:

^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)

This allows any character to be escaped (\\.). If you just want to allow the #, use \\# instead.
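A sketch of how this might be applied to a whole file's text at once. One change here is an assumption of this example and not part of the expression above: \n is added to the negated class, because otherwise [^\\#] also matches newlines and a value on a line without a comment would run on into the next line.

```python
import re

# Same expression, with "\n" added to the negated class (assumed tweak).
OPTION_RE = re.compile(
    r"^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#\n]|\\.)+)",
    re.MULTILINE,
)

text = "\n".join([
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption2 = some value with an escaped \# hash",
    r"someoption3 = some value with a \# hash # some comment",
])

# findall returns one (name, raw value) tuple per matching line.
options = [(name, value.strip()) for name, value in OPTION_RE.findall(text)]
print(options)
```

The values still contain the backslash escapes, so a final replace of \# by # would be needed if the unescaped text is wanted.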

Gumbo