ansaurus

Question

Regular expression to parse a commented configuration file

Answer 1

A:

I wouldn't use a regex for this at all, for the same reason I wouldn't try to kill a fly with a thermonuclear warhead.

Assuming you're reading a line in at a time, just:

if the first character is a #, set comment to the whole line and empty the line.
otherwise, find the first occurrence of # not immediately after \, set the comment to that plus the rest of the line and set the line to everything before that.
replace all occurrences of \# in the line with #.

That's it, you now have a proper line and a comment section. Use regexes to split up the new line section by all means.

For example:

import re

def fn(line):
    # Split line into non-comment and comment.

    comment = ""
    if line.startswith("#"):    # also safe for an empty line
        comment = line
        line = ""
    else:
        # Find the first "#" that is not immediately after a backslash.
        idx = re.search(r"[^\\]#", line)
        if idx is not None:
            comment = line[idx.start()+1:]
            line = line[:idx.start()+1]

    # Split non-comment into key and value.

    idx = re.search(r"=", line)
    if idx is None:
        key = line
        val = ""
    else:
        key = line[:idx.start()]
        val = line[idx.start()+1:]
    val = val.replace("\\#", "#")    # unescape "\#" in the value

    return (key.strip(), val.strip(), comment.strip())

print(fn(r"someoption1 = some value # some comment"))
print(fn(r"# this line is only a comment"))
print(fn(r"someoption2 = some value with an escaped \# hash"))
print(fn(r"someoption3 = some value with a \# hash # some comment"))

produces:

('someoption1', 'some value', '# some comment')
('', '', '# this line is only a comment')
('someoption2', 'some value with an escaped # hash', '')
('someoption3', 'some value with a # hash', '# some comment')

If you must use a regex (against my advice), your specific problem lies here:

[^\#]

This (assuming you meant the properly escaped r"[^\\#]") will attempt to match any character other than either \ or #, not the sequence \# as you desire. You can use negative look-behinds to do it but I always say that, once a regular expression becomes unreadable to a moron in a hurry, it's better to revert to procedural code :-)
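To make the difference concrete, here is a small sketch (the sample strings are made up for illustration) comparing the character class with a negative look-behind:

```python
import re

sample = r"value with a \# hash # comment"

# [^\\#] is a character class: it matches ONE character that is neither
# a backslash nor a hash. It knows nothing about the two-character pair "\#".
single_chars = re.findall(r"[^\\#]", r"a\#b")

# (?<!\\)# matches a hash only if the character before it is not a backslash.
unescaped = re.search(r"(?<!\\)#", sample)

print(single_chars)                 # ['a', 'b']
print(sample[unescaped.start():])   # # comment
```

The class simply skips over both characters of the escaped hash, while the look-behind steps past the escaped hash and lands on the real comment marker.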

On reflection, a better way to do it is with a multi-level split (so the regex doesn't have to get too hideous by handling missing fields), as follows:

def fn(line):
    line = line.strip()                            # remove spaces
    first = re.split (r"\s*(?<!\\)#\s*", line, 1)  # get non-comment/comment
    if len(first) == 1: first.append ("")          # ensure we have a comment
    first[0] = first[0].replace("\\#","#")         # unescape non-comment

    second = re.split (r"\s*=\s*", first[0], 1)    # get key and value
    if len(second) == 1: second.append ("")        # ensure we have a value
    second.append (first[1])                       # create 3-tuple
    return second                                  # and return it

This uses a negative look-behind to correctly match the comment separator, then separates the non-comment bit into key and value. Spaces are handled correctly in this one as well, yielding:

['someoption1', 'some value', 'some comment']
['', '', 'this line is only a comment']
['someoption2', 'some value with an escaped # hash', '']
['someoption3', 'some value with a # hash', 'some comment']

paxdiablo 2010-09-24 01:37:59

Point taken. I realize it'd be pretty easy to write it another way :P I mostly just want to know why that regex doesn't work. That's why I asked the question!

apeace 2010-09-24 01:51:27

Answer 2

+1 A:

I've left a comment about the purpose of this question, but supposing this question is purely about regular expressions, I'll still give the answer a shot.

Assuming you're dealing with input one line at a time, I would go about this in two passes. This means you'll have 2 regular expressions:

Something along the lines of (.*?(?<!\\))#(.*): split at the first # not preceded by \ (see the documentation on negative lookbehind);
Assignment statement expression parsing.
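A minimal sketch of those two passes might look like the following. The names COMMENT_RE, ASSIGN_RE and parse_line are hypothetical, and the second pattern assumes a plain "key = value" format, which the answer does not spell out:

```python
import re

# Pass 1: split at the first "#" that is not preceded by a backslash.
COMMENT_RE = re.compile(r"(.*?(?<!\\))#(.*)")
# Pass 2: a simple "key = value" assignment (assumed format).
ASSIGN_RE = re.compile(r"\s*([^=\s]+)\s*=\s*(.*)")

def parse_line(line):
    """Return (key, value, comment) for one line without its newline."""
    m = COMMENT_RE.match(line)
    body, comment = (m.group(1), m.group(2)) if m else (line, "")
    body = body.replace("\\#", "#")      # unescape hashes in the body
    m = ASSIGN_RE.match(body)
    key, value = (m.group(1), m.group(2)) if m else ("", "")
    return key, value.strip(), comment.strip()

print(parse_line(r"someoption3 = some value with a \# hash # some comment"))
```

Keeping the comment split separate means the assignment pattern never has to know about escaping at all.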

André Caron 2010-09-24 01:48:34

This seems to be what I was looking for. I'll go look up negative lookbehinds. Thanks for the tip!

apeace 2010-09-24 01:56:15

Answer 3

A:

Try breaking it down into 2 steps:

Escape processing to recognise true comments (first # not preceded by \ (hint: "negative lookbehind")), remove true comments, then replace r"\#" by "#"
Process the comment-free remainder.

BIG HINT: use re.VERBOSE with comments
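As an illustration of the hint, here is one way the first step could be written with re.VERBOSE. LINE_RE and split_comment are hypothetical names, and the sketch assumes it is given a single line:

```python
import re

# In re.VERBOSE mode, whitespace in the pattern is ignored and "#" starts
# a pattern comment, so a literal hash has to be written as \# (or [#]).
LINE_RE = re.compile(r"""
    ^ \s*
    (?P<body> .*? )        # everything before the comment, matched lazily
    \s*
    (?: (?<!\\) \#         # a hash that is NOT preceded by a backslash ...
        \s*
        (?P<comment> .* )  # ... starts the comment, which runs to the end
    )?
    $
""", re.VERBOSE)

def split_comment(line):
    """Return (body, comment) with escaped hashes in the body unescaped."""
    m = LINE_RE.match(line)
    body = m.group("body").replace(r"\#", "#")
    return body, m.group("comment") or ""

print(split_comment(r"someoption3 = some value with a \# hash # some comment"))
```

The body returned here is the comment-free remainder that the second step would then process.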

John Machin 2010-09-24 01:48:56

Answer 4

+1 A:

Your regular expression isn't matching as you want because of the greedy matching behaviour of regular expressions: each part will match the longest substring such that the rest of the string can still be matched with the remainder of the regular expression.

What this means in the case of one of your lines with an escaped # is:

The [^\#]* (there's no need to escape # btw) will match everything before the first hash, including the backslash before it
The (\\\#)* won't match anything, as the string at this point starts with a #
The final (\#.*) will match the rest of the string
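The three steps can be checked directly. The full regex from the question is not reproduced in this thread, so the pattern below is only an assumption assembled from the three parts quoted above, and the sample line is made up:

```python
import re

# Assumed pattern: the three quoted parts, each wrapped in a group.
pattern = re.compile(r"([^\#]*)(\\\#)*(\#.*)")

line = r"opt = a \# b # c"
m = pattern.match(line)

print(repr(m.group(1)))  # 'opt = a \\'  (runs up to the first hash, backslash included)
print(repr(m.group(2)))  # None          (the escaped-hash group never takes part)
print(repr(m.group(3)))  # '# b # c'     (the "comment" starts at the escaped hash)
```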

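A simple example to emphasise this potentially unintuitive behaviour: in the regular expression (a*)(ab)*(b*), matched without an end anchor, the (ab)* will never match anything, because the greedy a* has already consumed every a in front of it.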

I believe this regular expression (based on the original one) should work: ^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$
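A quick way to try this regex is to run it over the sample lines used elsewhere in this thread. The names CONFIG_RE, samples and results are made up for the sketch; the pattern itself is unchanged:

```python
import re

# Group 1 is the "key = value" part (escapes still in place),
# group 3 is the comment, if there is one.
CONFIG_RE = re.compile(r"^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$")

samples = [
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption2 = some value with an escaped \# hash",
    r"someoption3 = some value with a \# hash # some comment",
]

results = []
for line in samples:
    m = CONFIG_RE.match(line)
    results.append((m.group(1), m.group(3)))

for pair in results:
    print(pair)
```

Note that group 1 still contains the \# escapes and any trailing spaces, so the key and value need to be split and cleaned up afterwards.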

dave 2010-09-24 02:07:48

Thanks for the info dude!

apeace 2010-09-24 02:12:56

Oh, and hashes have to be escaped in regex's in Python. Now you know!

apeace 2010-09-24 02:13:17

@apeace Really? The python documentation makes no mention of this, and I seem to be able to use unescaped #s without issue...

dave 2010-09-24 02:21:32

@apeace: `#` is a magic character in Python re syntax **only** in re.VERBOSE mode ... which you should be using unless your code is write-only

John Machin 2010-09-24 04:17:14

@dave: Thanks! Your regex works perfectly.

apeace 2010-09-27 00:51:56

@john-machin: Yep--you're right about `#`. Totally misread the doc on that one.

apeace 2010-09-27 00:52:26

Answer 5

+1 A:

I would use this regular expression in multiline mode:

^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)

This allows any character to be escaped (\\.). If you just want to allow the #, use \\# instead.
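Here is a rough sketch of how that pattern could be used. It is applied to one line at a time (an assumption made for this example, so the multiline flag is not needed), and parse_assignment is a hypothetical helper that also strips the escaping backslashes:

```python
import re

# Pattern from the answer, unchanged.
ASSIGN_RE = re.compile(r"^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)")

def parse_assignment(line):
    """Return (key, value), or None if the line holds no assignment."""
    m = ASSIGN_RE.match(line)
    if m is None:
        return None
    key, raw_value = m.groups()
    # Drop the escaping backslashes, then the surrounding whitespace.
    value = re.sub(r"\\(.)", r"\1", raw_value).strip()
    return key, value

print(parse_assignment(r"someoption3 = some value with a \# hash # some comment"))
print(parse_assignment(r"# this line is only a comment"))
```

The value group stops at the first unescaped #, so the trailing comment is simply left unmatched.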

Gumbo 2010-09-24 05:49:11
