Although this question is similar to this thread, I think I might be doing something wrong when constructing the code with the regular expression.

I want to match anything in a line up to a comment ("#") or the end of the line (if it doesn't have a comment).

The regex I am using is: (.*)(#|$)

(.*) = Everything
(#|$) = comment or end of line

The code:

OPTION = re.compile(r'(?P<value>.*)(#|$)')
file = open('file.txt')
lines = file.read()
for line in lines.split('\n'):
    get_match = OPTION.match(line)
    if get_match:
        line_value = get_match.group('value')
        print "Match=  %s" % line_value

The above works but does not strip out the comment. If the file has a line like:

this is a line   # and this is a comment

I still get the whole line when running the code.

Am I missing something in the regular expression, or do I need to change the code?

+3  A: 

Here's the correct regex to do something like this:

([^#]*)(#.*)?
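
A minimal sketch of how this pattern behaves, assuming the value you want is always group 1. The helper name strip_comment is made up for illustration. Because [^#]* can never cross a "#", greediness is not a problem here, and the match always succeeds since both groups may be empty:

```python
import re

# Group 1: everything before the first '#'; group 2 (optional): the comment itself.
OPTION = re.compile(r'([^#]*)(#.*)?')

def strip_comment(line):
    # Hypothetical helper, not part of the original question.
    return OPTION.match(line).group(1)

print(repr(strip_comment("this is a line   # and this is a comment")))
print(repr(strip_comment("no comment here")))
```

Note that the whitespace before the "#" is kept in group 1, so you may want to call .rstrip() on the result.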

Also, instead of reading the whole file and splitting it on '\n', why don't you just iterate over the file object:

file = open('file.txt')
for line in file:
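
Put together, the loop might look roughly like this. This is only a sketch: io.StringIO stands in for open('file.txt') so it runs without a real file, and the sample contents are invented.

```python
import io
import re

OPTION = re.compile(r'([^#]*)(#.*)?')

# Stand-in for open('file.txt'); the two lines below are made-up sample data.
fake_file = io.StringIO(u"first = 1  # a comment\nsecond = 2\n")

values = []
for line in fake_file:  # yields one line at a time, trailing newline included
    # strip() removes the newline and any spaces left before the comment
    value = OPTION.match(line).group(1).strip()
    values.append(value)
    print("Match=  %s" % value)
```
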
Can Berk Güder
As I understand it, the OP doesn't want to match the comment at all, so you can drop the second part of your regex: (#.*)?
Alan Moore
+3  A: 

The * is greedy (consumes as much of the string as it can) and is thus consuming the entire line (past the # and to the end-of-line). Change ".*" to ".*?" and it will work.

See the Regular Expression HOWTO for more information.
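
A quick way to see the difference for yourself, using the line from the question (a sketch, nothing more):

```python
import re

line = "this is a line   # and this is a comment"

greedy = re.match(r'(.*)(#|$)', line)   # the pattern from the question
lazy = re.match(r'(.*?)(#|$)', line)    # same pattern with a non-greedy star

# Greedy: .* swallows the whole line, then $ matches at the very end.
print(repr(greedy.group(1)), repr(greedy.group(2)))
# Lazy: .*? grows one character at a time and stops at the first '#'.
print(repr(lazy.group(1)), repr(lazy.group(2)))
```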

Benji York
I went through the documentation for the re module, but its explanation of "greedy" wasn't as clear to me as yours. Thanks for a great answer :)
alfredodeza
A: 

Use this regular expression:

^(.*?)(?:#|$)

With the non-greedy modifier (?), the .*? expression matches as little as possible, so it stops as soon as either a hash sign or the end of the line is reached. The default is to match as much as possible, and that is why you always got the whole line.
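
One thing worth keeping in mind: a non-greedy group only grows when something after it forces it to. The sketch below (the sample line is invented) compares the pattern above with a version that has no terminator:

```python
import re

line = "key = value  # comment"

anchored = re.match(r'^(.*?)(?:#|$)', line)
unanchored = re.match(r'^(.*?)', line)   # nothing after the group forces it to grow

print(repr(anchored.group(1)))    # text before the '#'
print(repr(unanchored.group(1)))  # empty: zero characters already satisfy the pattern
print(anchored.groups())          # (?:...) is non-capturing, so there is only one group
```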

ΤΖΩΤΖΙΟΥ
+1  A: 

@Can, @Benji and @ΤΖΩΤΖΙΟΥ give three excellent solutions, and it's fun to time them to see how fast they match (that's what timeit is for -- fun meaningless micro-benchmarks;-). E.g.:

$ python -mtimeit -s'import re; r=re.compile(r"([^#]*)(#.*)?"); s="this is a line   # and this is a comment"' 'm=r.match(s); g=m.group(1)'
100000 loops, best of 3: 2.02 usec per loop

vs

$ python -mtimeit -s'import re; r=re.compile(r"^(.*?)(?:#|$)"); s="this is a line   # and this is a comment"' 'm=r.match(s); g=m.group(1)'
100000 loops, best of 3: 4.19 usec per loop

vs

$ python -mtimeit -s'import re; r=re.compile(r"(.*?)(#|$)"); s="this is a line   # and this is a comment"' 'm=r.match(s); g=m.group(1)'
100000 loops, best of 3: 4.37 usec per loop

and the winner is... a mix of the patterns!-)

$ python -mtimeit -s'import re; r=re.compile(r"(.*?)(#.*)?"); s="this is a line   # and this is a comment"' 'm=r.match(s); g=m.group(1)'
1000000 loops, best of 3: 1.73 usec per loop

Disclaimer: of course if this were a real benchmarking exercise and speed did truly matter, one would try on many different and relevant values for s, on tests beyond such a microbenchmark, etc, etc. But, I still find timeit an inexhaustible source of fun!-)
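
Timings are only comparable if the patterns capture the same text, so a sanity check on group(1) is worth running alongside them. In this sketch the first three patterns capture the text before the "#", but the mixed pattern (.*?)(#.*)? captures an empty string: nothing after the lazy group is mandatory, so .*? is free to match zero characters. Its speed may therefore reflect less work rather than a better pattern.

```python
import re

s = "this is a line   # and this is a comment"

# The four patterns timed above, in the same order.
patterns = [
    r"([^#]*)(#.*)?",
    r"^(.*?)(?:#|$)",
    r"(.*?)(#|$)",
    r"(.*?)(#.*)?",
]

captured = {}
for p in patterns:
    captured[p] = re.match(p, s).group(1)
    print("%-18s -> %r" % (p, captured[p]))
```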

Alex Martelli