ansaurus

Question

Answer 1

+1 A:

Regex quantifiers are greedy by default, which means that .* matches as much as it can: here, the entire rest of the string. So there is nothing left for the optional group to match. group(0) is always the entire matched string.

From your comment: why do you even want a regex? Isn't something like this enough:

if line.startswith('MIS Notes:'):  # starts with that string
    data = line[len('MIS Notes:'):]  # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//')  # or something like that
    else:
        pass  # other stuff
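
One detail about the snippet above: str.partition always returns a 3-tuple, and when the separator is missing, the second and third items are empty strings, so the membership test can be folded into the result. A minimal sketch, assuming the prefix always sits at the very start of the line; the helper name split_mis_notes is made up for illustration:

```python
def split_mis_notes(line, tag='//p//'):
    """Return (before, tag_found, after) for a 'MIS Notes:' line, else None."""
    prefix = 'MIS Notes:'
    if not line.startswith(prefix):
        return None
    data = line[len(prefix):]
    # partition() yields (data, '', '') when the tag is absent
    before, sep, after = data.partition(tag)
    return before, bool(sep), after


print(split_mis_notes('MIS Notes: abc//p//def'))
print(split_mis_notes('MIS Notes: no tag here'))
print(split_mis_notes('Other line'))
```

The boolean in the middle says whether the tag was present, which is the piece of information the optional regex group was meant to provide.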

THC4k 2010-05-03 18:07:59

Thanks for the quick comment. Ultimately I'm trying to match any string that has the pattern MIS Notes: followed by anything or nothing, then the string //p//, then anything or nothing after it. Naturally, //p// is optional as well. So I guess I need a way to prevent the greediness at the beginning. Sorry if this description isn't clear enough; let me know and I can try to clarify.

AaronG 2010-05-03 18:12:25

Edited question to try and clarify purpose / regex reasoning. Thanks.

AaronG 2010-05-03 18:45:06

Answer 2

A:

The pattern is ambiguous for your purposes. It would be good to group them by prefix or suffix. In the example here, I've chosen prefix grouping. Basically, if //p// occurs in the line, then the prefix group is not None (it may still be an empty string). The suffix will be everything after the //p// item, or everything in the line if //p// doesn't exist.

import re

lines = ['MIS Notes: //p//',
         'MIS Notes: prefix//p//suffix']

# Group 1: text before //p// (None if //p// is absent); group 2: the rest.
tickettypepat = r'MIS Notes: (?:(.*)//p//)?(.*)'
for line in lines:
    m = re.search(tickettypepat, line)
    print('line:', line)
    if m:
        print('groups:', m.groups())
    else:
        print('groups:', m)

results:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')

Shane Holloway 2010-05-03 18:30:39

Unless I'm misunderstanding your solution, I'm not sure that does what I want. Ideally the regex would always return //p//, //e// or //w// in a specific group if it was present in the input.

AaronG 2010-05-03 18:54:34

Answer 3

+1 A:

MIS Notes:.*(//p//)?.* works like this, on the example of "MIS Notes: //p//" as the target:

MIS Notes: matches "MIS Notes:", no surprises here.
.* immediately runs to the end of the string (match so far "MIS Notes: //p//")
(//p//)? is optional. Nothing happens.
.* has nothing left to match, we are at the end of the string already. Since the star allows zero matches for the preceding atom, the regex engine stops, reporting the entire string as a match and the sub-group as empty.
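
The trace above can be checked directly with Python's re module. This is only a small illustration using the same pattern and target string; note that Python reports the skipped group as None rather than as an empty string:

```python
import re

target = 'MIS Notes: //p//'
m = re.match(r'MIS Notes:.*(//p//)?.*', target)

# The first .* swallowed everything, so the optional group never took part.
whole = m.group(0)
tag = m.group(1)
print(repr(whole))  # 'MIS Notes: //p//'
print(tag)          # None
```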

Now when you change the regex to MIS Notes:.*(//p//).*, the behavior changes:

MIS Notes: matches "MIS Notes:", still no surprises here.
.* immediately runs to the end of the string (match so far "MIS Notes: //p//")
(//p//) is necessary. The engine starts to backtrack character by character in order to fulfill this requirement. (Match so far "MIS Notes: ")
(//p//) can match. Sub-group one is saved and contains "//p//".
.* runs to the end of the string. Hint: If you are not interested in what it matches, it is superfluous and you can remove it.
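
A short check of this second variant, as a sketch rather than a recommendation. It also shows two side effects of making the group mandatory: a line without the tag no longer matches at all, and with several tags the greedy .* backtracks only as far as the last one.

```python
import re

pattern = re.compile(r'MIS Notes:.*(//p//).*')

m = pattern.match('MIS Notes: //p//')
print(m.group(1))    # //p//

# Without the tag, the whole match fails instead of leaving the group empty.
no_tag = pattern.match('MIS Notes: plain text')
print(no_tag)        # None

# With two tags, group 1 is the second one (it starts at index 18).
two = pattern.match('MIS Notes: a//p//b//p//c')
print(two.start(1))  # 18
```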

Now when you change the regex to MIS Notes:.*?//(p)//, the behavior changes again:

MIS Notes: matches "MIS Notes:", and still no surprises here.
.*? is non-greedy and checks the following atom before it proceeds (match so far "MIS Notes: ")
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that no backtracking occurs, this saves time.
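
For comparison, the non-greedy version stops at the first tag it can reach. A minimal check; the two-tag input line is made up for illustration:

```python
import re

m = re.match(r'MIS Notes:.*?//(p)//', 'MIS Notes: a//p//b//p//c')

first_tag = m.group(1)
matched = m.group(0)
print(first_tag)  # p
print(matched)    # MIS Notes: a//p//
```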

Now if you know that there can be no / before the //p//, you can use: MIS Notes:[^/]*//(p)//:

MIS Notes: matches "MIS Notes:", you get the idea.
[^/]* can fast-forward to the first slash (this is faster than .*?)
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that no backtracking occurs, this saves time. This should be faster than version #3.
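
The precondition matters: the negated character class only works if no slash appears before the tag. A small sketch with made-up input lines shows both the normal case and what happens when that assumption is violated:

```python
import re

pattern = re.compile(r'MIS Notes:[^/]*//(p)//')

ok = pattern.match('MIS Notes: some text //p// more')
print(ok.group(1))  # p

# A single slash before the tag stops [^/]* early, and the match fails.
broken = pattern.match('MIS Notes: see a/b //p//')
print(broken)       # None
```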

Tomalak 2010-05-03 18:51:54

Thanks for the thorough explanation. I think I understand this now. I have to run to a meeting for a tick, but will test upon return.

AaronG 2010-05-03 18:59:10

I tried it and I think we are close. MIS Notes:.*?(//[pewPEW]) matches all the cases where there's a //*// tag of some sort, but it breaks on this part of my code: `elif retype.group(1) == ('' or None):` followed by `line = line + 'Type: Ticket\n'` and `print 'Setting type for... ' + line`. Whereas MIS Notes:.*?(//[pewPEW]//)? seems to give null for group(1) no matter what. Same thing with MIS Notes:.*?(//[pewPEW]//)*

AaronG 2010-05-03 20:06:40

I obviously missed the trailing // in my first example in the last comment. Should have read, MIS Notes:.*?(//[pewPEW]//)

AaronG 2010-05-03 20:32:02

@Aaron: As indicated, `*` and `?` both make the group optional. My case #1 applies to both. I'm not sure what you mean by "breaks"?

Tomalak 2010-05-03 20:33:11

Sorry about the ambiguity on "breaks". If I run with MIS Notes:.*?(//[pewPEW]//), it returns FALSE on line 23 in my post above ('elif retype:'), thus skipping the logic chain below it and ultimately doing nothing for lines that don't have //*// in them, where it should be marking those as Ticket.

AaronG 2010-05-03 20:49:38

@Aaron: This works: `MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)?` but it's ugly and I'm trying to come up with a nicer version.
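
To see what that pattern does: `(?:(?!//[pewPEW]//).)*` consumes one character at a time, but only while no tag starts at the current position, so the optional group that follows gets a real chance to match. A quick check in Python, with input lines invented for illustration:

```python
import re

pattern = re.compile(r'MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)?')

samples = [
    'MIS Notes: text //p// more',
    'MIS Notes: //E//',
    'MIS Notes: a/b //w//',      # single slashes before the tag are fine
    'MIS Notes: plain ticket',   # no tag: still matches, group 1 is None
]

results = [pattern.match(s).group(1) for s in samples]
print(results)  # ['p', 'E', 'w', None]
```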

Tomalak 2010-05-03 21:22:02

@Aaron: The explanation why something in the form of `.*?(x)?` won't capture the group is that `.*?` has no reason to move forward if the rest of the expression is optional. And since the rest of the expression is optional, the engine finishes immediately.
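
This is easy to observe: the overall match succeeds, but it ends right after the prefix. A minimal sketch with made-up input; the only case where the group captures is when the tag directly follows the prefix, because then the lazy `.*?` does not need to advance at all:

```python
import re

pattern = re.compile(r'MIS Notes:.*?(//[pewPEW]//)?')

m = pattern.match('MIS Notes: text //p// more')
print(repr(m.group(0)))  # 'MIS Notes:'
print(m.group(1))        # None

adjacent = pattern.match('MIS Notes://p// more')
print(adjacent.group(1))  # //p//
```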

Tomalak 2010-05-03 21:37:19

MIS Notes:(?:(?!//[pewPEW]//).)*(?://([pewPEW])//)? Confirmed working with my data. Thanks, again. Pretty sure I would have never come up with that.

AaronG 2010-05-03 21:48:32

@AaronG: Having thought about this some more, I don't think there is a much simpler regex to accomplish what you want in one step. Since *"Simple is better than complex"*, I would recommend working in two steps, matching `MIS Notes:` to find the line, and then matching `//([pewPEW])//` in a second step. This would be much more obvious and maintainable.
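
A rough sketch of what the two-step version might look like. The function name ticket_tag and the choice of return values are assumptions for illustration, as is the use of startswith (the prefix could also be located with a regex):

```python
import re

PREFIX = 'MIS Notes:'
TAG_RE = re.compile(r'//([pewPEW])//')


def ticket_tag(line):
    """Return the tag letter, '' if the line has no tag, None if not a notes line."""
    # Step 1: find the line.
    if not line.startswith(PREFIX):
        return None
    # Step 2: look for a tag anywhere after the prefix.
    m = TAG_RE.search(line, len(PREFIX))
    return m.group(1) if m else ''


print(repr(ticket_tag('MIS Notes: text //p// more')))  # 'p'
print(repr(ticket_tag('MIS Notes: plain ticket')))     # ''
print(repr(ticket_tag('Something else')))              # None
```

Returning '' for "notes line without a tag" keeps that case distinct from "not a notes line", which is the distinction the Ticket default in the question needs.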

Tomalak 2010-05-03 22:01:13


Regex optional match in python fails

Takes Text and cleans up / converts some things.
