I am trying to figure out why a regex with negative look ahead fails when the "single line" option is turned on.

Example (simplified):

<source>Test 1</source>
<source>Test 2</source>
<target>Result 2</target>
<source>Test 3</source>

This:

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)

will fail if the single line option is on, and will work if the single line option is off. For instance, this works (disables the single line option):

(?-s:<source>(?!.*<source>)(.*?)</source>(?!\s*<target))

My understanding is that the single line mode simply allows the dot "." to match new lines, and I don't see why it would affect the expression above.

Can anyone explain what I am missing here?

::::::::::::::::::::::

EDIT: (?!.*) is a negative look ahead not a capturing group.

 <source>(?!.*?<source>)(.*?)</source>(?!\s*<target)

will ALSO FAIL if the single line mode is on, so it doesn't look like this is a greediness issue. Try it in a Regex designer (like Expresso or Rad regex):

With single line OFF, it matches (as expected):

<source>Test 1</source>    
<source>Test 3</source>

With single line ON:

<source>Test 3</source>

I don't understand why it doesn't match the first one as well: its content does not contain another <source>, so I would expect the first negative lookahead to let it through and the whole expression to match.
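For anyone who wants to reproduce this outside a regex designer, here is a minimal sketch in Python. It assumes that Python's re.DOTALL flag behaves like the "single line" option described above (the designers mentioned in the question may use a different engine), and the variable names are only for illustration.

```python
import re

# Sample input from the question.
text = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

pattern = r"<source>(?!.*<source>)(.*?)</source>(?!\s*<target)"

# Single line OFF: "." stops at the end of each line.
single_line_off = re.findall(pattern, text)

# Single line ON (re.DOTALL in Python): "." also matches newlines.
single_line_on = re.findall(pattern, text, re.DOTALL)

# Single line ON globally, but switched off inside the group with (?-s:...).
scoped_off = re.findall(r"(?-s:" + pattern + r")", text, re.DOTALL)

print(single_line_off)  # ['Test 1', 'Test 3']
print(single_line_on)   # ['Test 3']
print(scoped_off)       # ['Test 1', 'Test 3']
```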

+2  A: 

It "fails" because you seem to have misplaced the negative lookahead.

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)
        ^^^^^^^^^^^^^^

Now, let's consider what (?!.*<source>) does here: it's a lookahead that says that there is NO match for .*<source> from that position.

Well, in single-line mode, . matches everything, including newlines, so .* can run past the end of the current line. After each of the first two <source> tags, there IS in fact a match for .*<source> (a later <source> further down the text)! So the negative lookahead fails for the first two <source>.

On the last <source>, .*<source> no longer matches, so the negative lookahead succeeds. The rest of the pattern also succeeds, and that's why you only get <source>Test 3</source> in single-line mode.
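To see this position by position, one can probe what the lookahead body .*<source> finds right after each opening tag. This is only a sketch in Python, assuming re.DOTALL corresponds to single-line mode; probe_plain and probe_dotall are hypothetical helper names.

```python
import re

text = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

# The body of the negative lookahead, compiled with and without DOTALL.
probe_plain = re.compile(r".*<source>")
probe_dotall = re.compile(r".*<source>", re.DOTALL)

plain_hits = []
dotall_hits = []
for tag in re.finditer(r"<source>", text):
    pos = tag.end()  # the position where the lookahead is evaluated
    # match(text, pos) anchors the attempt at pos, like a lookahead would.
    plain_hits.append(probe_plain.match(text, pos) is not None)
    dotall_hits.append(probe_dotall.match(text, pos) is not None)

# True means ".*<source>" matched, i.e. the NEGATIVE lookahead rejects the tag.
print(plain_hits)   # [False, False, False]
print(dotall_hits)  # [True, True, False]
```

With DOTALL, only the last tag has no <source> anywhere after it, which is why it is the only one that survives the lookahead.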

polygenelubricants
Using negated character classes is simpler and quicker: `<source>([^<]*)</source>(?!\s*<target>)`
Pent Ploompuu
Aaah! Now I get it. Thanks!
Sylverdrag
@pent: I can't use character classes in this case because the source tag can contain other tags (and square brackets) which also need to be matched.
Sylverdrag
+2  A: 

I believe this is what you're looking for:

<source>((?:(?!</?source>).)*)</source>(?!\s*<target)

The idea is that you match each character one at a time, but only after making sure it isn't the first character of <source> or </source>. Also, with the addition of /? to the lookahead, you don't have to use a non-greedy quantifier.
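As a quick check, here is a sketch of that pattern in Python, again assuming re.DOTALL stands in for single-line mode. The second input, with nested tags, square brackets and a line break, is a made-up example based on the asker's comment and is not from the original data.

```python
import re

# Each "." is only consumed if it does not start <source> or </source>.
pattern = re.compile(
    r"<source>((?:(?!</?source>).)*)</source>(?!\s*<target)",
    re.DOTALL,
)

# Sample input from the question.
text = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)
untranslated = pattern.findall(text)
print(untranslated)  # ['Test 1', 'Test 3']

# Hypothetical input: other tags, brackets and a newline inside <source>.
nested = "<source>Line <b>one</b>\ncontinues [x]</source>\n"
nested_result = pattern.findall(nested)
print(nested_result)  # ['Line <b>one</b>\ncontinues [x]']
```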

Alan Moore
+1; I made a mistake in my suggested "fix" in the comment (now deleted). This one works.
polygenelubricants
Very nice. Thanks a lot, Alan!
Sylverdrag