Hi,

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:

Consider a string "nnnn", where I want to find all matches of "nn", including those that overlap with each other. So the regex would provide the following 3 matches (brackets mark each match):

  1. [nn]nn
  2. n[nn]n
  3. nn[nn]

I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.

A: 

AFAIK, there is no pure regex way to do that in one go (i.e. returning the three captures you request without a loop).

What you can do is find the pattern once, then loop, restarting the search at offset (found position + 1). That combines the regex with a little simple code.

[EDIT] Great, I am downvoted when I basically said what Jan showed...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, and it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

PhiLho
Beat me to it by 1 second, I'll withdraw my identical answer!
Simon Steele
not true, see "VonC"'s answer
Timothy Khouri
@Timothy: that won't do the capture, and you still have to loop on the results, so I am not sure of the advantages...
PhiLho
@PhiLho: again, not true: you can capture a group inside a zero-width assertion such as a positive lookahead. See my - completed - answer.
VonC
@PhiLho: I responded to your comment. And, in my opinion, your answer was less precise than Jan's: "the pattern" could refer to 'n', whereas the correct strategy means using 'nn', then start again at offset+1. You may have meant that all along, you just did not explain it.
VonC
@VonC: the question is precise, the pattern has been "nn" all along, I don't see any ambiguity there.
PhiLho
+7  A: 

A possible solution could be to use a positive lookbehind:

(?<=n)n

It would match the second n of each overlapping pair, giving you the end position of each "nn" (brackets mark the matched character):

  1. n[n]nn
  2. nn[n]n
  3. nnn[n]


As mentioned by Timothy Khouri, a positive lookahead is more intuitive.

Rather than his suggestion (?=nn)n, I would prefer the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
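
To make this concrete, here is a minimal C# sketch of the (n)(?=(n)) form, assuming .NET's System.Text.RegularExpressions. The class LookaheadDemo and the helper OverlappingPairs are hypothetical names made up for this illustration. The helper joins group 1 and group 2 to rebuild each overlapping pair, and assumes the sub-pattern passed in contains no capturing groups of its own (otherwise the group numbers would shift).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper (not part of .NET): builds (unit)(?=(unit)) and
    // joins group 1 (consumed) with group 2 (captured inside the lookahead).
    // Assumes `unit` has no capturing groups of its own.
    public static List<string> OverlappingPairs(string input, string unit)
    {
        var regex = new Regex("(" + unit + ")(?=(" + unit + "))");
        var pairs = new List<string>();
        foreach (Match m in regex.Matches(input))
        {
            // Only group 1 is consumed, so the next attempt starts right
            // after it and can reuse the text seen by the lookahead.
            pairs.Add(m.Groups[1].Value + m.Groups[2].Value);
        }
        return pairs;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", OverlappingPairs("nnnn", "n")));
        Console.WriteLine(string.Join(" ", OverlappingPairs("foo4237bar", @"\d")));
    }
}
```

With these inputs the two lines printed should be "nn nn nn" and "42 23 37": three overlapping results in each case, although a loop over the matches is still needed to read them.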

VonC
Also, you could have done it with a positive look ahead: (?=nn)n ... that says "while ahead is two N's, match an N".
Timothy Khouri
Excuse me, but I still don't see the requested three overlapping captures. You capture two n, but not three groups. If I match (\d\d)(?=(\d\d)) against foo4237bar, I get two captures, not three: 42 and 37 (in both Regex Coach and PCRE Workbench). I am probably thick, so I need more explanations.
PhiLho
Please read the answer again: it is (\d)(?=(\d)), not (\d\d)(?=(\d\d)). You will get 3 sets of capturing groups: (4)(2), (2)(3), (3)(7)
VonC
+5  A: 

Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:

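Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
    // Use matchObj.Value and matchObj.Index here.
    // Resume one character past the start of this match, not past its end,
    // so that overlapping matches are found too.
    matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}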
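
For anyone who wants to run the loop above as-is, here is a self-contained sketch that collects the start index of every match. OverlapLoop and OverlappingIndexes are hypothetical names chosen for this example, and the guard against an empty match at the end of the string is an addition that the "nn" case itself does not need.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class OverlapLoop
{
    // Hypothetical helper (not part of .NET): returns the start index of
    // every match of `pattern`, including matches that overlap earlier ones.
    public static List<int> OverlappingIndexes(string subject, string pattern)
    {
        var regex = new Regex(pattern);
        var indexes = new List<int>();
        Match m = regex.Match(subject);
        while (m.Success)
        {
            indexes.Add(m.Index);
            // Guard (added for this sketch): an empty match at the very end
            // would make Index + 1 exceed the string length, which
            // Regex.Match(string, int) rejects with an exception.
            if (m.Index + 1 > subject.Length) break;
            // Restart one character after the START of the current match.
            m = regex.Match(subject, m.Index + 1);
        }
        return indexes;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", OverlappingIndexes("nnnn", "nn")));
    }
}
```

For "nnnn" and "nn" this should print 0,1,2, i.e. the three overlapping matches asked for in the question.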
Jan Goyvaerts
Regular-Expressions.info webmaster... => mandatory +1. Plus, you are right, of course.
VonC