ansaurus

Question

Need a regex to match special situations

Answer 1

A:

Well, this works for the first one...

((.)(.))(\2\3)+

rikh 2010-05-18 09:47:32

Can you please give a short explanation of this regex?

Daniel 2010-05-18 11:15:18

Yep. ((.)(.)) matches 2 characters next to each other (whilst also looking suggestive). (\2\3) matches the same 2 characters again, and the + says this must happen 1 or more times. Looking at it, you may be able to ditch the outer brackets in the first set and change the back references accordingly.

rikh 2010-05-18 13:36:11

But it looks more suggestive with the outer parentheses, so please leave them in.

Tim Pietzcker 2010-05-18 14:20:31

For some reason, with the outer parens there it reminds me of Japanese video games.

Alan Moore 2010-05-19 04:21:26

Answer 2

+1 A:

Assuming that you use perl/PCRE:

Use (.{2})\1+ or ((.)(?!\2)(.))\1+. The second regex avoids matching runs of a single character, such as oooo.

UPD: For the second case (a run of n identical characters followed by a run of m identical characters), use ((.)\2{N}).*?((?!\2)(.)\4{M}), replacing N and M with n-1 and m-1. Remove (?!\2) if you also want matches like oooaoooo.
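Here is a minimal Java sketch of the run pattern above, with N and M filled in as n-1 and m-1. The class and method names (BackrefRunSketch, runPattern, firstMatch) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefRunSketch {
    // Builds ((.)\2{n-1}).*?((?!\2)(.)\4{m-1}) for run lengths n and m (both >= 1).
    static String runPattern(int n, int m) {
        return "((.)\\2{" + (n - 1) + "}).*?((?!\\2)(.)\\4{" + (m - 1) + "})";
    }

    // Returns the first match in text, or null when there is none.
    static String firstMatch(String text, int n, int m) {
        Matcher matcher = Pattern.compile(runPattern(n, m)).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The three shapes mentioned in the question, with n = 2 and m = 4.
        System.out.println(firstMatch("xxyyyy", 2, 4));
        System.out.println(firstMatch("xxayyyy", 2, 4));
        System.out.println(firstMatch("xxaaayyyy", 2, 4));
        // With (?!\2) in place, the second run must use a different character.
        System.out.println(firstMatch("oooaoooo", 3, 4));
    }
}
```

Each of the first three calls should return the whole input string. The last call should return null, because both runs in oooaoooo use the same character.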

ZyX 2010-05-18 09:51:44

I need a regex that matches a combination of groups, for example a group of n identical chars followed by a group of m identical chars. The simplest pattern would look like this: xxyyyy, but it might also be xxayyyy or xxaaayyyy, so the two groups may be separated by arbitrary characters.

Daniel 2010-05-18 11:14:48

Updated. Hope I understood you.

ZyX 2010-05-18 11:48:20

Answer 3

A:

Examples in javascript

a = "This is my foobababababaf string"

console.log(a.replace(/(.)(.)(\1\2)+/, "<<$&>>"))

a = "This is my foobaafoobaaaooo string"

console.log(a.replace(/(.)\1+(.)\2+/, "<<$&>>"))

stereofrog 2010-05-18 09:56:34

Both of those regexes can match runs of a single character, like `aaaaaa`; I don't think the OP wants that.

Alan Moore 2010-05-18 12:12:47
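To illustrate the comment above, here is a small Java sketch that runs both patterns from this answer against aaaaaa. The two variants with a negative lookahead are an assumed fix in the spirit of the other answers, not something the answer's author posted, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleCharRunSketch {
    // Returns the first match of regex in text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String run = "aaaaaa";
        // The two patterns from the answer both accept a run of one character.
        System.out.println(firstMatch("(.)(.)(\\1\\2)+", run));
        System.out.println(firstMatch("(.)\\1+(.)\\2+", run));
        // Assumed fix: a negative lookahead forces the second character to differ.
        System.out.println(firstMatch("(.)(?!\\1)(.)(\\1\\2)+", run));
        System.out.println(firstMatch("(.)\\1+(?!\\1)(.)\\2+", run));
    }
}
```

The first two calls should return aaaaaa and the last two should return null, while the lookahead variants still find bababababa and aaaooo in the answer's two sample strings.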

Answer 4

+3 A:

I think something like this is what you want.

For alternating characters:

(?=(.)(?!\1)(.))(?:\1\2){2,}

\0 will be the entire alternating sequence, \1 and \2 are the two (distinct) alternating characters.

For run of N and M characters, possibly separated by other characters (replace N and M with numbers here):

(?=(.))\1{N}.*?(?=(?!\1)(.))\2{M}

\0 will be entire match, including infix. \1 is the character repeated (at least) N times, \2 is the character repeated (at least) M times.

Here's a test harness in Java.

import java.util.regex.*;

public class Regex3 {
    static String runNrunM(int N, int M) {
        return "(?=(.))\\1{N}.*?(?=(?!\\1)(.))\\2{M}"
            .replace("N", String.valueOf(N))
            .replace("M", String.valueOf(M));
    }
    static void dumpMatches(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        System.out.println(text + " <- " + pattern);
        while (m.find()) {
            System.out.println("  match");
            for (int g = 0; g <= m.groupCount(); g++) {
                System.out.format("    %d: [%s]%n", g, m.group(g));
            }
        }
    }
    public static void main(String[] args) {
        String[] tests = {
            "foobababababaf foobaafoobaaaooo",
            "xxyyyy axxayyyya zzzzzzzzzzzzzz"
        };
        for (String test : tests) {
            dumpMatches(test, "(?=(.)(?!\\1)(.))(?:\\1\\2){2,}");
        }
        for (String test : tests) {
            dumpMatches(test, runNrunM(3, 3));
        }
        for (String test : tests) {
            dumpMatches(test, runNrunM(2, 4));
        }
    }
}

This produces the following output:

foobababababaf foobaafoobaaaooo <- (?=(.)(?!\1)(.))(?:\1\2){2,}
  match
    0: [bababababa]
    1: [b]
    2: [a]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.)(?!\1)(.))(?:\1\2){2,}
foobababababaf foobaafoobaaaooo <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
  match
    0: [aaaooo]
    1: [a]
    2: [o]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
  match
    0: [yyyy axxayyyya zzz]
    1: [y]
    2: [z]
foobababababaf foobaafoobaaaooo <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
  match
    0: [xxyyyy]
    1: [x]
    2: [y]
  match
    0: [xxayyyy]
    1: [x]
    2: [y]

Explanation

(?=(.)(?!\1)(.))(?:\1\2){2,} has two parts
- (?=(.)(?!\1)(.)) establishes \1 and \2 using lookahead
  - Nested negative lookahead ensures that \1 != \2
  - Using lookahead to capture lets \0 have the entire match (instead of just the "tail" end)
- (?:\1\2){2,} matches the \1\2 sequence (without capturing it), which must repeat at least twice.
(?=(.))\1{N}.*?(?=(?!\1)(.))\2{M} has three parts
- (?=(.))\1{N} captures \1 in a lookahead, and then matches it N times
  - Using lookahead to capture means the repetition can be N instead of N-1
- .*? allows an infix to separate the two runs; the reluctant quantifier keeps the infix as short as possible
- (?=(?!\1)(.))\2{M}
  - Similar to first part
  - Nested negative lookahead ensures that \1 != \2

The run regex also matches inside longer runs, e.g. run(2,2) finds a match in "xxxyyy":

xxxyyy <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{2}
  match
    0: [xxxyy]
    1: [x]
    2: [y]

Also, it does not allow overlapping matches. That is, there is only one run(2,3) in "xx11yyy222".

xx11yyy222 <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{3}
  match
    0: [xx11yyy]
    1: [x]
    2: [y]
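
If overlapping matches are wanted, one possible approach (an assumption on my part, not part of the harness above) is to wrap the whole run pattern in a zero-width lookahead and read the text from a capture group. The sketch below uses made-up names, and the backreferences shift by one because of the extra outer group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingRunsSketch {
    // Wraps the run pattern in a zero-width lookahead; group numbers shift by one
    // because the new outer group 1 holds the text of each candidate match.
    static String overlappingRunPattern(int n, int m) {
        return "(?=((?=(.))\\2{" + n + "}.*?(?=(?!\\2)(.))\\3{" + m + "}))";
    }

    static List<String> overlappingMatches(String text, int n, int m) {
        Matcher matcher = Pattern.compile(overlappingRunPattern(n, m)).matcher(text);
        List<String> found = new ArrayList<>();
        // Each find() consumes nothing, so the next search resumes one character later.
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(overlappingMatches("xx11yyy222", 2, 3));
    }
}
```

For "xx11yyy222" this should report xx11yyy, 11yyy, yyy222 and yy222, one candidate per starting position where a run(2,3) begins.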

polygenelubricants 2010-05-18 14:14:31
