I'm desperately searching for regular expressions that match these scenarios:

1) Match alternating chars

I have a string like "This is my foobababababaf string" and I want to match "babababa".

The only thing I know is the length of the fragment to search for. I don't know which chars/digits it contains, only that they alternate.

I've really no clue where to start :(

2) Match combined groups

In a string like "This is my foobaafoobaaaooo string" I want to match "aaaooo". As in 1), I don't know which chars/digits they will be. I only know that they will appear in two groups.

I experimented using (.)\1\1\1(.)\1\1\1 and things like this...

A: 

Well, this works for the first one...

((.)(.))(\2\3)+
rikh
Can you please give a short explanation of this regex?
Daniel
Yep. ((.)(.)) matches 2 characters next to each other (whilst also looking suggestive). (\2\3) matches the same 2 characters again, and the + says this must happen 1 or more times. Looking at it, you may be able to ditch the outer brackets in the first set and change the back references accordingly.
rikh
But it looks more suggestive with the outer parentheses, so please leave them in.
Tim Pietzcker
For some reason, with the outer parens there it reminds me of Japanese video games.
Alan Moore
+1  A: 

Assuming that you use perl/PCRE:

  1. (.{2})\1+ or ((.)(?!\2)(.))\1+. The second regex prevents matching things like oooo.

UPD: Then 2. would be ((.)\2{N}).*?((?!\2)(.)\4{M}), with N and M replaced by n-1 and m-1. Remove (?!\2) if you also want matches like oooaoooo.
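
To see the difference between the two patterns in item 1, here is a small JavaScript sketch (JavaScript only because another answer in this thread uses it). The sample strings and variable names are made up for illustration, and it assumes an engine with backreferences and lookahead:

```javascript
// Pattern 1a: any two characters, repeated at least once more.
const anyPair = /(.{2})\1+/;
// Pattern 1b: same idea, but (?!\2) forces the two characters to differ.
const distinctPair = /((.)(?!\2)(.))\1+/;

const sameChars = "xooooy";            // illustrative input, not from the question
const alternating = "foobababababaf";  // fragment from the question

console.log(anyPair.test(sameChars));             // true: "oooo" is "oo" repeated
console.log(distinctPair.test(sameChars));        // false: no pair of distinct chars repeats
console.log(alternating.match(distinctPair)[0]);  // "bababababa"
```

Group 1 of the second pattern holds the repeated pair ("ba" here), group 2 and group 3 the two characters.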

ZyX
I need a regex that matches a combination of groups, for example a group of n identical chars followed by a group of m identical chars. The simplest pattern would look like this: xxyyyy. But it might also be xxayyyy or xxaaayyyy, so that both groups are separated by arbitrary characters.
Daniel
Updated. Hope I understood you.
ZyX
A: 

Examples in javascript

a = "This is my foobababababaf string"

console.log(a.replace(/(.)(.)(\1\2)+/, "<<$&>>"))

a = "This is my foobaafoobaaaooo string"

console.log(a.replace(/(.)\1+(.)\2+/, "<<$&>>"))
stereofrog
Both of those regexes can match runs of a single character, like `aaaaaa`; I don't think the OP wants that.
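
A quick check of that claim, as a sketch. The two "strict" variants are not from the answer above; they are an assumption that simply borrows the (?!\1) lookahead used in the other answers here:

```javascript
// The two patterns from the answer above.
const alternatingRe = /(.)(.)(\1\2)+/;
const groupsRe = /(.)\1+(.)\2+/;

const run = "aaaaaa";
console.log(run.match(alternatingRe)[0]); // "aaaaaa"
console.log(run.match(groupsRe)[0]);      // "aaaaaa"

// Hypothetical stricter variants: (?!\1) requires the second character
// to differ from the first.
const strictAlternating = /(.)(?!\1)(.)(\1\2)+/;
const strictGroups = /(.)\1+(?!\1)(.)\2+/;

console.log(strictAlternating.test(run)); // false
console.log(strictGroups.test(run));      // false
console.log("This is my foobaafoobaaaooo string".match(strictGroups)[0]);
```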
Alan Moore
+3  A: 

I think something like this is what you want.

For alternating characters:

(?=(.)(?!\1)(.))(?:\1\2){2,}

\0 will be the entire alternating sequence, \1 and \2 are the two (distinct) alternating characters.

For run of N and M characters, possibly separated by other characters (replace N and M with numbers here):

(?=(.))\1{N}.*?(?=(?!\1)(.))\2{M}

\0 will be entire match, including infix. \1 is the character repeated (at least) N times, \2 is the character repeated (at least) M times.

Here's a test harness in Java.

import java.util.regex.*;

public class Regex3 {
    // Builds the run pattern by substituting the literal N and M placeholders.
    static String runNrunM(int N, int M) {
        return "(?=(.))\\1{N}.*?(?=(?!\\1)(.))\\2{M}"
            .replace("N", String.valueOf(N))
            .replace("M", String.valueOf(M));
    }
    // Prints every match of pattern in text, listing all groups (0 is the whole match).
    static void dumpMatches(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        System.out.println(text + " <- " + pattern);
        while (m.find()) {
            System.out.println("  match");
            for (int g = 0; g <= m.groupCount(); g++) {
                System.out.format("    %d: [%s]%n", g, m.group(g));
            }
        }
    }
    public static void main(String[] args) {
        String[] tests = {
            "foobababababaf foobaafoobaaaooo",
            "xxyyyy axxayyyya zzzzzzzzzzzzzz"
        };
        for (String test : tests) {
            dumpMatches(test, "(?=(.)(?!\\1)(.))(?:\\1\\2){2,}");
        }
        for (String test : tests) {
            dumpMatches(test, runNrunM(3, 3));
        }
        for (String test : tests) {
            dumpMatches(test, runNrunM(2, 4));
        }
    }
}

This produces the following output:

foobababababaf foobaafoobaaaooo <- (?=(.)(?!\1)(.))(?:\1\2){2,}
  match
    0: [bababababa]
    1: [b]
    2: [a]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.)(?!\1)(.))(?:\1\2){2,}
foobababababaf foobaafoobaaaooo <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
  match
    0: [aaaooo]
    1: [a]
    2: [o]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
  match
    0: [yyyy axxayyyya zzz]
    1: [y]
    2: [z]
foobababababaf foobaafoobaaaooo <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
  match
    0: [xxyyyy]
    1: [x]
    2: [y]
  match
    0: [xxayyyy]
    1: [x]
    2: [y]

Explanation

  • (?=(.)(?!\1)(.))(?:\1\2){2,} has two parts
    • (?=(.)(?!\1)(.)) establishes \1 and \2 using lookahead
      • Nested negative lookahead ensures that \1 != \2
      • Using lookahead to capture lets \0 have the entire match (instead of just the "tail" end)
    • (?:\1\2){2,} matches the \1\2 sequence (without capturing it), which must repeat at least twice.
  • (?=(.))\1{N}.*?(?=(?!\1)(.))\2{M} has three parts
    • (?=(.))\1{N} captures \1 in a lookahead, and then matches it N times
      • Using lookahead to capture means the repetition can be N instead of N-1
    • .*? allows an infix to separate the two runs; the reluctant quantifier keeps it as short as possible
    • (?=(?!\1)(.))\2{M}
      • Similar to first part
      • Nested negative lookahead ensures that \1 != \2
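
For readers who prefer JavaScript (already used elsewhere in this thread), the two patterns explained above can be sketched as follows. This is only a port under the assumption that the engine keeps captures made inside a lookahead, which JavaScript does; the helper name `runNrunM` just mirrors the Java harness:

```javascript
// Build the "run of n, then run of m" pattern described above.
// Backslashes are doubled because the pattern is written as a template string.
function runNrunM(n, m) {
  return new RegExp(`(?=(.))\\1{${n}}.*?(?=(?!\\1)(.))\\2{${m}}`);
}

// Alternating characters: the lookahead captures two distinct characters,
// then the pair itself must appear at least twice.
const alternating = /(?=(.)(?!\1)(.))(?:\1\2){2,}/;

const m1 = "foobababababaf".match(alternating);
console.log(m1[0], m1[1], m1[2]); // bababababa b a

const m2 = "foobaafoobaaaooo".match(runNrunM(3, 3));
console.log(m2[0], m2[1], m2[2]); // aaaooo a o
```

The results agree with the Java output shown earlier: index 0 is the whole match, and indexes 1 and 2 are the two captured characters.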

The run regex also matches when the runs are longer than requested, e.g. run(2,2) finds a match inside "xxxyyy":

xxxyyy <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{2}
  match
    0: [xxxyy]
    1: [x]
    2: [y]

Also, it does not allow overlapping matches. That is, there is only one run(2,3) in "xx11yyy222".

xx11yyy222 <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{3}
  match
    0: [xx11yyy]
    1: [x]
    2: [y]
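
The non-overlapping behaviour can be checked with a small JavaScript sketch. The second pattern is not part of the answer above: it is an assumed variant that wraps the whole regex in a lookahead so that a match is attempted at every start position (the group numbers shift by one because of the extra outer group):

```javascript
const text = "xx11yyy222";

// run(2,3) as given above, with the g flag to iterate over all matches.
const run23 = /(?=(.))\1{2}.*?(?=(?!\1)(.))\2{3}/g;
const found = [...text.matchAll(run23)].map(m => m[0]);
console.log(found); // ["xx11yyy"]

// Hypothetical variant: the outer lookahead consumes nothing, so the
// search restarts at the next character and overlapping candidates appear.
const everyStart = /(?=((?=(.))\2{2}.*?(?=(?!\2)(.))\3{3}))/g;
const candidates = [...text.matchAll(everyStart)].map(m => m[1]);
console.log(candidates); // ["xx11yyy", "11yyy", "yyy222", "yy222"]
```

Note that each start position still yields only its shortest candidate, because the infix is reluctant; "11yyy222" is therefore not listed.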
polygenelubricants