views:

67

answers:

2

Is it possible to create a regex that matches all strings with five a's and five b's?

Like aaaaabbbbb or ababababab or aabbaabbab.

I imagine it would require polynomial time for a deterministic engine.

Are there other matching languages which would enable such matching?

Update:

I wanted to use the kind of expression for searching, so I changed the one purposed to (?=b*ab*){5}(?=a*ba*){5}([ab]{10}) and it works nicely! :) I'm still uncertain with respects to the performance of an expression like that. But I guess I can just look up lookahead expressions.

I'm still curious to which other kinds of patterns, that are simple to explain but hard to regex, are out there?

+5  A: 

You can use lookahead assertions:

^(?=(?:[^a]*a){5}[^a]*$)(?=(?:[^b]*b){5}[^b]*$)
Gumbo
You might want to add a `.{10}$` after the lookahead. The specs are unclear, but from the examples it seems the string should contain exactly 5 a's and 5 b's and nothing else.
Tim Pietzcker
@Tim Pietzcker: In that cas I would rather replace the `[^a]` with `b` and `[^b]` with `a`.
Gumbo
Actually I used `(?=(?:[^a]*a){5}[^a]*)(?=(?:[^b]*b){5}[^b]*)(.{10})` to search the string `aababababab` and it found the wrong part: `aababababa`.I don't see a way to fix this unless you can set a character limit on the lookaheads?
Thomas Ahle
@Thomas Ahle: So you rather want to *find* sequences with five `a`s and five `b`s?
Gumbo
Well, find, match and sub. Like you normally do with regexes.But mostly find.
Thomas Ahle
@Thomas Ahle: Then try this one: `[ab]*(?=(?:b*a){5})(?=(?:a*b){5})(.{10})`. But it’s getting more and more inefficient.
Gumbo
Hm, doesn't that one just rightalign the findings? Now it breaks on `aabababababb` => `bababababb`
Thomas Ahle
+6  A: 

I have all these screws. To hammer them into this piece of wood, should I use a claw or ball-peen hammer?

That's (roughly) what your question is asking. What you should do is just loop through each character of the string. I can do it in C. Watch:

int validate(char *s)
{
    int a = 0, a = 0;
    while(*s)
      {
        switch(*s++)
        {
        case 'a':
            a++;
            break;
        case 'b':
            b++;
            break;
        }
      }
    return a == 5 && b == 5;
}

It is left as an excercise to you to a) convert this to your language of choice, b) modify this to match only consecutive sequences of 'a's and 'b's (if you like) or tweak it to your other specific requirements.

The basic point is that there are much better tools for this job than regex, so unless "a" and "b" are stand-ins for more complicated regular expressions, don't use regexes for this. And even if "a" and "b" are really more complicated regexes, you don't have to solve all problems with One Regex To Rule Them All. You can mix a few useful regexes and a loop of code (like the above) to much greater effect than an enormous (and unmaintainable) Regex-zilla.

Chris Lutz
I like the idea of applying more than one matching regex to a string like filters.However this approach will surely make it hard to use regex findall and sub?
Thomas Ahle
Yes, but the code I gave can (again) be tweaked to find all instances (or find the indices of instances) of matches. You can write variants of the function that will do `findall` and `sub` functions. You can even make a module that will allow you to compile several regexes into one mega-regex object and specify complex matching behavior (like `superegex.compile("a^5 b^5", a="a regex", b="b regex")` perhaps) to make this job easier. It'll probably be easier to understand and maintain (and more useful in the future when you have to do this job again).
Chris Lutz
Also, when asking regex questions, you should still specify what language you're using. Different languages handle regexes differently, so even if we can give you a pure-regex solution, it may not work on your language's regex flavor. Plus, telling us that you're using (I presume) Python can help if we think you would be better off using an alternative approach.
Chris Lutz