I'm trying to create a Regex usable in C# that will take a list of single letters and/or letter groups and ensure that a word is composed only of items from that list. For instance:

  • 'a' would match 'a', 'aa', 'aaa', but not 'ab'
  • 'a b' would match 'a', 'ab', 'abba', 'b', but not 'abc'
  • 'a b abc' would match 'a', 'ab', 'abc', 'aabc', 'baabc', but not 'ababac'

I thought something of the form

(a|b|abc)*

would work, but it incorrectly matches the last example, 'ababac'. Here's the code I'm testing with:

[Fact]
public void TestRegex()
{
    Regex regex = new Regex("(a|b|abc)*");
    regex.IsMatch("a").ShouldBeTrue();
    regex.IsMatch("b").ShouldBeTrue();
    regex.IsMatch("abc").ShouldBeTrue();
    regex.IsMatch("aabc").ShouldBeTrue();
    regex.IsMatch("baabc").ShouldBeTrue();

    // This should not match ... I don't think anyway
    regex.IsMatch("ababac").ShouldBeFalse();
}

I have a pretty basic understanding of regex, so apologies if I'm missing something obvious here :)

Update: I don't understand why your counter-example is a counter-example: ababac = a b a bac. Could you clarify?

I only want to use 'a', 'b', and 'abc' - 'bac' would be a completely different term.

Let me give another example: Using 'ba' and 't', I could match the word 'bat', but not 'tab'. The order of the letters inside the letter groups is important.

(Tests with Diadistis' solution)

    [Fact]
    public void TestRegex()
    {
        Regex regex = new Regex(@"\A(?:(e|l|ho)*)\Z");
        regex.IsMatch("e").ShouldBeTrue();
        regex.IsMatch("l").ShouldBeTrue();
        regex.IsMatch("ho").ShouldBeTrue();
        regex.IsMatch("elho").ShouldBeTrue();
        regex.IsMatch("hole").ShouldBeTrue();
        regex.IsMatch("holle").ShouldBeTrue();
        regex.IsMatch("hello").ShouldBeFalse();
        regex.IsMatch("hotel").ShouldBeFalse();
    }
+4  A: 

I am not quite sure what you are trying to do, but in order for the last one to be false you should check whether the string can be matched in its entirety:

Regex regex = new Regex(@"\A(?:(a|b|abc)*)\Z");
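To illustrate why the anchors matter, here is a minimal sketch (the class name AnchorDemo is made up for this example). Without anchors, IsMatch succeeds as soon as any substring matches, and because of the * even an empty substring counts, so the original pattern can never report false:

```csharp
using System;
using System.Text.RegularExpressions;

class AnchorDemo
{
    static void Main()
    {
        var unanchored = new Regex("(a|b|abc)*");
        var anchored = new Regex(@"\A(?:(a|b|abc)*)\Z");

        // Unanchored: a match anywhere in the input is enough.
        Console.WriteLine(unanchored.IsMatch("ababac"));     // True
        // The match stops before the trailing 'c' instead of failing.
        Console.WriteLine(unanchored.Match("ababac").Value); // ababa
        // Zero repetitions match the empty string at position 0.
        Console.WriteLine(unanchored.IsMatch("xyz"));        // True

        // Anchored: the repetitions must cover the whole input.
        Console.WriteLine(anchored.IsMatch("ababac"));       // False
        Console.WriteLine(anchored.IsMatch("baabc"));        // True
    }
}
```

Note that the anchored pattern still accepts the empty string, since * allows zero repetitions; replacing * with + would require at least one term.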
Diadistis
I'm trying to take a list of single letters and/or groups of letters and identify whether a word is comprised solely of those terms. If my list of terms is (e, l, ho) then 'hole' would be a valid match, but 'hello' and 'hotel' would not.
Jedidja
I don't think there's any value in capturing the individual chunks if there's no value in capturing the whole thing; \A(?:a|b|abc)*\Z works just as well.
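As a rough sketch of how the non-capturing form could be built from an arbitrary list of terms: the helper name BuildTermsRegex and the class TermMatcher are hypothetical, not part of any library. Regex.Escape is used on the assumption that every term should be treated as literal text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TermMatcher
{
    // Hypothetical helper: builds \A(?:t1|t2|...)*\Z from literal terms.
    public static Regex BuildTermsRegex(params string[] terms)
    {
        // Escape each term so characters like '.' or '+' stay literal.
        string alternation = string.Join("|", terms.Select(t => Regex.Escape(t)));
        return new Regex(@"\A(?:" + alternation + @")*\Z");
    }

    static void Main()
    {
        Regex batRegex = BuildTermsRegex("ba", "t");
        Console.WriteLine(batRegex.IsMatch("bat"));   // True
        Console.WriteLine(batRegex.IsMatch("tab"));   // False

        Regex holeRegex = BuildTermsRegex("e", "l", "ho");
        Console.WriteLine(holeRegex.IsMatch("hole"));  // True
        Console.WriteLine(holeRegex.IsMatch("hello")); // False
    }
}
```

Since the .NET engine backtracks, the order of the alternatives does not change whether a full match exists; it only affects which alternatives are tried first.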
Axeman
+2  A: 

Try bracketing your regex with ^ and $ to ensure that exactly the whole line is considered:

^(a|b|abc)*$
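A small sketch of how this variant behaves in .NET, assuming no RegexOptions.Multiline is set (the class name CaretDollarDemo is made up for this example). One detail worth knowing: in .NET, $ (like \Z) also matches just before a trailing newline, whereas \z matches only at the very end of the input.

```csharp
using System;
using System.Text.RegularExpressions;

class CaretDollarDemo
{
    static void Main()
    {
        var caretDollar = new Regex("^(a|b|abc)*$");
        var strictEnd = new Regex(@"\A(?:a|b|abc)*\z");

        Console.WriteLine(caretDollar.IsMatch("abba"));    // True
        Console.WriteLine(caretDollar.IsMatch("ababac"));  // False

        // $ tolerates a single trailing newline; \z does not.
        Console.WriteLine(caretDollar.IsMatch("abba\n"));  // True
        Console.WriteLine(strictEnd.IsMatch("abba\n"));    // False
    }
}
```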
Michael Burr
This one is more useful for those who don't use C# regexps.
Arkadiy