Using Java regular expressions, I want to match any sentence that contains the word "Mary" and the word "are" in that order, but does NOT contain "Bob" between "Mary" and "are".

Eg: Mary and Rob are married <- MATCH
Eg: Mary and John and Michael became good friends and are living together <- MATCH
Eg: Mary, Rob and Bob are dead <- does not MATCH

Any ideas?

+1  A: 
 (?m)^(?:(?<!\bare\b).)*?Mary(?:(?<!\bBob\b).)+are.*?$

should do it.

A couple of fixed-length negative look-behinds ensure that:

  • "Mary" is not preceded by the word "are"
  • "are" is not preceded by the word "Bob"

It reads:

  • ^: anchor; start matching at the beginning of the line ((?m) lets ^ and $ match at line boundaries)
  • (?: : open a group that is not captured
  • (?<!\bare\b).: any non-newline character that is not immediately preceded by the word "are" ("Mare" would not prevent the next character from matching, but in "... are x" the space after "are" cannot match, so the group stops there): see word boundaries
  • )*?: repeat the group zero or more times, as few times as possible

  • same principle for "are" (not preceded by "Bob" as a word); the + requires at least one character between "Mary" and "are"

  • .*?$: 0 to n characters after "are", up to the end of the line.

More on regular-expressions.info.
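
To make the word-boundary remark concrete, here is a minimal sketch (not part of the original answer) that runs the regex above against two sentences where a longer word merely contains "are" or "Bob". The class name WordBoundaryDemo, the helper found and the test sentences are made up for illustration.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Regex copied from the answer; backslashes doubled for the Java string literal.
    static final Pattern P = Pattern.compile(
            "(?m)^(?:(?<!\\bare\\b).)*?Mary(?:(?<!\\bBob\\b).)+are.*?$");

    // Hypothetical helper: true if the pattern is found anywhere in s.
    static boolean found(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        // "Mare" ends in "are" but is not the word "are", so it does not block the match.
        System.out.println(found("Mare saw Mary and they are happy")); // true
        // "Bobby" starts with "Bob" but is not the word "Bob", so it does not block either.
        System.out.println(found("Mary and Bobby are friends"));       // true
        // The whole word "Bob" between "Mary" and "are" blocks the match.
        System.out.println(found("Mary and Bob are friends"));         // false
    }
}
```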

So the Pattern:

Pattern.compile("(?m)^(?:(?<!\\bare\\b).)*?Mary(?:(?<!\\bBob\\b).)+are.*?$");

would return 2 matches out of the three lines:

Eg: Mary and Rob are married <- MATCH
Eg: Mary and John and Michael became good friends and are living together <- MATCH
Eg: Mary, Rob and Bob are dead <- does not MATCH
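
One way to check the "2 matches out of three lines" claim is to feed all three sentences as a single multi-line string and collect what find() returns. This is only a sketch: the class name MultilineMatchDemo and the helper matchingLines are invented here, and the sentences are used without the "Eg:" prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineMatchDemo {
    // Same pattern as above; (?m) makes ^ and $ match at every line boundary.
    static final Pattern P = Pattern.compile(
            "(?m)^(?:(?<!\\bare\\b).)*?Mary(?:(?<!\\bBob\\b).)+are.*?$");

    // Hypothetical helper: returns every line of text that the pattern matches.
    static List<String> matchingLines(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Mary and Rob are married",
                "Mary and John and Michael became good friends and are living together",
                "Mary, Rob and Bob are dead");
        // Prints the first two lines only; the third is rejected because of "Bob".
        for (String line : matchingLines(text)) {
            System.out.println(line);
        }
    }
}
```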
VonC
Regular expressions make my head explode. Is there a primer somewhere that does a good job of explaining everything to layfolk? I'm voting this answer up just for the fact that you know that giant string actually has meaning...
Chris Sobolewski
+3  A: 

A bit shorter version:

(?m)^.*\bMary\b((?!\bBob\b).)*\bare\b.*$


public class Main {
    public static void main(String[] args) {
        String[] tests = {
                "Mary and Rob are married",
                "Mary and John and Michael became good friends and are living together",
                "Mary, Rob and Bob are dead"
        };
        String regex = "(?m)^.*\\bMary\\b((?!\\bBob\\b).)*\\bare\\b.*$";
        for(String t : tests) {
            System.out.println(t.matches(regex) + " -> " + t);
        }
    }
}
Bart Kiers
Shorter and much, much clearer.
Alan Moore
-1: "are" can precede "Mary". I have updated my regex to simplify it.
VonC
@VonC: the OP asked 'contains the word "Mary" and the word "are" in that order', which is what my regex does. It does not say that there cannot be another occurrence of the word 'are' before 'Mary'.
Bart Kiers
I would interpret that differently, but I guess only the OP could settle this ;) Since it is in doubt, let me make a small edit in order to cancel your -1
VonC
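
The two readings discussed in these comments can be compared on a sentence where "are" also appears before "Mary". The sketch below is only an illustration: the names InterpretationDemo, STRICT and LENIENT and the test sentence are invented here, and the two patterns are copied from the two answers above.

```java
import java.util.regex.Pattern;

final class InterpretationDemo {
    // First answer: no word "are" may appear before "Mary" on the line.
    static final Pattern STRICT = Pattern.compile(
            "(?m)^(?:(?<!\\bare\\b).)*?Mary(?:(?<!\\bBob\\b).)+are.*?$");
    // Second answer: anything may precede "Mary".
    static final Pattern LENIENT = Pattern.compile(
            "(?m)^.*\\bMary\\b((?!\\bBob\\b).)*\\bare\\b.*$");

    static boolean strict(String s) {
        return STRICT.matcher(s).find();
    }

    static boolean lenient(String s) {
        return LENIENT.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "We are sure Mary and Rob are married";
        System.out.println(strict(s));  // false: the word "are" precedes "Mary"
        System.out.println(lenient(s)); // true: only the part after "Mary" is constrained
        // Both agree when nothing but "Mary" starts the sentence.
        System.out.println(strict("Mary and Rob are married"));  // true
        System.out.println(lenient("Mary and Rob are married")); // true
    }
}
```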
+2  A: 

As I write, there are two great answers which do it in one regex.

I want to suggest that unless you're optimising for performance (and remember, premature optimisation is bad, m'kay?) it's worth splitting into more, simpler, regexes, and using language features for readability.

Not that complex regexes are always efficient anyway -- it's easy to accidentally write a regex that backtracks all over the place.

It's also kind to readers of your code who may be unfamiliar with the more exotic features of whatever regex dialect you have.

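import java.util.regex.Matcher;
import java.util.regex.Pattern;

boolean isMatch(String s) {
    // First pass test: "Mary" followed later by "are", both as whole words.
    // The backslashes are doubled because "\b" in a Java string literal is a backspace.
    Pattern basicPattern = Pattern.compile("\\bMary\\b.*\\bare\\b");
    // ... and a test for exclusions: "Bob" between "Mary" and "are"
    String rejectRE = "\\bMary\\b.*\\bBob\\b.*\\bare\\b";

    Matcher m = basicPattern.matcher(s);

    while(m.find()) {
         // We have a candidate match
         if(! m.group().matches(rejectRE)) {
              // and it passed the secondary test
              return true;
         }
    }

    // we fell through
    return false;
}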
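
As a variation on the same two-step idea, here is a self-contained sketch that compiles both patterns once and uses a reluctant .*? so that each candidate ends at the first "are" after "Mary". The class name TwoStepMatcher and the constants CANDIDATE and BOB are made up for this example, and the reluctant quantifier is a deliberate departure from the snippet above, not a restatement of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepMatcher {
    // Step 1: shortest stretch from the word "Mary" to the next word "are".
    private static final Pattern CANDIDATE = Pattern.compile("\\bMary\\b.*?\\bare\\b");
    // Step 2: the word "Bob" anywhere inside that stretch disqualifies it.
    private static final Pattern BOB = Pattern.compile("\\bBob\\b");

    static boolean isMatch(String s) {
        Matcher m = CANDIDATE.matcher(s);
        while (m.find()) {
            if (!BOB.matcher(m.group()).find()) {
                return true; // a "Mary ... are" stretch without "Bob"
            }
        }
        return false; // no candidate, or every candidate contained "Bob"
    }

    public static void main(String[] args) {
        System.out.println(isMatch("Mary and Rob are married"));   // true
        System.out.println(isMatch("Mary, Rob and Bob are dead")); // false
        // The first "are" closes the candidate, so a later "Bob" does not matter.
        System.out.println(isMatch("Mary and Rob are sure Bob and Al are late")); // true
    }
}
```

Because find() resumes after the previous candidate, a second "Mary" inside a rejected candidate is not retried, so this sketch can disagree with the single-regex answers on such sentences.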
slim
Interesting approach too. +1
VonC