How do I write a Pattern (Java) to match any sequence of characters except a given list of words?

I need to find out whether a given piece of code contains any text surrounded by tags like <span>, other than a given list of words. For example, I want to check whether any words besides "one" and "two" are surrounded by the <span> tag.

"This is the first tag <span>one</span> and this is the third <span>three</span>"

The pattern should match the above string because the word "three" is surrounded by the <span> tag and is not part of the list of given words ("one", "two").

+1  A: 

Use this:

if (!Pattern.matches(".*(word1|word2|word3).*", "word1")) {
    System.out.println("We're good.");
}

You're checking that the pattern does not match the string.

sjbotha
Thanks for your response, but this will not work. I added more information to the description of the problem.
Mario
+3  A: 

Look-ahead can do this:

\b(?!your|given|list|of|exclusions)\w+\b

Matches

  • a word boundary (start-of-word)
  • not followed by any of "your", "given", "list", "of", "exclusions"
  • followed by one or more word characters
  • followed by a word boundary (end-of-word)

In effect, this matches any word that is not excluded.
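
As a rough illustration, the look-ahead above could be used from Java along these lines. This is only a sketch: the class name `ExcludedWords` and the method `wordsNotExcluded` are made up for the example. It also adds a `\b` inside the look-ahead, which is a small change to the pattern shown above; without it, any word that merely starts with an excluded word (such as "often", which starts with "of") would be skipped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExcludedWords {
    // Look-ahead from the answer, with an extra \b inside it so that only
    // whole words are excluded ("of" is skipped, "often" is kept).
    static final Pattern NOT_EXCLUDED =
            Pattern.compile("\\b(?!(?:your|given|list|of|exclusions)\\b)\\w+\\b");

    // Hypothetical helper: collects every word that is not on the exclusion list.
    static List<String> wordsNotExcluded(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = NOT_EXCLUDED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [often, used, words]
        System.out.println(wordsNotExcluded("your list of often used words"));
    }
}
```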

Tomalak
+3  A: 

This should get you started.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

// >(?!one<|two<)(\w+)</
// 
// Match the character “>” literally «>»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!one<|two<)»
//    Match either the regular expression below (attempting the next alternative only if this one fails) «one<»
//       Match the characters “one<” literally «one<»
//    Or match regular expression number 2 below (the entire group fails if this one fails to match) «two<»
//       Match the characters “two<” literally «two<»
// Match the regular expression below and capture its match into backreference number 1 «(\w+)»
//    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters “</” literally «</»

// subjectString is the text to search; it is assumed to be defined elsewhere.
List<String> matchList = new ArrayList<String>();
try {
    Pattern regex = Pattern.compile(">(?!one<|two<)(\\w+)</");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // Group 1 holds the word between the tags that is not on the list
        matchList.add(regexMatcher.group(1));
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
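
If all you need is a yes/no answer to the original question, the same negative lookahead can be wrapped in a small helper. The following is a sketch under some assumptions: the class name `SpanWordCheck` and the method `hasUnlistedWord` are invented for this example, the tag is assumed to be a plain `<span>` with no attributes, and each tag is assumed to contain a single word with no surrounding whitespace.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SpanWordCheck {
    // Hypothetical helper: true if some <span> contains a word that is not in 'allowed'.
    static boolean hasUnlistedWord(String html, List<String> allowed) {
        // Quote each word so it is treated literally, then join into "one|two|..."
        String alternation = allowed.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // The lookahead includes the closing tag, so "one" is rejected but "onerous" is not
        Pattern other = Pattern.compile(
                "<span>(?!(?:" + alternation + ")</span>)(\\w+)</span>");
        return other.matcher(html).find();
    }

    public static void main(String[] args) {
        List<String> allowed = List.of("one", "two");
        String subject = "This is the first tag <span>one</span> "
                + "and this is the third <span>three</span>";
        System.out.println(hasUnlistedWord(subject, allowed)); // true
        System.out.println(hasUnlistedWord("<span>one</span> <span>two</span>", allowed)); // false
    }
}
```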
Lieven
I think you might want to change the "one" and "two" in the pattern to "one<" and "two<" so you can still match things that start with either of those.
Marty Lamb
@Marty - you're right. I'll update the answer.
Lieven