What I suspect you're after is this:
import java.util.*;
import java.util.regex.*;

List<String> findPhrases(String s, String... phrases) {
    return findPhrases(s, Arrays.asList(phrases));
}

List<String> findPhrases(String s, Collection<String> phrases) {
    if (phrases.isEmpty()) {
        throw new IllegalArgumentException("must specify at least one phrase");
    }
    // Join the phrases into one alternation, quoting each phrase so that
    // regex special characters in it are matched literally.
    StringBuilder sb = new StringBuilder();
    Iterator<String> iter = phrases.iterator();
    sb.append(Pattern.quote(iter.next()));
    while (iter.hasNext()) {
        sb.append("|");
        sb.append(Pattern.quote(iter.next()));
    }
    // \b is zero-width, so the delimiters around a phrase are not consumed.
    Pattern p = Pattern.compile("\\b(" + sb + ")\\b");
    Matcher m = p.matcher(s);
    List<String> ret = new ArrayList<String>();
    while (m.find()) {
        ret.add(m.group(1));
    }
    return ret;
}
One important difference here is that I've used \b rather than \W to delimit words. \b is a zero-width match at a word boundary: the transition from a word character to a non-word character or vice versa. The start and end of the string also count as boundaries when the adjacent character is a word character.
Zero-width means it doesn't consume a character from the input like \W does.
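To make the difference concrete, here is a minimal sketch. The class name BoundaryDemo, the helper findAll and the sample input are made up for illustration; only Pattern and Matcher come from java.util.regex. With \W as the delimiter, the space after the first word is consumed and is no longer available to start the second match, so one occurrence is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: collects group 1 of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "foo foo foo";
        // \W consumes the separating space, so the middle "foo" is missed.
        System.out.println(findAll("(?:\\W|^)(foo)(?:\\W|$)", input)); // [foo, foo]
        // \b consumes nothing, so all three occurrences are found.
        System.out.println(findAll("\\b(foo)\\b", input)); // [foo, foo, foo]
    }
}
```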
Edit: you seem to have two problems:
1. \W is consuming characters from your input; and
2. You have regex special characters in your phrases.
(1) can be handled several ways. My approach above is to use \b instead: it is zero-width, so it leaves the delimiter in place for the next match, and it is the cleaner solution. You can also use other zero-width assertions like lookaheads and lookbehinds:
(?<=\W|^)...(?=\W|$)
but that's basically equivalent to:
\b...\b
which is far easier to read.
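As a rough check of that equivalence, the sketch below runs both forms over the same input. The class name LookaroundDemo, the helper count and the sample strings are assumptions made for illustration. The two forms agree as long as the phrase itself starts and ends with a word character; they can differ when it does not, because \b needs a word character on one side of the boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Hypothetical helper: counts the matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "foo foo foo";
        // Lookbehind and lookahead only assert; they consume no characters.
        System.out.println(count("(?<=\\W|^)foo(?=\\W|$)", input)); // 3
        System.out.println(count("\\bfoo\\b", input));              // 3

        // A phrase ending in a non-word character: there is no word
        // boundary between "+" and " ", so the \b form finds nothing.
        String cpp = "I like C++ a lot";
        System.out.println(count("(?<=\\W|^)C\\+\\+(?=\\W|$)", cpp)); // 1
        System.out.println(count("\\bC\\+\\+\\b", cpp));              // 0
    }
}
```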
(2) can be handled by quoting the phrases before they go into the pattern. I've amended the above code to call Pattern.quote() so that any regex special characters are matched literally.
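A small sketch of what quoting changes, assuming a made-up phrase "a.b" and a hypothetical helper named found. Unquoted, the dot is a wildcard and the pattern also matches "axb"; after Pattern.quote() it only matches a literal dot.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String phrase = "a.b";
        String raw = "\\b(" + phrase + ")\\b";
        String quoted = "\\b(" + Pattern.quote(phrase) + ")\\b";

        System.out.println(found(raw, "axb"));    // true: "." matches any character
        System.out.println(found(quoted, "axb")); // false: "." is now literal
        System.out.println(found(quoted, "a.b")); // true
    }
}
```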