I am trying to parse a phrase and exclude common words.

For instance, in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".

(\w+(?!the|as))

This doesn't work. Any feedback is appreciated.

+2  A: 

The lookahead should come first:

(\b(?!(the|as)\b)\w+\b)

I have also added word boundaries so that it only matches whole words. Without them, the pattern would correctly skip the complete word "as" but would then match the letter "s" of that word.
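As a quick check, here is a minimal Python sketch that runs the original pattern and the corrected one against the phrase from the question. Python's `re` module is an assumption, since the question does not name a regex engine, and `SAMPLE` and `content_words` are names made up for illustration.

```python
import re

SAMPLE = "as the world turns"  # phrase from the question

# Original attempt: the lookahead is checked *after* each word, where the
# next character is a space or the end of input, so it rejects nothing here.
ORIGINAL = re.compile(r"(\w+(?!the|as))")

# Corrected pattern: the lookahead runs at the start of each word.
# The inner group is non-capturing so findall() returns plain strings.
FIXED = re.compile(r"\b(?!(?:the|as)\b)\w+\b")


def content_words(text):
    """Return the words in text other than 'the' and 'as' (hypothetical helper)."""
    return FIXED.findall(text)


print(ORIGINAL.findall(SAMPLE))  # ['as', 'the', 'world', 'turns']
print(content_words(SAMPLE))     # ['world', 'turns']
```

Because of the `\b` inside the lookahead, longer words that merely start with an excluded word, such as "ask" or "them", are still returned.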

You might also want to consider what \w matches and whether that meets your needs. If you are looking for words in English, you are probably interested in letters but not digits, and you may wish to include some punctuation characters that \w excludes, such as apostrophes. You could try something like this instead (Rubular):

/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
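To see the difference the character class makes, here is a small Python sketch comparing the two patterns. Python is an assumption (the `/.../i` syntax above suggests Ruby or a similar engine), so the `i` flag becomes `re.IGNORECASE`; the sample sentence and the names `WORD_CHARS` and `LETTERS` are made up for illustration.

```python
import re

# The \w-based pattern and the letters/apostrophe/hyphen variant.
WORD_CHARS = re.compile(r"\b(?!(?:the|as)\b)\w+\b")
LETTERS = re.compile(r"\b(?!(?:the|as)\b)[a-z'-]+\b", re.IGNORECASE)

text = "As the world turns, don't stop"

# \w splits "don't" at the apostrophe, and without the i flag "As" slips through.
print(WORD_CHARS.findall(text))  # ['As', 'world', 'turns', 'don', 't', 'stop']

# The character class keeps "don't" whole; IGNORECASE also excludes "As".
print(LETTERS.findall(text))     # ['world', 'turns', "don't", 'stop']
```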

To match words more accurately in a human language, you could consider using a natural language parsing library instead of regular expressions.

Mark Byers
+1  A: 

You should use word boundaries to match only whole words, either with a look-ahead assertion:

(\b(?!(?:the|as)\b)\w+\b)

Or with a look-behind assertion:

(\b\w+\b(?<!\b(?:the|as)))
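
Not every engine accepts alternatives of different lengths inside a look-behind. Python's `re` module, for one, requires each look-behind to have a fixed width and rejects the pattern above. A possible workaround there, sketched under the assumption that Python is the target, is one look-behind per excluded word (the name `LOOKBEHIND` is made up for illustration):

```python
import re

# One fixed-width look-behind per excluded word, in place of
# (?<!\b(?:the|as)), which Python's re module refuses to compile.
LOOKBEHIND = re.compile(r"\b\w+\b(?<!\bthe)(?<!\bas)")

print(LOOKBEHIND.findall("as the world turns"))  # ['world', 'turns']

# Words that only end in an excluded word are kept, thanks to the inner \b.
print(LOOKBEHIND.findall("bathe in gas"))        # ['bathe', 'in', 'gas']
```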
Gumbo