What do I use to search for multiple words in a string? I would like the logical operation to be AND, so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple of common words, such as "the" and "and", but I would like it to match all the words I specify.

A: 

Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
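
The iterative approach described above can be sketched as follows. This is only an illustration in Python's re module (the same idea works with PCRE or Perl); the function name contains_all_words and the sample paragraphs are made up for the example.

```python
import re

def contains_all_words(text, words):
    """Return True only if every word occurs in text as a whole word."""
    # One small regex per word; re.escape guards against metacharacters.
    return all(
        re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
        for word in words
    )

# Hypothetical sample data: one nonsense paragraph, one English paragraph.
paragraphs = [
    "Xq zvb the klm prw.",
    "The cat sat on the mat and purred.",
]
matches = [p for p in paragraphs if contains_all_words(p, ["the", "and"])]
print(matches)
```

Each word is checked independently, so the order in which the words appear does not matter and no permutations need to be generated.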

Jonathan Leffler
Why do the N things have to be regexes though? Could just use "index" here.
\b(foo|bar|baz)\b.*\b(?!\1)(foo|bar|baz)\b.*\b(?!\1)(?!\2)(foo|bar|baz)\b ought to handle permutations by using back references and negative lookahead to avoid matching a word twice. It's still properly evil, but at least the pattern length isn't O(N!)
stevemegson
@BKB: I'm not sure what you mean by using an index.
Jonathan Leffler
@SteveMegson: Yes, I think I see what you're up to. I was not sure of the scope of negative lookahead (a relatively new feature of Perl, at least compared with when I was really learning it, back in the days of 4.x and 5.[0-6]), so I was not dogmatic in my answer. As you say, not nice, but not combinatorial either.
Jonathan Leffler
+2  A: 

Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?

Something like:

\b(word1|word2)\b(\W+\w+)*\W+(word1|word2)\b(\W+\w+)*\.

(where \b is the word boundary in your language) would match a complete sentence that contained either of the two words or both.

You'd probably need to make it case insensitive, so that a word appearing at the start of the sentence will still match.

brass-kazoo
Doesn't that just match a sentence that contains two words, either word1 followed by word2, or word2 followed by word1 (as desired), or word1 followed by word1, or word2 followed by word2 (as not desired)? That was the sort of problem I ran into when trying to answer.
Jonathan Leffler
A: 

AND as concatenation

^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)

OR as alternation

^(?=.*?\b(?:word1|word2|word3)\b)
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
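
As a rough check of the two forms, here is a small sketch using Python's re module, with "the" and "and" standing in for word1 and word2. The sample strings are invented for illustration.

```python
import re

# AND: every lookahead must succeed from the start of the string.
all_words = re.compile(r"^(?=.*?\bthe\b)(?=.*?\band\b)")
# OR: a single lookahead with an alternation succeeds if any word is present.
any_word = re.compile(r"^(?=.*?\b(?:the|and)\b)")

samples = ["salt and pepper", "the salt and the pepper", "zxq vrb"]
for s in samples:
    # Print whether each sample satisfies the AND and the OR pattern.
    print(s, bool(all_words.search(s)), bool(any_word.search(s)))
```

Note that "." does not match a newline by default, so a paragraph spanning several lines would need the flavor's "dot matches newline" option (re.DOTALL in Python).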
MizardX
A: 

Maybe using http://en.wikipedia.org/wiki/Language_recognition_chart#English to recognize English would work. Some quick tests seem to work (this assumes paragraphs separated by newlines only).

The regexp matches if any one of those alternatives is found: \bword\b matches a whole word delimited by boundaries, word\b matches a word ending, and a bare word matches anywhere in the paragraph.

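# Assumes $text already holds the input, one paragraph per line.
my @paragraphs = split(/\n/, $text);
for my $p (@paragraphs) {
    # Whole words, letter clusters, and suffixes that are typical of English.
    if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
        print "Probable English\n$p\n";
    }
}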
Vinko Vrsalovic
I wouldn't recommend 'on' to detect English. It means 'he' in many Slavic languages (as I'm sure Vinko knows ;).
Thomas Bratt
+1  A: 

Many regular expression flavors support a "lookaround" condition that lets you search for a term within a string without consuming any of it, so the search for the next term starts again from the beginning of the string. This allows searching a string for a group of words in any order.

The regular expression for this is:

^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)

Where "\b" is a word boundary and "(?=...)" is a positive lookahead, one form of lookaround.

If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
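
A minimal sketch of that loop, written in Python for illustration (the same string building works in any language with lookahead support). The helper name build_all_words_pattern is hypothetical, and the case-insensitive and dot-matches-newline flags are assumptions added for convenience, not part of the expression above.

```python
import re

def build_all_words_pattern(words):
    """Build one anchored pattern with a lookahead per required word."""
    parts = ["^"]
    for word in words:
        # Wrap each escaped word in its own lookahead and append it.
        parts.append(r"(?=.*\b" + re.escape(word) + r"\b)")
    # DOTALL lets "." cross newlines; IGNORECASE also accepts "The" for "the".
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

pattern = build_all_words_pattern(["the", "and", "is"])
print(pattern.pattern)
print(bool(pattern.search("This is the cat and the dog")))
print(bool(pattern.search("The cat and the dog")))
```

Escaping each word keeps any regex metacharacters in the search terms from being interpreted as part of the pattern.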