I'm trying to pull all sentences of, say, at least 5 words from a text in PHP. Assuming sentences end with a full stop, question mark, or exclamation mark, I came up with this:

 /[\w]{5,*}[\.|\?|\!]/ 

Any idea what's wrong?

Also, what needs to be done for this to work with UTF-8?

+3  A: 

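\w only matches a single character. A single word would be \w+. Also, {5,*} is not a valid quantifier (an open-ended range is written {5,}), and inside a character class | is a literal character, so [.?!] is enough. If you need at least 5 words, you could do something like: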

/(\w+\s){4,}\w+[.?!]/

i.e. at least 4 words followed by spaces, followed by another word followed by a sentence delimiter.
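As a rough sketch of how this pattern could be used from PHP, the snippet below wraps it in preg_match_all. The function name find_long_sentences and the sample text are made up for illustration, and the pattern is used exactly as written above.

```php
<?php
// Hypothetical helper: returns every match of the "at least 5 words" pattern.
function find_long_sentences(string $text): array
{
    // Four or more "word + whitespace" groups, then a final word and a delimiter.
    preg_match_all('/(\w+\s){4,}\w+[.?!]/', $text, $matches);

    // $matches[0] holds the full matches.
    return $matches[0];
}

// Made-up sample text.
$text = "Too short. This sentence has at least five words! Why not? "
      . "Is this one long enough to match?";

print_r(find_long_sentences($text));
```

Note that the pattern only allows a single whitespace character between words, so a sentence containing a comma or a double space would be matched only from the point after that interruption, if at all.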

casablanca
+1 for being faster
bitmask
A: 

I agree with the solution posted here. If you're using the preg functions in PHP, you can add the 'u' pattern modifier to make this work with UTF-8, for example: /(\w+\s){4,}\w+[.?!]/u
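A minimal sketch of what that could look like, assuming the source file and the input string are both UTF-8 encoded and that the PCRE build treats \w as Unicode-aware under the 'u' modifier. The function name and the Czech sample text are only for illustration.

```php
<?php
// Hypothetical helper: same pattern as above, with the 'u' modifier added.
function find_long_sentences_utf8(string $text): array
{
    // 'u' makes PCRE treat both pattern and subject as UTF-8.
    preg_match_all('/(\w+\s){4,}\w+[.?!]/u', $text, $matches);
    return $matches[0];
}

// Made-up sample: a six-word sentence followed by a two-word one.
$text = "Příliš žluťoučký kůň úpěl ďábelské ódy. Krátká věta.";

print_r(find_long_sentences_utf8($text));
```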

Viktor Stískala
A: 

A method without regex:

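$str = "this is a more than five word sentence. But this is not. Neither this. NO";

// Split the text into sentences on full stops.
$sentences = explode(".", $str);
foreach ($sentences as $s) {
    // Split on spaces and drop the empty strings left by leading or repeated spaces.
    $words = array_filter(explode(' ', $s), 'is_notempty');
    // "At least 5 words", as asked in the question.
    if (count($words) >= 5) {
        echo "Found matching sentence : $s" . "<br/>";
    }
}

// True for any non-empty string; unlike !empty(), this keeps the word "0".
function is_notempty($x)
{
    return $x !== '';
}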

This outputs:

Found matching sentence : this is a more than five word sentence

shamittomar
Note that you can only "explode" with a single delimiter. The OP stated that sentences could end with any of `.?!`.
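One possible way around that, as a sketch only: preg_split accepts a character class, so all three delimiters can be handled in one pass. The function name long_sentences, its $min parameter, and the sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: sentences with at least $min words, split on . ? or !
function long_sentences(string $text, int $min = 5): array
{
    $result = [];

    // Split on any run of sentence delimiters and skip empty pieces.
    $sentences = preg_split('/[.?!]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $s) {
        $s = trim($s);

        // Count whitespace-separated words.
        $words = preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) >= $min) {
            $result[] = $s;
        }
    }

    return $result;
}

// Made-up sample using all three delimiters.
$str = "this is a more than five word sentence. But is this not? Neither this! NO";
print_r(long_sentences($str));
```

Unlike the regex answers, this drops the delimiter from each returned sentence.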
casablanca