I just need to check whether a paragraph contains a "stop word"; the stop words are in the array below.

I had the code as:

$pattern_array = array("preheat", "minutes", "stir", "heat", "put", "beat", "bowl", "pan");

    foreach ($pattern_array as $pattern) {
      if (preg_match('/'.$pattern.')/i', $paragraph)) {
        $stopwords = 1;
      }
    }

This works well enough, but for short words like 'pan', a word like 'panko' is also identified as a stop word.

So the regex would need to require that the word is preceded by a space or the start of a line, and followed by a full stop, space, comma, or other non-word character.

Also, how can I tell PHP to exit the loop as soon as a stop word is identified?

Thanks guys, slowly learning regex as I go!

+1  A: 

I haven't tried this, but \b should be the assertion you're looking for. From the PHP manual:

 \b   word boundary

Your code would then look something like this:

$pattern_array = array("preheat", "minutes", "stir", "heat", "put", "beat", "bowl", "pan");

foreach ($pattern_array as $pattern) {
  if (preg_match('/\b'.$pattern.'\b/i', $paragraph)) { // also removed the ')'
    $stopwords = 1;
    break; // to exit the loop
  }
}
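
A quick way to check the \b version against the 'panko' case from the question. This is only a minimal sketch: the containsStopWord helper name and the sample sentences are made up for illustration, and preg_quote is added as a precaution in case a stop word ever contains regex metacharacters.

```php
<?php
// Hypothetical helper (not from the question): returns true as soon as
// any stop word appears in $paragraph as a whole word.
function containsStopWord(array $stopWords, string $paragraph): bool
{
    foreach ($stopWords as $word) {
        // preg_quote escapes regex metacharacters; '/' is the delimiter used here.
        $regex = '/\b' . preg_quote($word, '/') . '\b/i';
        if (preg_match($regex, $paragraph) === 1) {
            return true; // returning leaves the loop immediately, like break
        }
    }
    return false;
}

$pattern_array = array("preheat", "minutes", "stir", "heat", "put", "beat", "bowl", "pan");

var_dump(containsStopWord($pattern_array, "Coat the chicken in panko crumbs.")); // bool(false)
var_dump(containsStopWord($pattern_array, "Grease the pan."));                   // bool(true)
```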

Edit: seems people are better off using \b, so changed this accordingly

Cassy
it won't match at the end of the subject string.
SilentGhost
or the beginning for that matter
SilentGhost
changed the code to use `\b`, thanks for the comments :-)
Cassy
+1  A: 

You need to add \b (which stands for word boundary) to your regex like this:

'/\b'.$pattern.'\b/i'

You seem to have a typo in your code: the ')' is either meant as a literal closing bracket (in which case the pattern would not match parts of words) or it is an unmatched closing bracket.

SilentGhost
yes sorry that is a typo from a previous code test
bluedaniel
+2  A: 

Use \b(preheat|minutes|stir|heat|put|beat|bowl|pan)\b as your regex. That way, you only need one regex (no looping necessary), and by using the \b word boundary assertions, you make sure that only entire words match.
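
If the words already live in an array, the single regex can be built from it rather than typed out by hand. A rough sketch, assuming the $pattern_array from the question; the sample paragraph is invented, and preg_quote is an extra safeguard that the question's word list does not strictly need:

```php
<?php
$pattern_array = array("preheat", "minutes", "stir", "heat", "put", "beat", "bowl", "pan");

// Escape each word, then join them with '|' into one alternation.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $pattern_array);
$regex = '/\b(' . implode('|', $escaped) . ')\b/i';

// Made-up sample text for illustration.
$paragraph = "Preheat the oven, then toss the panko in a bowl.";

// preg_match stops at the first hit; $matches[1] holds the word that matched.
$stopwords = preg_match($regex, $paragraph, $matches);

var_dump($stopwords);  // int(1)
var_dump($matches[1]); // string(7) "Preheat"
```

Since preg_match returns after the first match, there is no loop to break out of.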

Tim Pietzcker
OK, I've used that approach (the all-in-one regex, not the \b) and I was warned about performance problems if the number of items in the regex becomes too large. How many items would be too many?
bluedaniel
Hard to say. I guess you're stuck with regexes if you want to match word boundaries, and looping over a multitude of regexes is probably slower than having one large regex. You could do some optimizations like `\b(p(?:reheat|ut|an)|st(?:ir|ove)|etc.)\b` so the regex engine can skip a partial match after finding that the first character(s) don't match, but better try it first before optimizing unnecessarily.
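
Before relying on the prefix-grouped form, it is worth confirming that it matches the same text as the flat list. A small sketch using a shortened word list taken from the example pattern above (the "etc." placeholder is left out, and the sample strings are invented); it checks equivalence only and says nothing about speed:

```php
<?php
// Flat and prefix-grouped forms of the same (shortened) word list.
$flat     = '/\b(preheat|put|pan|stir|stove)\b/i';
$factored = '/\b(p(?:reheat|ut|an)|st(?:ir|ove))\b/i';

// Made-up sample strings for illustration.
$samples = array("Put it on the stove", "panko crumbs", "Stir well", "a stovetop pan", "nothing here");

$agree = true;
foreach ($samples as $text) {
    // Both patterns should give the same 0/1 result for every sample.
    if (preg_match($flat, $text) !== preg_match($factored, $text)) {
        $agree = false;
        break;
    }
}
var_dump($agree); // bool(true)
```

Whether the grouped form is actually faster would have to be measured on real paragraphs.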
Tim Pietzcker
Hmm, that's an interesting approach. It's definitely a case of getting the app working as expected and then optimizing the small pieces. I'll give that a try later, and for your follow-up I'll accept your answer. Cheers, Tim.
bluedaniel
A: 

1. You can use "\b" to check for word boundaries. A word boundary is defined as the boundary between a word character and a non-word character. Word characters are letters, digits, and the underscore.

2. You can do it all at one go, by using "|":

$stopwords = preg_match('/\\b(preheat|minutes|stir|heat|..other words..|pan)\\b/i', $paragraph);
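
To make the word-boundary definition in point 1 concrete, here is a small sketch that runs \bpan\b against a few strings. The sample strings are invented for illustration; note that the underscore and digit cases do not match, because those count as word characters:

```php
<?php
$regex = '/\bpan\b/i';

// Made-up sample strings for illustration.
$samples = array(
    "Grease the pan.",   // '.' is a non-word character, so there is a boundary
    "pan-fry the fish",  // '-' is a non-word character
    "PAN",               // start and end of string count as boundaries
    "panko crumbs",      // 'k' is a word character: no boundary after 'pan'
    "pan_fry",           // '_' is a word character
    "pan2",              // digits are word characters
);

$results = array();
foreach ($samples as $text) {
    $results[$text] = preg_match($regex, $text); // 1 on match, 0 otherwise
}
print_r($results);
```

The first three strings give 1 and the last three give 0.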
Edward Loper
OK, I've used that approach (the all-in-one regex, not the \b) and I was warned about performance problems if the number of items in the regex becomes too large. How many items would be too many?
bluedaniel