+1  Q: 

PHP Stop Word List

I'm playing about with stop words in my code. I have an array full of words that I'd like to check, and an array of stop words I want to check them against.

At the moment I'm looping through the array one word at a time and removing the word if in_array finds it in the stop word list, but I wonder if there's a better way of doing it. I've looked at array_diff and the like; however, if I have multiple stop words in the first array, array_diff only appears to remove the first occurrence.

The focus is on speed and memory usage, but speed more so.

Edit -

The first array contains single words taken from blog comments (these are usually quite long); the second array contains single stop words. Sorry for not making that clear.

Thanks

+3  A: 

Using str_replace...

A simple approach is to use str_replace or str_ireplace, which can take an array of 'needles' (things to search for), corresponding replacements, and an array of 'haystacks' (things to operate on).

$haystacks=array(
  "The quick brown fox",
  "jumps over the ",
  "lazy dog"
);

$needles=array(
  "the", "lazy", "quick"
);

$result=str_ireplace($needles, "", $haystacks);

var_dump($result);

This produces

array(3) {
  [0]=>
  string(11) "  brown fox"
  [1]=>
  string(12) "jumps over  "
  [2]=>
  string(4) " dog"
}

As an aside, a quick way to clean up the leading and trailing spaces this leaves is to use array_map to call trim on each element:

$result=array_map("trim", $result);

The drawback of using str_replace is that it will replace matches found within words, rather than just whole words. To address that, we can use regular expressions...

Using preg_replace...

An approach using preg_replace looks very similar to the above, but the needles are regular expressions, and we check for a 'word boundary' at the start and end of each match using \b, so "for" no longer matches inside "fortran" or "fortify":

$haystacks=array(
"For we shall use fortran to",
"fortify the general theme",
"of this torrent of nonsense"
);

$needles=array(
  '/\bfor\b/i', 
  '/\bthe\b/i', 
  '/\bto\b/i', 
  '/\bof\b/i'
);

$result=preg_replace($needles, "", $haystacks);
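
If the stop words already live in a plain array, the patterns can be generated instead of typed by hand. The following is only a sketch: the helper name removeStopWords is hypothetical (it is not part of the answer above), and it assumes case-insensitive matching is wanted. It escapes each word with preg_quote and joins them into a single alternation, so each haystack is scanned once rather than once per stop word.

```php
<?php
// Hypothetical helper: strips whole-word stop words from each string.
function removeStopWords(array $haystacks, array $stopWords): array
{
    if ($stopWords === []) {
        return $haystacks; // nothing to remove
    }
    // Escape every word so regex metacharacters are treated literally.
    $escaped = array_map(function ($w) { return preg_quote($w, '/'); }, $stopWords);
    // One case-insensitive pattern, e.g. /\b(?:for|the|to|of)\b/i
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    $cleaned = preg_replace($pattern, '', $haystacks);
    // Collapse the doubled spaces left behind and trim both ends.
    return array_map(function ($s) {
        return trim(preg_replace('/\s+/', ' ', $s));
    }, $cleaned);
}

$result = removeStopWords(
    array("For we shall use fortran to", "fortify the general theme"),
    array("for", "the", "to", "of")
);
```

Here "fortran" and "fortify" survive because \b only allows a match where the stop word starts and ends at a word boundary.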
Paul Dixon
Thanks Paul, I'll look into that and see if I can get it working, but I've updated the question: the $haystacks array I have is only full of single words, not sentences.
Dom Hodgson
A quick check shows that this does what it says on the tin. However, if I have "for" on my stop word list, it removes it from everything, including words such as fortran and fort.
Dom Hodgson
In that case, you'll need to use preg_replace.
Paul Dixon
A: 

What about using in_array?

http://au.php.net/manual/en/function.in-array.php

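The needle is declared as mixed, so it may be an array, but in that case in_array looks for that whole array as a single element of the haystack; it does not test each word in turn. You still call it once per word.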

bool in_array ( mixed $needle , array $haystack [, bool $strict ] )

Alternatively, you could loop through your stop words one by one and find all the matches for each.
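
A minimal sketch of that loop, using made-up sample data. It relies on array_keys with a search value, which returns every key whose value matches, so all occurrences of a stop word are removed rather than just the first. Note that this rescans the word list once per stop word, so it is unlikely to be the fastest option for long inputs.

```php
<?php
// Sample data for illustration only.
$words = array("the", "quick", "the", "fox", "for");
$stopWords = array("the", "for");

foreach ($stopWords as $stop) {
    // array_keys() with a search value returns every key holding that value.
    foreach (array_keys($words, $stop, true) as $key) {
        unset($words[$key]);
    }
}

// unset() leaves gaps in the keys; re-index from 0.
$words = array_values($words);
```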

Bingy
A: 

If you already have two arrays sorted in the same order that strcmp uses, you can use this algorithm to remove each element from array A that is also in array B (in mathematical terms: A \ B):

for ($i=0, $n=count($a), $j=0, $m=count($b); $i<$n && $j<$m; ) {
    $diff = strcmp($a[$i], $b[$j]);
    if ($diff == 0) {
        unset($a[$i]);
        $i++;
    }
    if ($diff < 0) {
        $i++;
    }
    if ($diff > 0) {
        $j++;
    }
}

This requires only O(n + m) steps, since each iteration advances through either A or B and neither array is rescanned.

Another approach would be to use the words of array B as keys for an index (using array_flip), iterate the values of A and see if they are a key in the index using array_key_exists:

$index = array_flip($b);
foreach ($a as $key => $val) {
    // look the word up in the flipped index, not in $b itself
    if (array_key_exists($val, $index)) {
        unset($a[$key]);
    }
}

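Again, this is O(n + m): building the index takes one pass over B, and each lookup is a hash check that takes constant time on average. That avoids searching B for every value in A, which would be O(n·m).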
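
Wrapped up as a function, the index approach might look like the sketch below. The name filterStopWords and the lower-casing step are assumptions for illustration, not part of the approach described above; isset is used for the lookup, which works here because array_flip never produces null values.

```php
<?php
// Hypothetical helper: returns $words without any entry found in $stopWords.
function filterStopWords(array $words, array $stopWords): array
{
    // Build the hash index once: stop word => its position in $stopWords.
    $index = array_flip(array_map('strtolower', $stopWords));
    $kept = array();
    foreach ($words as $word) {
        // Assumption: matching should ignore case.
        if (!isset($index[strtolower($word)])) {
            $kept[] = $word;
        }
    }
    return $kept; // re-indexed from 0
}

$kept = filterStopWords(
    array("The", "quick", "the", "fortran", "for", "fox"),
    array("the", "for")
);
```

Every occurrence of a stop word is dropped, while words that merely contain one ("fortran") are kept, because whole array values are compared rather than substrings.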

Gumbo