Hi All,

The subject is probably not as clear as it could be, but I was struggling to think of a better way to easily describe it.

I am implementing a badword filter on some articles that we pick up from an XML feed. At the moment I have the badwords in an array and simply check the text like so:

// str_replace() sets $count to the number of replacements it performed
str_replace($badwords, '', $text, $count);
if ($count > 0) // We have bad words...

But this is SLOW! So slow! And when I am trying to process 30,000+ articles at a time, I start wondering if there is a better way to achieve this. If only strpos supported arrays! Even then I don't think it'd be faster...
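For reference, the loop I was imagining would look something like this (an untested sketch; it bails out at the first hit, so only clean articles pay for a full scan of the word list):

function containsBadword($text, $badwords)
{
    // Check each word with the case-insensitive stripos() and stop at the first match
    foreach ($badwords as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}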

I'd love any suggestions. Thanks in advance!

EDIT:

I have now tested a few methods between calls to microtime() to time them:

str_replace() = 990 seconds
preg_match() = 1029 seconds (remember I only need to identify them, not replace them)
no bad word filtering = 1057 seconds (presumably because it has another thousand or so bad-worded articles to process)

Thanks for all the answers, I will just stick with str_replace. :)

+2  A: 

How about combining all the words into a regex and replacing everything in one go? I'm not sure how it will perform, but it might be faster.

E.g.

preg_replace('/(' . implode('|', $badwords) . ')/i', '', $text);
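One thing to watch: if any of the bad words contain regex metacharacters (., *, + and so on), they should go through preg_quote() first so they match literally. And since you only need to detect bad words rather than remove them, preg_match() skips building the replacement string. Roughly (untested):

// Escape each word so regex metacharacters are matched literally
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $badwords);

// Detection only: no replacement string needs to be built
if (preg_match('/(?:' . implode('|', $escaped) . ')/i', $text)) {
    // We have bad words...
}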
Michael
That might work well, thanks Michael, I will give that a go.
Christian
+1  A: 

Define "slow"? Anything that's going to be processing 30,000 articles is probably going to take a bit of time to complete.

That said, one option (which I have not benchmarked, just tossing it out there for consideration) would be to combine the words into a regex and run that through preg_replace (just using the | operator to put them together).

Amber
Just as Michael said. I ran this same script yesterday before implementing the bad words filter, and while I didn't time it, it was a lot faster. I will benchmark the script again with no filter, with the str_replace badwords filter, and with the preg_replace version, and see how much difference it makes.
Christian
Well, I wasn't really referring to the rest of the script (though that's another potential factor, it sounds like it isn't the problem in your case). Mostly it's that you're doing full-text operations with many potential match strings. I'm not actually sure whether regex compilers usually do this, but finding a way to condense the regex into a sort of tree form (e.g. `(a(b|c)|d(e|f))` instead of `(ab|ac|de|df)`) might help the regex parse faster, since it could discard a match earlier. It's quite possible that regex compilers already take care of that for you.
Amber
Yeah, I'm not sure that the tree form is possible, to be honest. Nice idea though.
Christian
You can generate the expression the same way you would generate a *trie*: http://en.wikipedia.org/wiki/Trie
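A rough sketch of that trie-to-regex generation (illustrative and untested; assumes plain ASCII words and the existing $badwords array):

function buildTrie($words)
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node[''] = true; // an empty key marks the end of a complete word
        unset($node);
    }
    return $trie;
}

function trieToRegex($node)
{
    $branches = [];
    foreach ($node as $char => $child) {
        if ($char === '') {
            continue; // end-of-word marker, handled below
        }
        $branches[] = preg_quote((string) $char, '/') . trieToRegex($child);
    }
    if (empty($branches)) {
        return '';
    }
    $group = implode('|', $branches);
    if (isset($node[''])) {
        // A complete word can end here, so any longer suffix is optional
        return '(?:' . $group . ')?';
    }
    return count($branches) > 1 ? '(?:' . $group . ')' : $group;
}

// e.g. ['abs', 'abc', 'ad'] produces 'a(?:b(?:s|c)|d)'
$pattern = '/' . trieToRegex(buildTrie($badwords)) . '/i';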
Amber
+2  A: 

Hi, I used to work at my local newspaper office. Instead of modifying the text to delete badwords from the original files, what I did was just run a filter when a user requested to view the article. This way you preserve the original text should you ever need it, but also dish out a clean version for your viewers. There should be no need to process 30,000 articles at once, unless I am misunderstanding something.
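A minimal sketch of the idea (the storage path is a placeholder; assume the raw XML copies live wherever you keep them):

// The stored article is never modified; a cleaned copy is built on view
function renderArticle($id, $badwords)
{
    $original = file_get_contents("/var/articles/$id.xml");
    return str_replace($badwords, '', $original);
}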

marauder
This has the disadvantage of slowing down page load for the user AND adding server-side load that duplicates work for every page load. On the other hand, depending on traffic patterns this might come out as a net win.
Paul McMillan
We receive XML files of articles daily, so they are processed overnight when there is no other server usage. The articles are already being processed, and we will always have the XML copies should we need them. We can't afford the overhead of running the filter on view, unfortunately. :(
Christian
+1  A: 

In case these previous questions are useful:

karim79
Awesome, thanks for that. I did a few searches but didn't find anything, as I am trying to improve the performance of a filter, not just implement one.
Christian