Hi All,

The subject is probably not as clear as it could be, but I was struggling to think of a better way to easily describe it.

I am implementing a badword filter on some articles that we pick up from an XML feed. At the moment I have the badwords in an array and simply check the text like so:

// str_replace() sets $count to the number of replacements it performed
str_replace($badwords, '', $text, $count);
if ($count > 0) // We have bad words...

But this is SLOW! So slow! And when I am trying to process 30,000+ articles at a time, I start wondering if there is a better way to achieve this. If only strpos supported arrays! Even then I don't think it'd be faster...
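For reference, the loop I was imagining would look something like this (an untested sketch; it bails out at the first hit, so only clean articles pay for a full scan of the word list):

function containsBadword($text, $badwords)
{
    // Check each word with the case-insensitive stripos() and stop at the first match
    foreach ($badwords as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}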

I'd love any suggestions. Thanks in advance!

EDIT:

I have now tested a few methods between calls to microtime() to time them:

str_replace() = 990 seconds
preg_match() = 1029 seconds (remember I only need to identify them, not replace them)
no bad word filtering = 1057 seconds (presumably because it has another thousand or so bad-worded articles to process)

Thanks for all the answers, I will just stick with str_replace. :)

+2  A: 

How about combining all the words into a regex and replacing everything in one go? I'm not sure how it will perform, but it might be faster.

E.g.

preg_replace('/(' . implode('|', $badwords) . ')/i', '', $text);
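One thing to watch: if any of the bad words contain regex metacharacters (., *, + and so on), they should go through preg_quote() first so they match literally. And since you only need to detect bad words rather than remove them, preg_match() skips building the replacement string. Roughly (untested):

// Escape each word so regex metacharacters are matched literally
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $badwords);

// Detection only: no replacement string needs to be built
if (preg_match('/(?:' . implode('|', $escaped) . ')/i', $text)) {
    // We have bad words...
}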
Michael
That might work well, thanks Michael, I will give that a go.
Christian
+1  A: 

Define "slow"? Anything that's going to be processing 30,000 articles is probably going to take a bit of time to complete.

That said, one option (which I have not benchmarked, just tossing it out there for consideration) would be to combine the words into a regex and run that through preg_replace (just using the | operator to put them together).

Amber
Just as Michael said. I ran this same script yesterday before implementing the bad words filter, and while I didn't time it, it was a lot faster. I will benchmark the script again with no filter, with the str_replace badwords filter, and with the preg_replace version, and see how much difference it makes.
Christian
Well, I wasn't really referring to the rest of the script (though that's another potential factor, it sounds like it isn't the problem in your case). Mostly it's that you're doing full-text operations with many potential match strings. I'm not actually sure whether regex compilers usually do this, but finding a way to condense the regex into a sort of tree form (e.g. `(a(b|c)|d(e|f))` instead of `(ab|ac|de|df)`) might help the regex parse faster, since it could discard a match earlier. It's quite possible that regex compilers already take care of that for you.
Amber
Yeah, I'm not sure that the tree form is possible, to be honest. Nice idea though.
Christian
You can generate the expression the same way you would generate a *trie*: http://en.wikipedia.org/wiki/Trie
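A rough sketch of that trie-to-regex generation (illustrative and untested; assumes plain ASCII words and the existing $badwords array):

function buildTrie($words)
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node[''] = true; // an empty key marks the end of a complete word
        unset($node);
    }
    return $trie;
}

function trieToRegex($node)
{
    $branches = [];
    foreach ($node as $char => $child) {
        if ($char === '') {
            continue; // end-of-word marker, handled below
        }
        $branches[] = preg_quote((string) $char, '/') . trieToRegex($child);
    }
    if (empty($branches)) {
        return '';
    }
    $group = implode('|', $branches);
    if (isset($node[''])) {
        // A complete word can end here, so any longer suffix is optional
        return '(?:' . $group . ')?';
    }
    return count($branches) > 1 ? '(?:' . $group . ')' : $group;
}

// e.g. ['abs', 'abc', 'ad'] produces 'a(?:b(?:s|c)|d)'
$pattern = '/' . trieToRegex(buildTrie($badwords)) . '/i';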
Amber
+2  A: 

Hi, I used to work at my local newspaper office. Instead of modifying the text to delete badwords from the original files, what I did was just run a filter when a user requested to view the article. This way you preserve the original text should you ever need it, but also dish out a clean version for your viewers. There should be no need to process 30,000 articles at once, unless I am misunderstanding something.
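A minimal sketch of the idea (the storage path is a placeholder; assume the raw XML copies live wherever you keep them):

// The stored article is never modified; a cleaned copy is built on view
function renderArticle($id, $badwords)
{
    $original = file_get_contents("/var/articles/$id.xml");
    return str_replace($badwords, '', $original);
}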

marauder
This has the disadvantage of slowing down page load for the user AND adding server-side load that duplicates work for every page load. On the other hand, depending on traffic patterns this might come out as a net win.
Paul McMillan
We receive XML files of articles daily, so they are processed overnight when there is no other server usage. The articles are already being processed, and we will always have the XML copies should we need them. We can't afford the overhead of running the filter on view, unfortunately. :(
Christian
+1  A: 

In case these previous questions are useful:

karim79
Awesome, thanks for that. I did a few searches but didn't find anything, as I am trying to improve the performance of a filter, not just implement one.
Christian