tags:

views:

122

answers:

2

I need to remove any instances of 544 full-text stopwords from a user-entered search string, then format it to run a partial match full-text search in boolean mode.

input: "new york city", output: "+york* +city*" ("new" is a stopword).

I have an ugly solution that works: explode the search string into an array of words, look up each word in the array of stopwords, unset them if there is a match, implode the remaining words and finally run a regex to add the boolean mode formatting. There has to be a more elegant solution.

My question has 2 parts.

1) What do you think is the cleanest way to do this?

2) I solved part of the problem using a huge regex but this raised another question.

EDIT: This actually works. I'm embarrassed to say that the memory issue I was having (and believed was my regex) was actually generated later in the code due to the huge number of matches after filtering out stopwords.

$tmp  = preg_replace('/(\b('.implode('|',$stopwords).')\b)+/','',$this->val);
$boolified = preg_replace('/([^\s]+)/','+$1*',$tmp);
+1  A: 

Split a search string in a words array and then

  • do array_diff() with stopwords array
  • or make stopwords a hash and use hash lookups (if isset($stopwords[$word]) then...)
  • or keep stopwords sorted and use binary search for each word

it's hard to say what's going to be faster, you might want to profile each option (and if you do, please share the results!)

stereofrog
I had definitely overlooked array_diff. We haven't had speed issues so I'm more concerned with readability. Thanks for the fast response.
anyaelena
+2  A: 

Build a suffix tree from the 544 words and just walk trough it with the input string letter by letter and jump back to the root of the tree at the beginning of every new word. When you find a match at the end of a word, remove it. This is O(n) over the length of the input strings if the word list reamins static.

x4u
good idea, but in the world of scripting languages, a better algorithm implemented in userland can still be slower than a simple one inside the engine - just because C code is 1000+ times faster.
stereofrog