views:

249

answers:

6

What is the fastest way in PHP to take a keyword list and match it to a search result (like an array of titles) for all words?

For instance, if my keyword phrase is "great leather shoes", then the following titles would be a match...

  • Get Some Really Great Leather Shoes
  • Leather Shoes Are Great
  • Great Day! Those Are Some Cool Leather Shoes!
  • Shoes, Made of Leather, Can Be Great

...while these would not be a match:

  • Leather Shoes on Sale Today!
  • You'll Love These Leather Shoes Greatly
  • Great Shoes Don't Come Cheap

I imagine there's some trick with array functions or a RegEx (Regular Expression) to achieve this rapidly.

+1  A: 

I can't offer you a definitive answer but I'd try benchmarking each solution thats suggested and would start with chaining some in_array's together

if (in_array('great', $list) && in_array('leather', $list) && in_array('shoes', $list)) {

// Do something
}
Danten
+3  A: 

you can preg_grep() your array against something like

 /^(?=.*?\bgreat)(?=.*?\bleather)(?=.*?\shoes)/

or (probably faster) grep each word separately and then array_intersect the results

stereofrog
+2  A: 

It might be a pretty naive solution (quite possibly there are more efficient/elegant solutions), but I'ld probably do something like the following:

$keywords = array(
    'great',
    'leather',
    'shoes'
);

$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great',
    'Leather Shoes on Sale Today!',
    'You\'ll Love These Leather Shoes Greatly',
    'Great Shoes Don\'t Come Cheap'
);

$matches = array();
foreach( $titles as $title )
{
  $wordsInTitle = preg_split( '~\b(\W+\b)?~', $title, null, PREG_SPLIT_NO_EMPTY );
  if( array_uintersect( $keywords, $wordsInTitle, 'strcasecmp' ) == $keywords )
  {
    // we have a match
    $matches[] = $title;
  }
}

var_dump( $matches );

No idea how this benchmarks though.

fireeyedboy
+3  A: 

I would use an index for the words in the titles and test if every search term is in that index:

$terms = explode(' ', 'great leather shoes');
$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great'
);
foreach ($titles as $title) {
    // extract words in lowercase and use them as key for the word index
    $wordIndex = array_flip(preg_split('/\P{L}+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY));
    // look up if every search term is in the index
    foreach ($terms as $term) {
        if (!isset($wordIndex[$term])) {
            // if one is missing, continue with the outer foreach
            continue 2;
        }
    }
    // echo matched title
    echo "match: $title";
}
Gumbo
+1 for the unicode support.
fireeyedboy
+1  A: 

You could use

/(?=.*?\great\b)(?=.*?\bshoes\b)(?=.*?\bleather\b)/

Note a couple of things

a)You need word boundaries at both ends else you could end up matching words that contain the ones you are looking for eg "shoes of leather bring greatness".

b)I use lazy wildcard match (i.e .*?). This improves effeciency, as by default * is greedy (i.e. it consumes as many characters as it can match, and only gives them up in favor of a overall match). So if we don't have the trailing ?, .* will match everything in the line and then backtrack to match 'great'. Same procedure is then repeated for 'shoes' and 'leather'. By making * lazy, we avoid these unnecessary backtracks.

Jasmeet
Jasmeet, see my comment on a RegExp very close to yours, which was Alan Moore's. See my comment beginning with "Works on...". Do you have an idea what the problem might be?
Volomike
@Volomike, I am not quite sure, especially since I can't even get Alan Moore's regex to compile on Perl. I get a error about nested quantifiers ( a quantifier like *,+ .. being enclosed in another quatifier), which is there to protect against massive backtracking. I know that Alan is using possesive quantifiers, which enables the regex to avoid extra backtracks. But perl still does not like it, and given that both Perl and PHP use NFA based regex engines , I suspect you might be running into a similar problem.
Jasmeet
+1  A: 

I don't know about the absolute fastest way, but this is probably the fastest way to do it with a regex:

'#(?:\b(?>great\b()|leather\b()|shoes\b()|\w++\b)\W*+)++\1\2\3#i'

This matches every word in the string, and if the word happens to be one of your keywords, the empty capturing group "checks it off". Once all the words in the string have been matched, the back-references (\1\2\3) ensure that each of the three keywords has been seen at least once.

The lookahead-based approach that's usually recommended for this kind of task needs to scan potentially the whole string multiple times--once for each keyword. This regex only has to scan the string once--in fact, backtracking is disabled by the possessive quantifiers (++, *+) and atomic groups ((?>...)).

That said, I would still go with the lookahead approach unless I knew it it was causing a bottleneck. In most cases, its greater readability is worth the trade-off in performance.

Alan Moore
Wow, this is extremely impressive! I will take your advice, though, and go with the more readable one so that future programmers don't get upset.
Volomike
Works on several 1 to 3 word keyword phrases. But when I had a $KP of "radio night", a $RegExp of '#(?:\b(?>radio\b()|night\b()|\w++\b)\W*+)++\1\2\3#i', and a $Title of 'history of the media radio and television', I received the error "Compilation failed: reference to non-existent subpattern at offset 48". I can fix with a try/catch block, but probably should fix the RegExp bug first, right?
Volomike
You only have two capturing groups in that regex, so you need to get rid of the `\3`.
Alan Moore
how would you do this regex expression dynamically. Let's say I want it to run through a database of strings, setting the number of words in the string dynamically?
pfunc
@pfunc: I would just create an alternative for each word like this: `$word . '\b()|'`, plug them into the above structure, and add the requisite number of back-references. Make sure every word starts and ends with a word character (`[A-Za-z0-9_]`) so the `\b` boundaries work right. But if you're using procedural code anyway--to generate the regex--you should consider coding the solution directly. It would be more maintainable and more scalable that way.
Alan Moore