Using PHP, I'm trying to improve the search on my site by supporting Google-like operators, e.g.:

  • keyword = natural/default
  • "keyword" or "search phrase" = exact match
  • keyword* = partial match

For this to work I need to split the string into two arrays: one for the exact words (without the double quotes) in $Array1, and one for everything else (natural and partial keywords) in $Array2.

What regular expressions would achieve this for the following string?


Example string ($string)

today i'm "trying" out a* "google search" "test"

Desired result

$Array1 = array(
  [0]=>trying
  [1]=>google search
  [2]=>test
);

$Array2 = array(
  [0]=>today
  [1]=>i'm
  [2]=>out
  [3]=>a*
);

1) Exact: I've tried the following regexp for the exact matches, but it returns two arrays, one with the double quotes and one without. I could just use $result[1], but there may be a trick that I'm missing here.

preg_match_all(
    '/"([^"]+)"/iu', 
    'today i\'m "trying" \'out\' a* "google search" "test"', 
    $result
);

2) Natural/Partial: The following rule returns the correct keywords, but along with several blank values. Is this regexp sloppy, or should I just run the array through array_filter()?

preg_split(
    '/"([^"]+)"|(\s)/iu', 
    'today i\'m "trying" \'out\' a* "google search" "test"'
);
+2  A: 

You can use strtok to tokenize the string.

See, for example, this tokenizeQuoted function, derived from a tokenizedQuoted function posted in the comments on the strtok manual page:

// Split a string into space-delimited tokens, treating double-quoted and single-quoted phrases as single tokens.
// Returns array(quotedPhrases, otherTokens).
function tokenizeQuoted($string, $quotationMarks='"\'') {
    $tokens = array(array(), array());
    for ($nextToken=strtok($string, ' '); $nextToken!==false; $nextToken=strtok(' ')) {
        $quote = $nextToken[0];
        if (strpos($quotationMarks, $quote) !== false) {
            if (strlen($nextToken) > 1 && substr($nextToken, -1) === $quote) {
                // the quoted phrase is a single word, e.g. "trying"
                $tokens[0][] = substr($nextToken, 1, -1);
            } else {
                // the phrase contains spaces: read on up to the closing quotation mark
                $tokens[0][] = substr($nextToken, 1) . ' ' . strtok($quote);
            }
        } else {
            // unquoted token (natural or partial keyword)
            $tokens[1][] = $nextToken;
        }
    }
    return $tokens;
}

Here’s an example of use:

$string = 'today i\'m "trying" out a* "google search" "test"';
var_dump(tokenizeQuoted($string));

The output:

array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(6) "trying"
    [1]=>
    string(13) "google search"
    [2]=>
    string(4) "test"
  }
  [1]=>
  array(4) {
    [0]=>
    string(5) "today"
    [1]=>
    string(3) "i'm"
    [2]=>
    string(3) "out"
    [3]=>
    string(2) "a*"
  }
}
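
If you would rather stay with regular expressions, as in the question, the two attempts there can probably be combined into something like the sketch below. The function name splitSearchTerms is made up for this illustration, and it only handles double quotes. For the exact phrases it simply takes the capture-group array ($result[1] in the question), and for the rest it passes the PREG_SPLIT_NO_EMPTY flag to preg_split so the blank values are dropped without needing array_filter().

```php
<?php
// Hypothetical helper (name invented for this sketch): split a search string
// into double-quoted exact phrases and all remaining keywords.
function splitSearchTerms($string) {
    // Index 1 of the result holds the text captured between the double quotes.
    preg_match_all('/"([^"]+)"/u', $string, $matches);
    $exact = $matches[1];

    // Split on quoted phrases or runs of whitespace;
    // PREG_SPLIT_NO_EMPTY discards the empty pieces between delimiters.
    $other = preg_split('/"[^"]*"|\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

    return array($exact, $other);
}

list($exact, $other) = splitSearchTerms('today i\'m "trying" out a* "google search" "test"');
print_r($exact);
print_r($other);
```

Note that a stray, unbalanced double quote is not treated as a phrase here and ends up among the ordinary keywords, so you may want to strip it separately.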
Gumbo
Gumbo, thank you! This works great for me. I wasn't aware of strtok() and it's a great solution.
Adam
This helped me out a bunch too. +1
Dutchie432