I have the following code:

$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'".'"another keyword" yes,one,two';

preg_match_all('/"[^"]+"|[^"\' ,]+|\'[^\']+\'/',$str,$matches);

echo "<pre>"; print_r($matches); echo "</pre>";

I want it to extract keywords from a string and keep those wrapped in single or double quotes together. The code above works, but it returns the values with the quotes still in them. I know I can remove these via str_replace or similar, but I'm really looking for a way to solve this within the preg_match_all call itself.

Output:

Array
(
    [0] => Array
        (
            [0] => keyword
            [1] => keyword
            [2] => 'keyword 1 and keyword 2'
            [3] => another
            [4] => 'one more'
            [5] => "another keyword"
            [6] => yes
            [7] => one
            [8] => two
        )

)

Also, I think my regex is a little bit sloppy, so any suggestions for a better one would be good :)

Any suggestions / help would be greatly appreciated.

A: 

Take a look at the tokenizeQuoted function in the comments on the strtok function.

Edit: You need to modify the function, because the original only works with double quotes:

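// Split on spaces, keeping single- or double-quoted phrases together.
function tokenizeQuoted($string)
{
    for ($tokens=array(), $nextToken=strtok($string, ' '); $nextToken!==false; $nextToken=strtok(' ')) {
        // Use [] for string offsets; the {} syntax was removed in PHP 8.
        $firstChar = $nextToken[0];
        if ($firstChar === '"' || $firstChar === "'") {
            // Closed within this token: strip both quotes.
            // Otherwise keep reading up to the matching closing quote.
            $nextToken = $nextToken[strlen($nextToken)-1] === $firstChar
                ? substr($nextToken, 1, -1)
                : substr($nextToken, 1) . ' ' . strtok($firstChar);
        }
        $tokens[] = $nextToken;
    }
    return $tokens;
}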


Edit: Maybe you should just write your own parser:

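$tokens = array();
$buffer = '';
$quote = null;   // the quote character currently open, or null outside quotes
$len = strlen($str);
for ($i=0; $i<$len; $i++) {
    // Use [] for string offsets; the {} syntax was removed in PHP 8.
    $char = $str[$i];
    if ($char === '"' || $char === "'") {
        if ($quote === null) {
            // Opening quote: flush any pending unquoted token first.
            if ($buffer !== '') {
                $tokens[] = $buffer;
                $buffer = '';
            }
            $quote = $char;
            continue;
        }
        if ($quote === $char) {
            // Matching closing quote: the buffer is one complete token.
            $tokens[] = $buffer;
            $buffer = '';
            $quote = null;
            continue;
        }
    } else if ($char === ',' || $char === ' ') {
        // Commas and spaces separate tokens only outside quotes.
        if ($quote === null) {
            if ($buffer !== '') {
                $tokens[] = $buffer;
                $buffer = '';
            }
            continue;
        }
    }
    $buffer .= $char;
}
// Flush whatever is left after the last separator.
if ($buffer !== '') {
    $tokens[] = $buffer;
}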
Gumbo
Not quite what I'm looking for, as I would like it done with preg_match_all, but thank you. (Also the function doesn't work with single quotes)
Jamie Bicknell
But again, it doesn't take into account commas like my regex, only spaces. I am convinced that the best way would be to use preg_match_all, but if it can't be done, then I'll settle for a substitute.
Jamie Bicknell
A: 
preg_match_all('/"([^"]+)"|[^"\' ,]+|\'([^\']+)\'/',$str,$matches);

and use $matches[1] and $matches[2].

chaos
It would need to be: preg_match_all('/"([^"]+)"|([^"\' ,]+)|\'([^\']+)\'/',$str,$matches); and then use $matches[1], $matches[2], and $matches[3], which again would require more manipulation after the preg_match_all call. It would be easier to array_map a str_replace function than to merge the populated entries of those arrays into one array.
Jamie Bicknell
How would you suggest that you join up the different arrays of results?
Jamie Bicknell
There's no native collating array merge function, so I'd write one, I suppose. I don't fully understand what your output requirements are so it's hard to say what's most appropriate.
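
One way such a collating helper could look, as a rough sketch. The name collateGroups is hypothetical, and the code assumes every capture group in the pattern matches at least one character, so an empty string always means that group did not take part in the match. It relies on the PREG_SET_ORDER flag, which groups the results per match rather than per capture group.

```php
<?php
// Hypothetical helper: collate the capture groups of every match into one flat list.
// Assumes each capture group, when it participates, captures a non-empty string.
function collateGroups(string $pattern, string $subject): array
{
    // PREG_SET_ORDER returns one array per match: [full match, group 1, group 2, ...].
    preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER);

    $tokens = array();
    foreach ($sets as $set) {
        // Skip index 0 (the full match) and keep the first non-empty group.
        foreach (array_slice($set, 1) as $group) {
            if ($group !== '') {
                $tokens[] = $group;
                break;
            }
        }
    }
    return $tokens;
}

// The string and the three-group pattern from this thread.
$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'" . '"another keyword" yes,one,two';
$tokens = collateGroups('/"([^"]+)"|([^"\' ,]+)|\'([^\']+)\'/', $str);
print_r($tokens);
```

The tokens come back in the order they appear in the string, with the surrounding quotes already dropped because only the captured part of each match is kept.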
chaos
A: 

This requires a simple helper function to get what you want, but it works:

preg_match_all('/"([^"]+)"|([^"\' ,]+)|\'([^\']+)\'/',$str,$matches);
function r($str) {
    return str_replace(array('\'','"'), array(''), $str);
}
$a = array_map('r', $matches[0]);
print_r($a);
Galen
Thank you, I have already looked into this, but it creates unnecessary workload. Thank you for your input though, Galen.
Jamie Bicknell
A: 

You've almost got it; you just need to use lookarounds to match the quotes:

'/(?<=\')[^\'\s][^\']*+(?=\')|(?<=")[^"\s][^"]*+(?=")|[^\'",\s]+/'
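
As a quick sketch of how this behaves, the pattern can be run against the string from the question. The lookbehind `(?<=...)` and lookahead `(?=...)` only assert that a quote sits on either side of the match without consuming it, so the quotes never end up in the result. The variable names below are for illustration only.

```php
<?php
// The test string from the question.
$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'" . '"another keyword" yes,one,two';

// Branch 1: text preceded and followed by a single quote.
// Branch 2: text preceded and followed by a double quote.
// Branch 3: a bare word containing no quotes, commas, or whitespace.
$pattern = '/(?<=\')[^\'\s][^\']*+(?=\')|(?<=")[^"\s][^"]*+(?=")|[^\'",\s]+/';

preg_match_all($pattern, $str, $matches);

// Every token is in $matches[0]; no capture groups are needed.
print_r($matches[0]);
```

The `[^\'\s]` (or `[^"\s]`) at the start of each quoted branch is what stops the space after a closing quote from being read as the start of a new quoted phrase.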
Alan Moore
Superb!!!!! This is exactly what I needed! Thank you so much Alan M. Have just been trying to understand the regex you've used, and it's beginning to make sense. To be honest, I've never come across the = before. Thanks again, really appreciate it.
Jamie Bicknell
You might want to read this: http://www.regular-expressions.info/lookaround.html That whole site is excellent.
Alan Moore