I need to split a string on spaces, but phrases in quotes should be kept intact. Example:

  word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5

After preg_split, this should result in the following array:

array(
 [0] => 'word1',
 [1] => 'word2',
 [2] => 'this is a phrase',
 [3] => 'word3',
 [4] => 'word4',
 [5] => 'this is a second phrase',
 [6] => 'word5'
)

How should I compose my regexp to do that?

PS. There is a related question, but I don't think it works in my case: the accepted answer provides a regexp that finds the words rather than the whitespace.

A: 

Assuming your quotes are well formed, i.e. in pairs, you can explode on the quote character and loop over every second field, e.g.:

$str = "word1 word2 \"this is a phrase\" word3 word4 \"this is a second phrase\" word5 word6 \"lastword\"";
print $str ."\n";
$s = explode('"',$str);
for($i=1;$i<count($s);$i+=2){
    if ( strpos($s[$i] ," ")!==FALSE) {
        print "Spaces found: $s[$i]\n";
    }
}

output

$ php test.php
Spaces found: this is a phrase
Spaces found: this is a second phrase

No complicated regexp required.
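
The loop above only reports the quoted phrases. As a rough sketch of how the same explode idea could build the array the question asks for (the function name `split_keep_quoted` is made up for illustration, and balanced double quotes are assumed):

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes double quotes are balanced (always in pairs).
function split_keep_quoted($str)
{
    $result = array();
    // explode() keeps empty pieces, so odd indexes are always inside quotes.
    foreach (explode('"', $str) as $i => $piece) {
        if ($i % 2 === 1) {
            // Inside quotes: keep the whole phrase as one element.
            $result[] = $piece;
        } else {
            // Outside quotes: split on single spaces and drop empty pieces.
            $words = array_filter(explode(' ', $piece), 'strlen');
            // array_merge() renumbers the numeric keys left by array_filter().
            $result = array_merge($result, $words);
        }
    }
    return $result;
}

$str = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
print_r(split_keep_quoted($str));
```

An unbalanced quote would make everything after it count as "inside quotes", so this only holds under the assumption stated above.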

ghostdog74
Sure, I could do this without a regexp, but that doesn't fit my case.
altern
A: 

Using the regex from the other question you linked, this is rather easy:

<?php

$string = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

preg_match_all( '/(\w+|"[\w\s]*")+/' , $string , $matches );

print_r( $matches[1] );

?>

output:

Array
(
     [0] => word1
     [1] => word2
     [2] => "this is a phrase"
     [3] => word3
     [4] => word4
     [5] => "this is a second phrase"
     [6] => word5
)
edds
What about special characters (an ampersand, for example), which should also be found? The ampersand isn't the only character that will go unhandled. Moreover, different symbols should be handled differently. For example, if braces are encountered, I need to include them in the search results.
altern
@altern, well, I'm sure `edds` won't mind if you adjust his example to your needs...
Bart Kiers
A: 

Anybody want to benchmark tokenizing vs. regex? My guess is the explode() function is a little too hefty for any speed benefit. Nonetheless, here's another method:

(edited because I forgot the else case for storing the quoted string)

$str = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

// initialize storage array
$arr = array();
// initialize count
$count = 0;
// split on quote
$tok = strtok($str, '"');
while ($tok !== false) {
    // even-numbered tokens are outside quotes, odd-numbered ones are inside
    $arr = ($count % 2 == 0) ? 
                               array_merge($arr, explode(' ', trim($tok))) :
                               array_merge($arr, array(trim($tok)));
    $tok = strtok('"');
    ++$count;
}

// output results
var_dump($arr);
cballou
+2  A: 

With the help of user MizardX from the #regex IRC channel (irc.freenode.net), a solution was found. It even supports single quotes.

$str= 'word1 word2 \'this is a phrase\' word3 word4 "this is a second phrase" word5 word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

$regexp = '/\G(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)*\K\s+/';

$arr = preg_split($regexp, $str);

print_r($arr);

Result is:

Array (
    [0] => word1
    [1] => word2
    [2] => 'this is a phrase'
    [3] => word3
    [4] => word4
    [5] => "this is a second phrase"
    [6] => word5
    [7] => word1
    [8] => word2
    [9] => "this is a phrase"
    [10] => word3
    [11] => word4
    [12] => "this is a second phrase"
    [13] => word5  
)

PS. The only disadvantage is that this regexp works only with PCRE 7.

It turned out that I do not have PCRE 7 support on the production server; only PCRE 6 is installed there. A regexp that works in this case (with \G and \K removed) is below, though it is not as flexible as the previous one for PCRE 7.

/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/

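For the given input the result is the same as above, provided the pattern is used with preg_match_all (it matches the tokens themselves, not the whitespace between them).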
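
A minimal sketch of how the PCRE 6 pattern might be applied, assuming it is fed to preg_match_all and the full matches are read from `$matches[0]` (the shortened input string and the `$tokens` variable are only for illustration):

```php
<?php
// Shortened sample input mixing single and double quotes.
$str = 'word1 \'this is a phrase\' word3 "this is a second phrase" word5';

// Token-matching pattern from above (no \G or \K needed).
$regexp = '/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/';

// preg_match_all() collects the tokens; $matches[0] holds the full matches.
preg_match_all($regexp, $str, $matches);
$tokens = $matches[0];

// The quote characters stay attached to the phrases, as in the result above.
print_r($tokens);
```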

altern
Update - nice edit :-)
richsage
What do \G and \K stand for?
Amarghosh
`\G` anchors the match to the place where the previous match ended (roughly speaking), or to the beginning of the input if there was no previous match. `\K` I had to look up: it means "pretend the match really started here"; although the regex matches a token and the whitespace following it, it acts like it only matched the whitespace. Sort of a poor man's lookbehind, only it seems like it would be superior to lookbehind in most cases. Why isn't that feature more common, I wonder? http://www.pcre.org/pcre.txt
Alan Moore
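To make the two escapes a bit more concrete, here is a small sketch on toy inputs (the strings and variable names are invented for illustration; `\K` needs a PCRE build that supports it, as noted in the answer):

```php
<?php
// \K: "foo" has to be matched, but only "bar" counts as the match,
// so only "bar" gets replaced.
$kept = preg_replace('/foo\Kbar/', 'X', 'foobar');

// \G: each match must start exactly where the previous one ended,
// so matching stops at the first non-digit and "45" is never reached.
preg_match_all('/\G\d/', '123a45', $m);
$digits = $m[0];

var_dump($kept);   // string "fooX"
var_dump($digits); // the digits "1", "2", "3"
```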
Thanks Alan. Couldn't find either in regex.info... and it's so hard to google for regex.
Amarghosh
A: 
$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
preg_match_all( '/([^"\s]+)|("([^"]+)")/', $test, $matches);
Amarghosh
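
The snippet above leaves the results spread over several capture groups. One possible way to flatten them into the array from the question, sketched under the assumption that PREG_SET_ORDER is used and that group 3 (the phrase without its quotes) is preferred when present; the function name `tokens_without_quotes` is hypothetical:

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function tokens_without_quotes($input)
{
    // Group 1: bare word. Group 2: quoted phrase with quotes. Group 3: phrase only.
    preg_match_all('/([^"\s]+)|("([^"]+)")/', $input, $matches, PREG_SET_ORDER);

    $result = array();
    foreach ($matches as $m) {
        // For a bare word the trailing groups are absent, so fall back to group 1.
        $result[] = (isset($m[3]) && $m[3] !== '') ? $m[3] : $m[1];
    }
    return $result;
}

$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
print_r(tokens_without_quotes($test));
```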