ansaurus

Question

How could I find all whitespace characters, excluding those between quotes?

Answer 1

A:

Assuming your quotes are well formed, i.e., they come in pairs, you can explode on the quote character and loop over every second field. For example:

$str = "word1 word2 \"this is a phrase\" word3 word4 \"this is a second phrase\" word5 word6 \"lastword\"";
print $str ."\n";
// split on the quote character; odd-indexed pieces are the quoted segments
$s = explode('"',$str);
for($i=1;$i<count($s);$i+=2){
    if ( strpos($s[$i] ," ")!==FALSE) {
        print "Spaces found: $s[$i]\n";
    }
}

output

$ php test.php
Spaces found: this is a phrase
Spaces found: this is a second phrase

No complicated regexp required.
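The loop above reports the spaces inside the quoted pieces. For the question as asked (whitespace outside quotes), the same explode idea can be turned around by looking at the even-indexed pieces instead. A minimal sketch, assuming balanced double quotes, no escaped quotes, and PHP 7+; the function name unquotedWhitespaceOffsets is made up for illustration:

```php
<?php
// Hypothetical helper (not part of the original answer): returns the byte
// offsets of whitespace characters that lie outside double-quoted segments.
// Assumes quotes are balanced and never escaped.
function unquotedWhitespaceOffsets(string $str): array
{
    $offsets = [];
    $pos = 0; // offset of the current piece within $str
    foreach (explode('"', $str) as $i => $segment) {
        // Even-indexed pieces are outside quotes, odd-indexed ones inside.
        if ($i % 2 === 0) {
            $len = strlen($segment);
            for ($j = 0; $j < $len; $j++) {
                if (ctype_space($segment[$j])) {
                    $offsets[] = $pos + $j;
                }
            }
        }
        // Advance past this piece and the quote character that followed it.
        $pos += strlen($segment) + 1;
    }
    return $offsets;
}

print_r(unquotedWhitespaceOffsets('a "b c" d'));
```

For the input 'a "b c" d' this should report offsets 1 and 7, skipping the space at offset 4 inside the quotes.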

ghostdog74 2009-11-12 12:57:41

Sure, I could do this without a regexp, but that does not fit my case.

altern 2009-11-12 13:01:47

Answer 2

A:

Using the regex from the other question you linked, this is rather easy:

<?php

$string = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

preg_match_all( '/(\w+|"[\w\s]*")+/' , $string , $matches );

print_r( $matches[1] );

?>

output:

Array
(
     [0] => word1
     [1] => word2
     [2] => "this is a phrase"
     [3] => word3
     [4] => word4
     [5] => "this is a second phrase"
     [6] => word5
)

edds 2009-11-12 13:02:05

What about special characters (an ampersand, for example), which should also be found? And the ampersand is not the only character left unhandled. Moreover, different symbols should be handled differently. For example, if braces are encountered, I need to include them in the search results.

altern 2009-11-12 13:09:38

@altern, well, I'm sure `edds` won't mind if you adjust his example to your needs...

Bart Kiers 2009-11-12 13:16:00

Answer 3

A:

Anybody want to benchmark tokenizing vs. regex? My guess is the explode() function is a little too hefty for any speed benefit. Nonetheless, here's another method:

(edited because I forgot the else case for storing the quoted string)

$str = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

// initialize storage array
$arr = array();
// initialize count
$count = 0;
// split on quote
$tok = strtok($str, '"');
while ($tok !== false) {
    // even-numbered tokens lie outside quotes: split them on spaces;
    // odd-numbered tokens are quoted phrases: keep them whole
    $arr = ($count % 2 == 0)
        ? array_merge($arr, explode(' ', trim($tok)))
        : array_merge($arr, array(trim($tok)));
    $tok = strtok('"');
    ++$count;
}

// output results
var_dump($arr);

cballou 2009-11-12 13:03:39

Answer 4

+2 A:

With the help of user MizardX from the #regex IRC channel (irc.freenode.net), a solution was found. It even supports single quotes.

$str= 'word1 word2 \'this is a phrase\' word3 word4 "this is a second phrase" word5 word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';

$regexp = '/\G(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)*\K\s+/';

$arr = preg_split($regexp, $str);

print_r($arr);

Result is:

Array (
    [0] => word1
    [1] => word2
    [2] => 'this is a phrase'
    [3] => word3
    [4] => word4
    [5] => "this is a second phrase"
    [6] => word5
    [7] => word1
    [8] => word2
    [9] => "this is a phrase"
    [10] => word3
    [11] => word4
    [12] => "this is a second phrase"
    [13] => word5  
)

PS. The only disadvantage is that this regexp works only with PCRE 7.

It turned out that I do not have PCRE 7 support on the production server; only PCRE 6 is installed there. A regexp that works in this case (it gets rid of \G and \K) is below, though it is not as flexible as the previous one for PCRE 7. Since it matches the tokens rather than the whitespace between them, it goes with preg_match_all() instead of preg_split().

/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/

For the given input, the result is the same as above.
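A minimal sketch of how the PCRE 6 pattern could be applied, using a shortened sample string of my own rather than the one above. The tokens end up in $matches[0]:

```php
<?php
// Shortened sample input (an assumption for illustration, not the original string).
$str = 'word1 \'a b\' "c d" word2';

// Token-matching pattern from the answer; it needs neither \G nor \K.
$regexp = '/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/';

// preg_match_all() collects the tokens themselves;
// $matches[0] holds the full text of each match.
preg_match_all($regexp, $str, $matches);
$tokens = $matches[0];

print_r($tokens);
```

Because of the trailing + on the group, a quoted part glued to a bare word (for example foo"bar baz") comes back as a single token.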

altern 2009-11-12 13:04:29

Update - nice edit :-)

richsage 2009-11-12 13:12:22

What do \G and \K stand for?

Amarghosh 2009-11-12 13:38:05

`\G` anchors the match to the place where the previous match ended (roughly speaking), or to the beginning of the input if there was no previous match. `\K` I had to look up: it means "pretend the match really started here"; although the regex matches a token and the whitespace following it, it acts like it only matched the whitespace. Sort of a poor man's lookbehind, only it seems like it would be superior to lookbehind in most cases. Why isn't that feature more common, I wonder? http://www.pcre.org/pcre.txt
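A small sketch of both escapes on toy inputs (the strings and variable names are made up for illustration, and \K assumes a PCRE version that supports it):

```php
<?php
// \K: "foo" must be present, but only "bar" is reported as the match.
preg_match('/foo\Kbar/', 'foobar', $m);
$kept = $m[0];

// Because "foo" is not part of the reported match, a replacement leaves it alone.
$replaced = preg_replace('/foo\Kbar/', 'X', 'foobar');

// \G: each match must start exactly where the previous one ended,
// so matching stops at the first non-digit instead of skipping over it.
preg_match_all('/\G\d/', '12a34', $all);
$digits = $all[0];

var_dump($kept, $replaced, $digits);
```

Here $kept should be 'bar', $replaced should be 'fooX', and $digits should hold only '1' and '2', since the 'a' breaks the chain that \G requires.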

Alan Moore 2009-11-12 15:15:08

Thanks Alan. Couldn't find either of them on regex.info... and it's so hard to google for regex.

Amarghosh 2009-11-12 16:41:16

Answer 5

A:

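$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
preg_match_all( '/([^"\s]+)|("([^"]+)")/', $test, $matches);
// $matches[0] holds the full matches: bare words and quoted phrases (quotes included)
print_r( $matches[0] );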

Amarghosh 2009-11-12 13:11:16
