How can I explode the following string:

Lorem ipsum "dolor sit amet" consectetur "adipiscing elit" dolor

into

array("Lorem", "ipsum", "dolor sit amet", "consectetur", "adipiscing elit", "dolor")

so that the text in quotation marks is treated as a single word?

Here's what I have for now:

$mytext = "Lorem ipsum %22dolor sit amet%22 consectetur %22adipiscing elit%22 dolor";
$noquotes = str_replace("%22", "", $mytext);
$newarray = explode(" ", $noquotes);

but my code splits every word into its own array element. How do I make the words inside quotation marks be treated as one word?

A: 

Descriptive answer:

  1. Match all spaces between quotes and replace them with an underscore (a job for a regular expression, e.g. preg_replace_callback)
  2. Use explode(" ", $result_from_step1)
  3. Now replace the underscores with spaces again

The rest I believe you can figure out easily :)
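
The three steps above could be sketched roughly like this. It is only a sketch under a few assumptions: the function name split_keeping_quotes is made up for illustration, the quotes are plain double quotes with no escaping, and instead of an underscore it uses the control character "\x1F" as the placeholder, on the assumption that this character never occurs in the input.

```php
<?php
// Hypothetical helper; the name and the "\x1F" placeholder are assumptions.
function split_keeping_quotes(string $text, string $placeholder = "\x1F"): array
{
    // Step 1: inside each "..." run, swap spaces for the placeholder
    // and drop the surrounding quotes.
    $masked = preg_replace_callback(
        '/"([^"]*)"/',
        function (array $m) use ($placeholder): string {
            return str_replace(' ', $placeholder, $m[1]);
        },
        $text
    );

    // Step 2: split on single spaces.
    $parts = explode(' ', $masked);

    // Step 3: turn the placeholder back into spaces.
    return array_map(
        function (string $part) use ($placeholder): string {
            return str_replace($placeholder, ' ', $part);
        },
        $parts
    );
}

print_r(split_keeping_quotes('Lorem ipsum "dolor sit amet" consectetur "adipiscing elit" dolor'));
```

Whatever placeholder is chosen, it must not appear in the original text, otherwise step 3 turns it into a space as well.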

Regards,

andreas
How do I select the text or spaces between the quotes?
timofey
I would recommend using a regular expression for that.
andreas
This would lose data if your input string included underscores (since they would end up as spaces).
rmeador
+8  A: 

You could use preg_match_all(...):

$text = 'Lorem ipsum "dolor sit amet" consectetur "adipiscing \\"elit" dolor';
preg_match_all('/"(?:\\\\.|[^\\\\"])*"|\S+/', $text, $matches);
print_r($matches);

which will produce:

Array
(
    [0] => Array
        (
            [0] => Lorem
            [1] => ipsum
            [2] => "dolor sit amet"
            [3] => consectetur
            [4] => "adipiscing \"elit"
            [5] => dolor
        )

)

And as you can see, it also accounts for escaped quotes inside quoted strings.
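
Note that the matches still carry their surrounding quotes, while the question asks for the bare words. One possible way to get those, sketched under assumptions: the function name tokenize_quoted is hypothetical, and the pattern is the same one as above with two capture groups added so that quoted runs and bare words can be told apart.

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the answer.
function tokenize_quoted(string $text): array
{
    // Group 1 = inside of a quoted run, group 2 = a bare word.
    preg_match_all('/"((?:\\\\.|[^\\\\"])*)"|(\S+)/', $text, $sets, PREG_SET_ORDER);

    $tokens = [];
    foreach ($sets as $m) {
        if (isset($m[2])) {
            // Bare word: keep it unchanged.
            $tokens[] = $m[2];
        } else {
            // Quoted run: turn \" into " and \\ into \
            $tokens[] = preg_replace('/\\\\(.)/', '$1', $m[1]);
        }
    }
    return $tokens;
}

print_r(tokenize_quoted('Lorem ipsum "dolor sit amet" consectetur "adipiscing \\"elit" dolor'));
```

With PREG_SET_ORDER, group 2 is only present in a match set when the bare-word alternative matched, which is what the isset() check relies on.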

EDIT

A short explanation:

"           # match the character '"'
(?:         # start non-capture group 1 
  \\        #   match the character '\'
  .         #   match any character except line breaks
  |         #   OR
  [^\\"]    #   match any character except '\' and '"'
)*          # end non-capture group 1 and repeat it zero or more times
"           # match the character '"'
|           # OR
\S+         # match a non-whitespace character: [^\s] and repeat it one or more times

And in case of matching %22 instead of double quotes, you'd do:

preg_match_all('/%22(?:\\\\.|(?!%22).)*%22|\S+/', $text, $matches);
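
For completeness, a small sketch of that %22 pattern applied to the string from the question. Stripping the delimiters afterwards with str_replace is an assumption borrowed from the question's own code, not something the pattern does by itself.

```php
<?php
$text = 'Lorem ipsum %22dolor sit amet%22 consectetur %22adipiscing elit%22 dolor';

// A %22...%22 run, or a run of non-whitespace characters.
preg_match_all('/%22(?:\\\\.|(?!%22).)*%22|\S+/', $text, $matches);

// Drop the %22 delimiters from every token.
// Caveat: this also removes a stray %22 inside an unquoted token.
$words = str_replace('%22', '', $matches[0]);

print_r($words);
```

Here $words holds the six entries asked for in the question, with "dolor sit amet" and "adipiscing elit" each kept as one element.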
Bart Kiers
Is there a reason not to use `preg_split` instead of `preg_match_all`? It seems like a more natural fit IMO.
prodigitalson
That's awesome! I'll have to study the code for a bit to figure out what just happened! Thanks
timofey
@prodigitalson: no, with `preg_split(...)` you cannot account for escaped characters. `preg_match_all(...)` "behaves" more like a parser, which is the more natural thing to do here. Besides, with `preg_split(...)` you'd need to look ahead from each space to count how many quotes follow it, making it an `O(n^2)` operation: no problem for small strings, but it might noticeably increase the running time when larger strings are involved.
Bart Kiers
@timofey, see my edit. Don't hesitate to ask for more clarification if it's not clear to you: you're the one maintaining the code, so you should understand it (and I'm more than happy to provide extra information if it's needed).
Bart Kiers
Thanks Bart K.! I was already searching Google for answers on that one :)
timofey
But what if I want to split Lorem ipsum %22dolor sit amet%22 consectetur %22adipiscing elit%22 dolor (basically the quotation marks are given as %22)? The following doesn't seem to work: preg_match_all('/%22(?:\\\\.|[^\\\\"])*%22|\S+/', $text, $matches);
timofey
@timofey, see the edited version.
Bart Kiers
That's beginning to make sense! Thanks
timofey
No problem timofey.
Bart Kiers
@Bart K.: I see... Thanks for the info!
prodigitalson