I am trying to match quoted strings within a piece of text, allowing for escaped quotes inside them. I tried this regular expression in an online tester, and it works perfectly. However, when I use it with preg_match_all, it fails at the first escaped quote.

Here is the code:

$parStr = 'title="My Little Website"
    year="2007"
    description="Basic website with ..."
    tech="PHP, mySQL"
    link="<a href=\"http://test.com\"&gt;test.com&lt;/a&gt;"
';
$matches = array();

preg_match_all('/(\w+)\s*=\s*"(([^\\"]*(\\.)?)*)"/', $parStr, $matches, PREG_SET_ORDER); // Match[4][0] is 'link="<a href=\"'

It fails on the last match, only matching up until the first escaped quote. When I try this expression at http://www.regexplanet.com/simple/index.html, it works perfectly.

The pertinent part of the regex is:

"(([^\\"]*(\\.)?)*)"

This should consume all text up to a backslash or a quote, then optionally consume one escaped character, repeating that process zero or more times until an unescaped quote is found, at which point the match is complete.

Why does this not work in PHP, and how should it be fixed?

A: 

I tried it on Linux Fedora PHP 5.2.6 and it seems to work fine. The output is:

[wally@zf ~]$ php -f z.php
title="My Little Website"
    year="2007"
    description="Basic website with ..."
    tech="PHP, mySQL"
    link="<a href=\"http://test.com\"&gt;test.com&lt;/a&gt;"
wallyk
Curious, because it doesn't work on mine - Ubuntu 9.10 with PHP 5.2.10
Scott Daniels
A: 

How about this?

preg_match_all('/(\w+)\s*=\s*"((?:.*?\"?)*)"/', $parStr, $matches, PREG_SET_ORDER);

It gives me this:

[1] => link
[2] => <a href=\"http://test.com\"&gt;test.com&lt;/a&gt;

Inside [], each character is treated individually. So [^\\"] does not mean "anything except the sequence \""; it means "any character that is neither \ nor "".
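There is also a second layer of escaping to keep in mind: PHP processes the single-quoted string literal before PCRE ever sees the pattern, and in a single-quoted literal \\ collapses to one backslash. The sketch below shows what that does to the pattern from the question. The shortened $parStr and the $fixed pattern (which uses four backslashes and an alternation instead of the nested optional group) are illustrations introduced here, not code from the question.

```php
<?php
// Shortened sample input (an assumption for illustration); \" stays as \" in single quotes.
$parStr = 'title="My Little Website" link="<a href=\"http://test.com\">test.com</a>"';

// The literal from the question. After PHP collapses \\ to \, PCRE receives
//   /(\w+)\s*=\s*"(([^\"]*(\.)?)*)"/
// where [^\"] excludes only the quote and (\.) matches a literal dot.
$asWritten = '/(\w+)\s*=\s*"(([^\\"]*(\\.)?)*)"/';

// Four backslashes in the literal give PCRE two, i.e. one literal backslash:
//   /(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"/
$fixed = '/(\w+)\s*=\s*"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($asWritten, $parStr, $a, PREG_SET_ORDER);
preg_match_all($fixed, $parStr, $b, PREG_SET_ORDER);

echo $a[1][2], "\n"; // <a href=\
echo $b[1][2], "\n"; // <a href=\"http://test.com\">test.com</a>
```

With the pattern as written, the value stops at the first escaped quote, which matches the behaviour described in the question. Doubling the backslashes lets the character class exclude the backslash as well, so escaped characters are consumed as pairs.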

UPDATE for multiple values on the same line

preg_match_all('/(\w+)\s*=\s*"((?:[^\\\]*?(?:\\\")?)*?)"/', $parStr, $matches, PREG_SET_ORDER);

Sample source string:

$parStr = 'title="My Little Website" year="2007" description="Basic website with ..." tech="PHP, mySQL" tech="PHP, mySQL" link="test.com" link="test.com" tech="PHP, mySQL" ';

Matches:

Array
(
    [0] => Array
        (
            [0] => title="My Little Website"
            [1] => title
            [2] => My Little Website
        )

    [1] => Array
        (
            [0] => year="2007"
            [1] => year
            [2] => 2007
        )

    [2] => Array
        (
            [0] => description="Basic website with ..."
            [1] => description
            [2] => Basic website with ...
        )

    [3] => Array
        (
            [0] => tech="PHP, mySQL"
            [1] => tech
            [2] => PHP, mySQL
        )

    [4] => Array
        (
            [0] => tech="PHP, mySQL"
            [1] => tech
            [2] => PHP, mySQL
        )

    [5] => Array
        (
            [0] => link="<a href=\"http://test.com\"&gt;test.com&lt;/a&gt;"
            [1] => link
            [2] => <a href=\"http://test.com\"&gt;test.com&lt;/a&gt;
        )

    [6] => Array
        (
            [0] => link="<a href=\"http://test.com\"&gt;test.com&lt;/a&gt;"
            [1] => link
            [2] => <a href=\"http://test.com\"&gt;test.com&lt;/a&gt;
        )

    [7] => Array
        (
            [0] => tech="PHP, mySQL"
            [1] => tech
            [2] => PHP, mySQL
        )

)

Personally, I don't really like parsing HTML with regex, but I don't see another option to suggest, so this is just a quick and dirty way. For a big project or big files, I suggest finding a real parser.

S.Mark
This code does not work when there are multiple quoted strings on a single line. It fails to find the 2nd closing quote. I think the problem is in \"?: the ? applies only to the ", not to the \.
Scott Daniels
I have added another regex
S.Mark
A: 

I do not know why it doesn't work for one particular version of PHP, but using the idea of a non-greedy match, I came up with this pattern, which does work:

"(.*?[^\\\])"

It non-greedily matches everything until it encounters a double quote that is not preceded by an escape character. For some peculiar reason, three backslashes are needed, or PHP complains of an unmatched bracket. I suspect the bracket needs a backslash before it, but I am not sure. Can anyone confirm why three backslashes are needed?
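A small sketch that may help with the backslash count, assuming the pattern is written in a single-quoted PHP literal as in the question. In such a literal, \\ becomes one backslash and any other backslash sequence is kept as written, so one or two backslashes before ] both hand PCRE an escaped bracket (an unterminated class), while three or four both hand it a class that excludes the backslash. The $subject string is a hypothetical sample, not taken from the question.

```php
<?php
// Single-quoted PHP literals: \\ collapses to one backslash, \' to a quote,
// and any other backslash sequence is kept as written.
$one   = '[^\]';     // PHP string: [^\]   -> PCRE sees an escaped ], class never closes
$two   = '[^\\]';    // PHP string: [^\]   (same as $one)
$three = '[^\\\]';   // PHP string: [^\\]  -> PCRE sees "any character except a backslash"
$four  = '[^\\\\]';  // PHP string: [^\\]  (same as $three)

var_dump($one === $two);     // bool(true)
var_dump($three === $four);  // bool(true)

// Hypothetical sample input with escaped quotes inside the first value.
$subject = 'a="x \"y\" z" b="w"';
preg_match('/"(.*?[^\\\])"/', $subject, $m);
echo $m[1], "\n";            // x \"y\" z
```

Note that this pattern requires at least one character between the quotes, so it would not handle an empty value such as "" on its own.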

Scott Daniels