I want to extract Urdu phrases out of a user-submitted string in PHP. For this, I tried the following test code:

$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    print_r($matches);
} else {
    echo 'No matches.';
}

Now if, for example, $string contains

In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.

I get the following output:

Array
(
    [0] => Array
        (
            [0] => دنیا گول ہے
            [1] => ہے
        )

    [1] => Array
        (
            [0] => آوارہ گرد کی ڈائری
            [1] => ڈائری
        )

    [2] => Array
        (
            [0] => ابن بطوطہ کے تعاقب میں
            [1] => میں
        )

)

Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), I also get undesired ones (ہے, ڈائری, and میں -- each of which is actually the last word of its phrase). Can anyone please point out how I can avoid the undesired matches?

+2  A: 

That's because the capturing group ([\x{0600}-\x{06FF}]+\s*) is matched multiple times, each time overwriting what it captured the previous time, so group 1 ends up holding only the last word of each phrase. You could get the expected output by simply converting it to a non-capturing group -- (?:[\x{0600}-\x{06FF}]+\s*) -- but here's a more correct alternative:

$pattern = "#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u";

The first [\x{0600}-\x{06FF}]+ matches the first word, then if there's some whitespace followed by another word, (?:\s+[\x{0600}-\x{06FF}]+)* matches it and any subsequent words. But it doesn't match any whitespace after the last word, which I presume you don't want.
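For reference, here is a minimal sketch of how that pattern could be used end to end, with the sample sentence from the question as input. The variable name $phrases is just illustrative, and I've dropped PREG_SET_ORDER on the assumption that you only need the whole phrases: with no capturing groups left, the default ordering puts every full match in $matches[0].

```php
<?php
// Sample input taken from the question.
$string = 'In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.';

// One word, then zero or more "whitespace + word" repetitions.
// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
$pattern = '#[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*#u';

// Default flag is PREG_PATTERN_ORDER, so $matches[0] is a flat list of full matches.
$count = preg_match_all($pattern, $string, $matches);
$phrases = $count ? $matches[0] : [];

print_r($phrases);
```

This should print a flat array of the three phrases, without the trailing-word entries and without any whitespace after the last word. Keep in mind that the range U+0600 to U+06FF is the whole Arabic block, so Arabic-script punctuation such as the Arabic comma (U+060C) would be matched as part of a phrase too.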

Alan Moore
Thanks, Alan M. It works exactly as I wanted it to. I'll read up more on non-capturing groups.
Saadat