I want to extract Urdu phrases out of a user-submitted string in PHP. For this, I tried the following test code:
$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    print_r($matches);
} else {
    echo 'No matches.';
}
Now if, for example, $string contains:
In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.
I get the following output:
Array ( [0] => Array ( [0] => دنیا گول ہے [1] => ہے ) [1] => Array ( [0] => آوارہ گرد کی ڈائری [1] => ڈائری ) [2] => Array ( [0] => ابن بطوطہ کے تعاقب میں [1] => میں ) )
Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), I also get undesired ones (ہے, ڈائری, and میں), each of which is actually the last word of its phrase. Can anyone please point out how I can avoid the undesired matches?
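The extra entries at index 1 appear to be the contents of the capturing group. When a capturing group is repeated, PCRE keeps only the text from its last iteration, which here is the final word of each phrase. A minimal sketch of one way to avoid this, assuming only the full phrases are needed, is to make the group non-capturing with `(?:...)` and read `$matches[0]`. The variable name `$phrases` is introduced here purely for illustration.

```php
<?php
// Same character range as the original pattern, but the repeated group is
// non-capturing, so only the full match (index 0) is recorded.
$pattern = '#(?:[\x{0600}-\x{06FF}]+\s*)+#u';

// Sample input taken from the question.
$string = 'In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.';

// $phrases is a hypothetical name for the flat list of extracted phrases.
$phrases = [];
if (preg_match_all($pattern, $string, $matches)) {
    // With the default PREG_PATTERN_ORDER flag, $matches[0] is a flat
    // array containing every full match.
    $phrases = $matches[0];
}

print_r($phrases);
```

Because the trailing `\s*` is still part of the pattern, a phrase that is followed by whitespace in other inputs may come back with that whitespace attached; trimming each result afterwards is one way to handle that.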