ansaurus

Question

Extracting Urdu/Arabic phrases/sentences from a string

Answer 1


That's because the capturing group ([\x{0600}-\x{06FF}]+\s*) is matched multiple times, each time overwriting what it matched the previous time. You could get the expected output by simply converting it to a non-capturing group -- (?:[\x{0600}-\x{06FF}]+\s*) -- but here's a more correct alternative:

$pattern = "#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u";

The first [\x{0600}-\x{06FF}]+ matches the first word, then if there's some whitespace followed by another word, (?:\s+[\x{0600}-\x{06FF}]+)* matches it and any subsequent words. But it doesn't match any whitespace after the last word, which I presume you don't want.
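
For illustration, here is a minimal sketch of how that pattern might be used with preg_match_all. The sample string and variable names are made up for the example, and the pattern is written in single quotes so that PHP passes the \x{...} escapes to the regex engine untouched. The redundant outer group is dropped, which does not change what is matched.

```php
<?php
// Hypothetical input: Urdu phrases mixed with Latin text.
$text = 'Hello یہ ایک جملہ ہے world دوسرا جملہ end';

// One word from the Arabic block, then any number of
// "whitespace + word" repetitions. The /u modifier enables UTF-8 mode.
$pattern = '#[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*#u';

// $matches[0] holds each full match, i.e. one entry per phrase.
preg_match_all($pattern, $text, $matches);

foreach ($matches[0] as $phrase) {
    echo $phrase, "\n";
}
// Prints two lines:
// یہ ایک جملہ ہے
// دوسرا جملہ
```

Because there is no capturing group left in the pattern, each phrase comes back whole in $matches[0], without trailing whitespace.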

Alan Moore 2009-08-30 13:41:55

Thanks, Alan M. It works exactly as I wanted. I'll read up more on non-capturing groups.

Saadat 2009-08-30 16:44:42
