ansaurus

Question

How to correctly parse a mixed Latin/ideographic full-text query with regex?

Answer 1

+1 A:

The problem appears to be with the regex [^\12544-\65519]. That looks like it's supposed to be a range defined by two five-digit octal escapes, but it doesn't work that way. The actual breakdown is like this:

\125 => octal escape for 'U'
4    => '4'
4    => '4'
-
\655 => octal escape for... (something)
1    => '1'
9    => '9'

Which is effectively the same as:

[^14-\655]

What \655 means as the top of a range isn't clear, but the character class matches anything except a '1', a '4', or any ASCII character with a code point higher than '4' (which includes '9' and 'U'). It doesn't really matter though; the important point is that octal escapes can contain a maximum of three digits, which makes them unsuitable for your needs. I suggest you use PHP's \x{nnn} hexadecimal notation instead.
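
As a rough sketch of that suggestion: assuming the numbers in the original pattern were meant as decimal code points, 12544 and 65519 are 0x3100 and 0xFFEF, so the range can be written with two \x{...} escapes inside the character class. The u modifier is needed so that PCRE treats the pattern and the subject as UTF-8 and accepts code points above 0xFF. The function name keepWideRange is made up for illustration.

```php
<?php
// Hypothetical helper: replace every run of characters outside
// U+3100..U+FFEF with a single space.
// Assumes the input is valid UTF-8; the "u" modifier switches PCRE to
// UTF-8 mode so \x{...} can name code points above 0xFF.
function keepWideRange($str) {
    return preg_replace('#[^\x{3100}-\x{FFEF}]+#u', ' ', $str);
}

echo keepWideRange("korea拼接條紋"), "\n"; // " 拼接條紋"
```

Note that this particular range starts above the hiragana and katakana blocks (U+3040 to U+30FF), so Japanese kana would be replaced as well.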

Alan Moore 2009-07-05 01:08:31

Thanks Alan. I've been trying to use \x{nnn} as you suggested, but I can't find any documentation on how to match an entire range in hexadecimal notation. I can match a single character, but I can't work out how to block out an entire range. Any suggestions? I've also been trying the \p{Latin} notation, but it gives me similar problems. For example, "hello 1234567890 蓄積した abc123def" becomes "hell 1234567890 ab 23def" when using preg_replace('#\P{Nd}\P{Latin}#u', ' ', $str).

justinl 2009-07-06 19:26:05

I'm still not clear on what you're trying to do. Can you show us the correct output for your test case? Please edit your question and put it there, not in a comment.

Alan Moore 2009-07-06 20:23:49

Thank you Alan, I have updated my question and will provide further additions/questions within the body of the question instead of this comment field.

justinl 2009-07-07 05:32:34

Answer 2

+1 A:

I'm not set up to work with either PHP or Chinese, so I can't give you a definitive answer, but this should at least help you refine the question. As I see it, it's basically a four-step process:

1. get rid of undesirable characters like punctuation, replacing them with whitespace
2. normalize whitespace: get rid of leading and trailing spaces, and collapse runs of two or more spaces to one space
3. normalize case: replace any uppercase letters with their lowercase equivalents
4. wherever a Chinese character is next to another non-whitespace character, separate the two characters with a space

For the first three steps, the first line of the code you posted should suffice:

$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));

For the final step, I would suggest lookarounds:

$str = preg_replace(
    '#(?<=\S)(?=\p{Chinese})|(?<=\p{Chinese})(?=\S)#u',
    ' ', $str);

That should insert a space at any position where the next character is Chinese and the previous character is not whitespace, or the previous character is Chinese and the next character is not whitespace.
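
A small sketch of how that step behaves on a sample string. It assumes \p{Han} as a stand-in for \p{Chinese}, since PCRE names the script property Han and does not define a Chinese property; the function name spaceOutHan is hypothetical.

```php
<?php
// Hypothetical helper illustrating the lookaround step.
// Assumption: \p{Han} stands in for \p{Chinese}, which PCRE does not define.
function spaceOutHan($str) {
    // Both alternatives match an empty position between two characters,
    // so nothing is consumed; preg_replace only inserts a space there.
    return preg_replace(
        '#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u',
        ' ', $str);
}

echo spaceOutHan("korea拼接abc 旅行"), "\n"; // korea 拼 接 abc 旅 行
```

Existing spaces are left alone because neither alternative matches when the neighbouring character is whitespace.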

Alan Moore 2009-07-07 21:55:46

I tried using your lookaround method, but when I tried it on my string (after replacing {Chinese} with the appropriate {Han} Unicode script) it ended up parsing out the string entirely. I continued to experiment with Unicode scripts and ended up with something I was happy with, which I posted below. Thanks for all your help!

justinl 2009-07-07 23:53:15

Answer 3

A:

After further research, and with the help of Alan's comments, I was able to find the right regex combination for a query-parsing function that separates Latin and ideographic (Chinese/Japanese) characters, and I'm happy with the result:

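function prepareString($str) {
    // Replace runs of anything that is not a letter or digit with one space, then trim and lowercase.
    $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}]+#u', ' ', $str)));
    // Non-Han words: blank out every Han character.
    $nonHan = preg_replace('#\p{Han}#u', ' ', $str);
    // Han characters: split on non-Han characters and between adjacent Han characters, one element each.
    $han = implode(' ', preg_split('#\P{Han}?#u', $str, -1, PREG_SPLIT_NO_EMPTY));
    // Join both parts and collapse repeated whitespace.
    return trim(preg_replace('#\s\s+#u', ' ', $nonHan . ' ' . $han));
}

$query = "米娜Mi-NaNa日系時尚館╭☆ 旅行 渡假風格 【A6402】korea拼接條紋口袋飛鼠棉";

echo prepareString($query); //"mi nana a6402 korea 米 娜 日 系 時 尚 館 旅 行 渡 假 風 格 拼 接 條 紋 口 袋 飛 鼠 棉"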

Disclaimer: I cannot read Mandarin, and the string above was copied from a Chinese website. If it says anything offensive, please let me know and I will remove it.

justinl 2009-07-07 23:51:16
