I'm trying to sanitize and format some input using regex for a mixed Latin/ideographic (Chinese/Japanese/Korean) full-text search.

I found an old example of someone's attempt at sanitizing a Latin/Asian-language string on a forum that I can no longer find (full credit to the original author of this code).

I am having trouble fully understanding the regex portion of the function, in particular why it treats the digits 0, 2, and 3 differently from the other digits (1 and 4-9). It handles 1 and 4-9 properly, but 0, 2, and 3 in the query are treated as if they were Asian characters.

For example, I am trying to sanitize the following string:
"hello 1234567890 蓄積した abc123def"

and it will turn into:
"hello 1 456789 abc1 def 2 3 0 蓄 積 し た 2 3"

the correct output for this sanitized string should be:
"hello 1234567890 蓄 積 し た abc123def"

As you can see, it properly spaces out the Asian characters, but the digits 0, 2, and 3 are treated differently from all the other digits. Any help on why the regex treats 0, 2, and 3 differently would be greatly appreciated (or if you know of a better way of achieving a similar result). Thank you!

I have included the function below:


function prepareString($str) {
    $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));

    return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#([^\12544-\65519])#u', ' ', $str) . ' ' . implode(' ', preg_split('#([\12544-\65519\s])?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}

UPDATE: Providing context for clarity

I am authoring a website that will be launched in China. This website will have a search function and I am trying to write a parser for the search query input.

Unlike English, which uses a space (" ") as the delimiter between words in a sentence, Chinese does not put spaces between words. Because of this, I have to reformat a search query by breaking it apart into individual Chinese characters and searching for each character separately in the database. Chinese users will also use Latin/English characters for things such as brand names, which they can mix with Chinese characters (e.g. Ivy牛仔舖).

What I would like to do is separate all of the English words from the Chinese characters, and separate each Chinese character with a space.

A search query could look like this: Ivy牛仔舖

And I would want to parse it so that it looks like this: Ivy 牛 仔 舖

+1  A: 

The problem appears to be with the regex [^\12544-\65519]. That looks like it's supposed to be a range defined by two five-digit octal escapes, but it doesn't work that way. The actual breakdown is like this:

\125 => octal escape for 'U'
4    => '4'
4    => '4'
-    => range operator ('4' through \655)
\655 => octal escape for... (something)
1    => '1'
9    => '9'

Which is effectively the same as:

[^14-\655]

What \655 means as the top of a range isn't clear, but the character class matches anything except a '1', a '4', or any ASCII character with a code point higher than '4' (which includes '9' and 'U'). It doesn't really matter though; the important point is that octal escapes can contain a maximum of three digits, which makes them unsuitable for your needs. I suggest you use PHP's \x{nnn} hexadecimal notation instead.
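As a minimal sketch of the \x{nnn} notation used as a range: this assumes the two numbers in the original pattern were meant as decimal code points, in which case 12544 is U+3100 and 65519 is U+FFEF. The function name spaceOutRange is hypothetical. Note that hiragana and katakana sit below U+3100, so a range starting there would not cover し or た.

```php
<?php
// Assumption: 12544 and 65519 in the original pattern are decimal code points.
$low  = strtoupper(dechex(12544)); // "3100"
$high = strtoupper(dechex(65519)); // "FFEF"
echo "U+$low to U+$high\n";

// Hypothetical helper: put a space on both sides of every character in the
// range U+3100..U+FFEF, then collapse and trim the resulting whitespace.
function spaceOutRange(string $str): string {
    // With the u modifier, \x{...} takes a hexadecimal code point, and two
    // such escapes can form a range inside a character class.
    $spaced = preg_replace('#([\x{3100}-\x{FFEF}])#u', ' $1 ', $str);
    return trim(preg_replace('#\s+#u', ' ', $spaced));
}

echo spaceOutRange('Ivy牛仔舖'), "\n"; // Ivy 牛 仔 舖
```

The pattern is written in single quotes so that PHP passes the backslashes through to the regex engine unchanged.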

Alan Moore
Thanks Alan. I've been trying to use \x{nnn} as you suggested, but I can't find any documentation on how to match an entire range in hexadecimal notation. I can match a single character, but I can't work out how to match a whole range. Any suggestions? I've also been trying the \p{Latin} notation, but it gives me similar problems. For example, "hello 1234567890 蓄積した abc123def" becomes "hell 1234567890 ab 23def" when using preg_replace('#\P{Nd}\P{Latin}#u', ' ', $str).
justinl
I'm still not clear on what you're trying to do. Can you show us the correct output for your test case? Please edit your question and put it there, not in a comment.
Alan Moore
Thank you Alan, I have updated my question and will provide further additions/questions within the body of the question instead of this comment field.
justinl
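On the \P{Nd}\P{Latin} attempt in the comment above: written side by side, the two negated properties match a pair of consecutive characters (a non-digit followed by a non-Latin character), which would explain why "o " and "c1" disappear. Putting both properties inside one negated character class matches a single character that is neither. A small sketch, with the hypothetical function name keepLatinAndDigits:

```php
<?php
// Hypothetical helper: replace every run of characters that are neither
// decimal digits nor Latin-script letters with a single space.
function keepLatinAndDigits(string $str): string {
    return trim(preg_replace('#[^\p{Nd}\p{Latin}]+#u', ' ', $str));
}

echo keepLatinAndDigits('hello 1234567890 蓄積した abc123def'), "\n";
// hello 1234567890 abc123def
```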
+1  A: 

I'm not set up to work with either PHP or Chinese, so I can't give you a definitive answer, but this should at least help you refine the question. As I see it, it's basically a four-step process:

  • get rid of undesirable characters like punctuation, replacing them with whitespace

  • normalize whitespace: get rid of leading and trailing spaces, and collapse runs of two or more spaces to one space

  • normalize case: replace any uppercase letters with their lowercase equivalents

  • wherever a Chinese character is next to another non-whitespace character, separate the two characters with a space

For the first three steps, the first line of the code you posted should suffice:

$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));

For the final step, I would suggest lookarounds:

$str = preg_replace(
    '#(?<=\S)(?=\p{Chinese})|(?<=\p{Chinese})(?=\S)#u',
    ' ', $str);

That should insert a space at any position where the next character is Chinese and the previous character is not whitespace, or the previous character is Chinese and the next character is not whitespace.
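The four steps can be put together as a rough sketch. It assumes PCRE's script name \p{Han} in place of \p{Chinese} (as far as I know, PCRE does not define a property called Chinese), and the function name sanitizeQuery is hypothetical. Kana are not part of the Han script, so Japanese text would presumably need \p{Hiragana} and \p{Katakana} handled the same way.

```php
<?php
// Hypothetical helper combining the four steps described above.
function sanitizeQuery(string $str): string {
    // Steps 1-3: replace unwanted characters with a space, trim, lowercase.
    $str = mb_strtolower(
        trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)),
        'UTF-8'
    );

    // Step 4: both alternatives are zero-width, so nothing is consumed;
    // a space is inserted wherever a Han character touches another
    // non-whitespace character.
    return preg_replace(
        '#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u',
        ' ',
        $str
    );
}

echo sanitizeQuery('Ivy牛仔舖'), "\n";       // ivy 牛 仔 舖
echo sanitizeQuery('Hello, 牛仔abc!'), "\n"; // hello 牛 仔 abc
```

Unlike an approach that collects the Latin words and the ideographs separately, this keeps the tokens in their original order.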

Alan Moore
I tried using your lookaround method, but when I tried it on my string (after replacing {Chinese} with the appropriate {Han} Unicode script) it ended up parsing out the string entirely. I continued to experiment with Unicode scripts and ended up with something I was happy with, which I posted below. Thanks for all your help!
justinl
A: 

After further research, and with the help of Alan's comments, I was able to find the right regex combination for a query-parsing function that separates Latin and ideographic (Chinese/Japanese) characters in a way I'm happy with:

function prepareString($str) {
    $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}]+#u', ' ', $str)));
    return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#\p{Han}#u', ' ', $str) . ' ' . implode(' ', preg_split('#\P{Han}?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}

$query = "米娜Mi-NaNa日系時尚館╭☆ 旅行 渡假風格 【A6402】korea拼接條紋口袋飛鼠棉";

echo prepareString($query); //"mi nana a6402 korea 米 娜 日 系 時 尚 館 旅 行 渡 假 風 格 拼 接 條 紋 口 袋 飛 鼠 棉"

Disclaimer: I cannot read Mandarin, and the string above was copied from a Chinese website. If it says anything offensive, please let me know and I will remove it.

justinl