I'm trying to sanitize and format some input using a regex for a mixed Latin/ideographic (Chinese/Japanese/Korean) full-text search.
I found an old example of someone's attempt at sanitizing a Latin/Asian language string on a forum that I can no longer find (full credit to the original author of this code).
I am having trouble fully understanding the regex portion of the function, in particular why it treats the digits 0, 2, and 3 differently from the other Latin digits 1 and 4-9. The digits 1 and 4-9 are handled properly, but 0, 2, and 3 in the query are treated as if they were Asian characters.
For example, I am trying to sanitize the following string:
"hello 1234567890 蓄積した abc123def"
and it will turn into:
"hello 1 456789 abc1 def 2 3 0 蓄 積 し た 2 3"
The correct output for this sanitized string should be:
"hello 1234567890 蓄 積 し た abc123def"
As you can see, it properly spaces out the Asian characters, but the digits 0, 2, and 3 are treated differently from all the other digits. Any help on why the regex treats 0, 2, and 3 differently would be much appreciated (or a better way of achieving a similar result, if you know of one). Thank you!
I have included the function below:
    function prepareString($str) {
        $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
        return trim(preg_replace('#\s\s+#u', ' ',
            preg_replace('#([^\12544-\65519])#u', ' ', $str)
            . ' '
            . implode(' ', preg_split('#([\12544-\65519\s])?#u', $str, -1, PREG_SPLIT_NO_EMPTY))
        ));
    }
UPDATE: Providing context for clarity
I am authoring a website that will be launched in China. This website will have a search function and I am trying to write a parser for the search query input.
Unlike English, which uses a space (" ") as the delimiter between words in a sentence, Chinese does not put spaces between words. Because of this, I have to reformat a search query by splitting out each Chinese character and searching for each character individually in the database. Chinese users also use Latin/English characters for things such as brand names, which they can mix with Chinese characters (e.g. Ivy牛仔舖).
What I would like to do is separate the English words from the Chinese characters, and separate each Chinese character with a space.
A search query could look like this: Ivy牛仔舖
And I would want to parse it so that it looks like this: Ivy 牛 仔 舖
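To make the target behaviour concrete, here is a minimal sketch of the transformation I am after. It is not the forum function: the name `spaceOutCjk` is hypothetical, and it assumes that PCRE's Unicode script properties (`\p{Han}`, `\p{Hiragana}`, `\p{Katakana}`, `\p{Hangul}`) cover the characters that should be split out. It also skips the lowercasing and punctuation stripping that the original function does.

```php
<?php
// Hypothetical helper (not the original forum code): puts a space around
// every CJK character and leaves Latin words and digits untouched.
function spaceOutCjk(string $str): string
{
    // Assumption: these four Unicode scripts are the ones to split apart.
    $spaced = preg_replace('/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/u', ' $0 ', $str);

    // Collapse the resulting runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $spaced));
}

echo spaceOutCjk('Ivy牛仔舖'), "\n";                           // Ivy 牛 仔 舖
echo spaceOutCjk('hello 1234567890 蓄積した abc123def'), "\n"; // hello 1234567890 蓄 積 し た abc123def
```

Matching on script properties rather than numeric code point ranges is a design choice in this sketch; it keeps digits and Latin letters out of the "Asian" class by construction.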