There are heaps of questions about this on this forum and on the web in general, but I just don't get it.

Here is my code:

function updateGuideKeywords($dal)
{
    $pattern = "/[^a-zA-Z-êàé]/";
    $keywords = preg_replace($pattern, '', $_POST['keywords']);
    echo json_encode($keywords);
}

Now, the input is Prêt-à-porter, and the output is "Pr\u00eat-\u00e0-porter".

Why do I get the '\u00e' ?

And how can I alter my pattern to include the characters ê, à and é ?

EDIT
Hmm... since it looks like a Unicode / character-encoding issue, I might go for the solution I found on this page.

Here they suggest doing something like this:

$chain="prêt-à-porter";

$pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");

$replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'a', 'A', 'A', 'c', 'C');

$chain = preg_replace($pattern, $replace, $chain);

EDIT 2
This is my solution so far:

function updateGuideKeywords()
{
    //First we replace characters with accents
    $pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");
    $replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'a', 'A', 'A', 'c', 'C');
    $shguideID = $_POST['shguideID'];
    $keywords = preg_replace($pattern, $replace, $_POST['keywords']);
    //Then we remove unwanted characters by only allowing a-z, A-Z, comma, 'minus' and white space
    $keywords = preg_replace("/[^a-zA-Z-,\s]/", "", $keywords);

    echo json_encode($keywords);
}
A: 

This may not be 100% accurate, but looking at the regex you're using, I don't think preg_replace() is the issue. I think the reason you are getting '\u00e' is PHP's poor support for character encodings.

shsteimer
Hmm... OK, so I might be better off using the solution I found here then: http://www.codeguru.com/forum/archive/index.php/t-318709.html
Steven
+4  A: 

"Pr\u00eat-\u00e0-porter" is a correct JavaScript string literal representation of Prêt-à-porter. I assume you're doing a json_encode at some point along the line?
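To illustrate the point above (a minimal sketch, not part of the original answer): the `\uXXXX` escapes are only how JSON writes the characters, and decoding the JSON gives back the original string. This assumes the source file is saved as UTF-8. `JSON_UNESCAPED_UNICODE` is a `json_encode()` flag available from PHP 5.4 onward.

```php
<?php
// Assumes this file is saved as UTF-8.
$keywords = "Prêt-à-porter";

// Default behaviour: non-ASCII characters are escaped as \uXXXX.
$escaped = json_encode($keywords);

// Decoding restores the original string, so nothing was lost.
$roundTrip = json_decode($escaped);

// With JSON_UNESCAPED_UNICODE the characters are emitted as-is.
$unescaped = json_encode($keywords, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";                 // "Pr\u00eat-\u00e0-porter"
var_dump($roundTrip === $keywords);  // bool(true)
echo $unescaped, "\n";               // "Prêt-à-porter"
```

On the browser side, parsing the response as JSON (rather than alerting the raw response text) yields the accented characters again.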

Note also that PHP's regular expressions are not Unicode-aware unless you add the /u modifier, so if you are using UTF-8 (which generally you want to be), the character ê is not a single character to the regex engine, but byte C3 followed by byte AA. That's fine for simple literal matches, but in a character class you're now matching the two bytes separately instead of one after the other, which can easily mess up your expression.
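A small sketch of that byte-versus-character difference, assuming the file is saved as UTF-8 (the variable names are made up for illustration). It also shows how a byte-wise character class can leave a stray half of a character behind.

```php
<?php
// Assumes this file is saved as UTF-8.
$e = "ê";

// In UTF-8, ê is two bytes: 0xC3 0xAA.
echo strlen($e), "\n";   // 2
echo bin2hex($e), "\n";  // c3aa

// Without /u, the class [ê] is really [\xC3\xAA] and matches one byte.
preg_match('/[ê]/', $e, $byteMatch);
echo strlen($byteMatch[0]), "\n";  // 1

// With /u, pattern and subject are read as UTF-8, so the class
// matches the whole two-byte character.
preg_match('/[ê]/u', $e, $charMatch);
echo strlen($charMatch[0]), "\n";  // 2

// Byte-wise classes can corrupt text: é is C3 A9. The C3 byte is
// "allowed" because ê contains it, the A9 byte is stripped.
$broken = preg_replace('/[^a-zê]/', '', "é");
echo bin2hex($broken), "\n";  // c3 (half a character, invalid UTF-8)
```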

bobince
Yes, you are correct. I'm returning the result with `echo json_encode($keywords);`, then I `alert` the result.
Steven
So maybe it's a better solution for me to replace special characters like `è` with `e` when saving it to database?
Steven
Why do you want to replace/remove perfectly good Unicode characters?
bobince
A: 

From what I see of your output, your characters are not being removed (as expected, since they are in your pattern), so the only thing happening is that the output is Unicode-escaped. Try changing your document to UTF-8 or encoding the characters as HTML entities and it should work. But beware: if you encode entities before replacing, the pattern won't detect the characters, as they will already have been converted.

Wolf
Using Firefox's Firebug, I can see that the server result is `"Pr\u00eat-\u00e0-porter"`. So I think it's the result from `preg_replace()` that "screws" up my characters (?).
Steven
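You're right, after further reading, preg_replace() doesn't handle Unicode too well by default. PHP 6 was meant to add full Unicode support with conversion functions and all, but it was never released; the /u pattern modifier is the practical way to get UTF-8 matching.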
Wolf
A: 

Your code, with the latest edits so far, works this way:

  1. The expression /[^a-zA-Z-êàé]/ means "match anything that's not an English letter, a minus sign, ê, à or é".

  2. preg_replace($pattern, '', 'Prêt-à-porter') returns 'Prêt-à-porter' since nothing matches.

  3. json_encode() returns the JSON representation of 'Prêt-à-porter', which is "Pr\u00eat-\u00e0-porter".

It's not clear to me what your exact goal is. If you want to remove anything that's not a minus sign or a letter (including accented letters), you can try this pattern:

/[^\p{L}-]/u
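As a hedged sketch of a Unicode-aware filter that keeps only letters and the minus sign: it assumes UTF-8 input and a PCRE build with Unicode property support (the usual case). The sample input is made up for illustration.

```php
<?php
// Assumes UTF-8 input and a PCRE build with Unicode property support.
$input = "Prêt-à-porter 2010!";

// \p{L} matches any Unicode letter; the trailing "-" is a literal minus sign.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$clean = preg_replace('/[^\p{L}-]/u', '', $input);

echo $clean, "\n";  // Prêt-à-porter
```

Note that with /u, preg_replace() returns null if the subject is not valid UTF-8, so real code should check for that before using the result.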
Álvaro G. Vicario
Yup. So I have added a possible solution. I think storing plain characters in the DB for later use in auto-complete is the best option. And by only allowing certain characters, I prevent SQL injection.
Steven
You prevent SQL injection by using bind parameters or the appropriate escape functions your DB library offers. Just imagine if databases were not able to store non-ASCII stuff!
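A minimal sketch of bind parameters with PDO, not taken from the thread. The table and column names are hypothetical, and an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for the real one.

```php
<?php
// Hypothetical schema for illustration; an in-memory SQLite database
// (requires the pdo_sqlite extension) stands in for the real one.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE guide_keywords (guide_id INTEGER, keywords TEXT)');

// Accented characters and SQL metacharacters are both left untouched.
$keywords = "Prêt-à-porter'; DROP TABLE guide_keywords; --";

// The value travels as a bound parameter, never spliced into the SQL text.
$insert = $pdo->prepare(
    'INSERT INTO guide_keywords (guide_id, keywords) VALUES (:id, :kw)'
);
$insert->execute([':id' => 1, ':kw' => $keywords]);

$select = $pdo->prepare(
    'SELECT keywords FROM guide_keywords WHERE guide_id = :id'
);
$select->execute([':id' => 1]);
$stored = $select->fetchColumn();

var_dump($stored === $keywords);  // bool(true)
```

With this approach the accented characters can be stored unchanged, and any stripping of characters becomes a matter of taste rather than security.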
Álvaro G. Vicario
A: 

If you want to replace 'é' with 'e', etc. use iconv() with the //TRANSLIT modifier

e.g.,

$newString = iconv('UTF-8', 'ASCII//TRANSLIT', $myString);

A more complete example:

$ cat scratch.php
<?php
$x = "Prêt-à-porter";
var_dump(json_encode(iconv("UTF-8", "ASCII//TRANSLIT", $x)));


$ php scratch.php
string(15) ""Pret-a-porter""
$ 
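One way the transliteration could be combined with the whitelist from the question, as a sketch only: `normalizeKeywords()` is a hypothetical helper name, and the exact result of `//TRANSLIT` depends on the iconv implementation and locale, so no specific output is shown here.

```php
<?php
// normalizeKeywords() is a hypothetical helper name, not from the original post.
function normalizeKeywords(string $raw): string
{
    // Transliterate accented letters to ASCII; iconv() returns false on failure.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $raw);
    if ($ascii === false) {
        return '';
    }
    // Keep only ASCII letters, commas, whitespace and the minus sign.
    // This also drops any marks some iconv builds emit for accents.
    return (string) preg_replace('/[^a-zA-Z,\s-]/', '', $ascii);
}

echo json_encode(normalizeKeywords("Prêt-à-porter, café")), "\n";
```

Because the result is plain ASCII, json_encode() has nothing to escape, which is why the `\u00ea` sequences no longer appear.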
Frank Farmer