A: 

Why not run the string through htmlspecialchars() and output it to see what it turns that character into, so you know what to use as your replace expression?

I tried that but nothing happens. That comma stays as it is :(
richard
+1  A: 

To replace it:

If your script file is encoded in the same encoding as the data you are trying to do the replacement in, it should work the way you posted it. If you're working with UTF-8 data, make sure the script is encoded in UTF-8 and it's not your editor silently transliterating the character when you paste it.

If it won't work, try escaping it as described below and see what code it returns.
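
One way to check whether the script and the data really agree is to compare their raw bytes. This is only a sketch: `$fromDatabase` is a hypothetical stand-in for the value you fetched, hard-coded here as Windows-1252 bytes (an assumption about your data) so that the mismatch is visible.

```php
<?php
// The needle as it would be stored in a UTF-8 encoded script file:
// U+2019 RIGHT SINGLE QUOTATION MARK is the three bytes E2 80 99.
$needle = "\xE2\x80\x99";

// Hypothetical sample data: in Windows-1252 the same character is the single byte 0x92.
$fromDatabase = "It\x92s";

echo bin2hex($needle), "\n";       // e28099
echo bin2hex($fromDatabase), "\n"; // 49749273

// str_replace() and strpos() compare bytes, so they only match when the encodings agree.
$found = strpos($fromDatabase, $needle) !== false;
var_dump($found); // bool(false)
```

If the two hex dumps do not contain the same byte sequence for the quote, no plain string replacement will ever match.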

To escape it:

If your source file is encoded in UTF-8, this should work:

$string = htmlentities($string, ENT_QUOTES, "UTF-8");

the default character set of html... is iso-8859-1. Anything differing from that must be explicitly stated.

For more complex character conversion issues, always check out the User Contributed Notes to functions like htmlentities(), there are often real gems to be found there.
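
As a rough illustration of the escaping call above, this sketch shows what htmlentities() returns for valid UTF-8 and for input that is not valid UTF-8. The sample strings are made up; the Windows-1252 one is only an assumption about what mis-encoded data might look like.

```php
<?php
$valid   = "It\xE2\x80\x99s"; // "It’s" with U+2019 encoded as UTF-8
$invalid = "It\x92s";         // the same text as Windows-1252 bytes (assumed sample)

$escaped = htmlentities($valid, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // It&rsquo;s

// Without ENT_SUBSTITUTE or ENT_IGNORE, input that is not valid UTF-8
// makes htmlentities() return an empty string.
$broken = htmlentities($invalid, ENT_QUOTES, 'UTF-8');
var_dump($broken === ''); // bool(true)
```

An empty result from htmlentities() is therefore a hint that the string is not actually in the encoding you passed as the third argument.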

In General:

Bobince is right in his comment, systemic character set problems should be sorted systematically so they don't bite you in the ass - if only by defining which character set is used on every step of the way:

  • How the script file is encoded;
  • How the document is served;
  • How the data is stored in the database;
  • How the database connection is encoded.
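
A minimal sketch of stating the encoding at each of those steps. The host, database name and credentials are placeholders, and `utf8mb4` is my assumption for a MySQL connection charset; adjust it to whatever your tables actually use.

```php
<?php
// 1. Script file: save this file as UTF-8 in your editor (nothing to do in code).

// 2. Document: declare the encoding in the HTTP header as well as in the meta tag.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// 3. Storage: the tables and columns should use a UTF-8 charset (checked in MySQL itself).

// 4. Connection: ask for UTF-8 on the connection. Placeholder values, not real settings.
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8mb4';
// $pdo = new PDO($dsn, 'user', 'password');      // uncomment with real credentials
// With mysqli the equivalent call is: $mysqli->set_charset('utf8mb4');
```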
Pekka
The file is saved UTF-8. HTML meta tag is UTF-8. Database UTF-8. Database connection... Call to undefined function. That's unfortunate, I believe that might be the problem.
richard
+4  A: 

This has happened to me too. A couple of things:

  • Use htmlentities function for your text

    $my_text = htmlentities($string, ENT_QUOTES, 'UTF-8');

More info about the htmlentities function.

  • Use proper document type, this did the trick for me.

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

  • Use utf-8 encoding type in your page:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Here is the final prototype for your page:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>    
<body>

<?php     
    // your code related to database        
    $my_text = htmlentities($string, ENT_QUOTES, 'UTF-8');    
?>

</body>
</html>

If you want to replace it however, try the mb_ereg_replace function.

Example:

mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

$my_text = mb_ereg_replace("’","'", $string);
Sarfraz
Thanks for your answer. I was indeed missing a doctype or any other HTML in fact. Once I added it and your htmlentities rule the `’` became a `�`. Once I switched Firefox to Western ISO charset it changed back to `’`. str_replace still doesn't work and unfortunately mb_ereg_replace does not work either.
richard
Try this: `str_replace("’", "'", $string);` and also try removing the `htmlentities` function and then see.
Sarfraz
With the html entities function the whole record just disappears. If I remove the function it shows the `�`. My database is UTF-8 but I can change it in PHPMyAdmin?
richard
@richard: yup, you can change the encoding in phpmyadmin, there is an option once you select a database.
Sarfraz
@Sarfraz: what should I change it to? I tried setting it to utf8_unicode_ci but that didn't change anything in the output. Thanks.
richard
@richard: it should be set to `utf8_unicode_ci` to allow foreign languages, but I'm not sure which charset that character belongs to. You might want to try one of the `latin` charsets too.
Sarfraz
+1  A: 

If you are using non-ASCII characters in your PHP code, you need to make sure that you’re using the same character encoding as in the data you are processing. Your attempt probably fails because you are using a different character encoding in your PHP script than in $string.

Additionally, if you’re using a multibyte character encoding such as UTF-8, you should also use the multibyte aware string functions.
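
To make the difference concrete, here is a small sketch comparing the byte-oriented functions with their mbstring counterparts. It assumes the mbstring extension is loaded; the sample string is made up.

```php
<?php
mb_internal_encoding('UTF-8');

$s = "It\xE2\x80\x99s"; // "It’s", with U+2019 stored as three UTF-8 bytes

echo strlen($s), "\n";    // 6, counts bytes
echo mb_strlen($s), "\n"; // 4, counts characters

// substr() can cut a character in half; mb_substr() cannot.
echo bin2hex(substr($s, 0, 3)), "\n";    // 4974e2, ends on a lone lead byte
echo bin2hex(mb_substr($s, 0, 3)), "\n"; // 4974e28099, the whole character is kept
```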

Gumbo
+2  A: 

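To find out what character it is, run it through the ord function. Note that ord only looks at the first byte of the string, so for a multibyte UTF-8 character it gives you the value of the lead byte, not an ASCII code or a Unicode code point:

echo ord('’'); // 226 (0xE2), the first of the three UTF-8 bytes E2 80 99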

Now that you know what it is, you can do this:

str_replace('’', chr(226), $string);
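
For comparison, a sketch of how to see the whole character rather than only its first byte. It assumes the data is UTF-8 and that mbstring is available; mb_ord() needs PHP 7.2 or later.

```php
<?php
$char = "\xE2\x80\x99"; // U+2019 as UTF-8

echo ord($char), "\n";             // 226, the lead byte 0xE2 only
echo bin2hex($char), "\n";         // e28099, the full byte sequence
echo mb_ord($char, 'UTF-8'), "\n"; // 8217, the code point 0x2019

// Replacing the full byte sequence with a plain ASCII apostrophe:
$fixed = str_replace($char, "'", "It\xE2\x80\x99s");
echo $fixed, "\n"; // It's
```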
Casey Hope
This just replaces the character with a copy of itself.
kingjeffrey
Good point, but the original poster's code does that too, so I figured I'd do the same.
Casey Hope
A: 

This character you have is the Right Single Quotation Mark.

To replace it with a pattern you'll want to do something like this

$string = preg_replace( "/\\x{2019}/u", 'replacement', $string );

But that really only addresses the symptom. The problem is that you don't have consistent use of character encodings throughout your application, as others have noted.
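
One thing to watch with the `/u` modifier: if the subject is not valid UTF-8, preg_replace() returns NULL instead of a string, which can look like the value has vanished. The sketch below uses a made-up Windows-1252 sample; that your data is Windows-1252 is only a guess, and mb_convert_encoding() assumes mbstring is loaded.

```php
<?php
$pattern = '/\x{2019}/u';

// Valid UTF-8 input: the replacement works.
$utf8 = "It\xE2\x80\x99s";
$ok = preg_replace($pattern, "'", $utf8);
var_dump($ok); // string(4) "It's"

// Assumed sample: the same text as Windows-1252 bytes, which is not valid UTF-8.
$cp1252 = "It\x92s";
$result = preg_replace($pattern, "'", $cp1252);
var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Converting to UTF-8 first makes the same pattern match.
$converted = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');
$repaired = preg_replace($pattern, "'", $converted);
var_dump($repaired); // string(4) "It's"
```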

Peter Bailey
With your replace pattern, the row containing the Right Single Quotation Mark returns empty. I wouldn't know what to change now. MySQL charset is UTF-8, collation utf8_unicode_ci and html meta tag utf-8.
richard
I'm not really sure what you mean by "returns empty". Can you be more explicit?
Peter Bailey
I am querying 12 rows, and 1 row contains that comma. Only 11 rows are being returned by PHP with this replace expression.
richard
PHP doesn't return rows. A SQL server does. And even then, `preg_replace()` operates on a single string - a column perhaps - not a "row". You are still confusing me. Did you try moving this regular expression into a SQL query?
Peter Bailey
preg_replace is a heavy function for a static character. Without the need for complex regex pattern matching, there is no reason to use the comparatively slow preg_replace. str_replace would be much faster.
kingjeffrey
@kingjeffrey - that's totally a micro optimization. Accurate? Sometimes (preg_replace can be faster in unexpected ways). Relevant? No, I don't think so.
Peter Bailey
@Peter Bailey "that's totally a micro optimization"... str_replace() performs 50% faster than preg_replace() for simple string replacement. I threw this benchmark together: http://test.kingdesk.com/preg-replace-v-str-replace/. For complex applications (such as the wp-Typography plugin: http://wordpress.org/extend/plugins/wp-typography/), it can save seconds off of complex parsing operations.
kingjeffrey
+1  A: 

Gumbo said it right:
- save your script as a UTF-8 file
- and use http://php.net/mbstring (as Sarfraz pointed out in his last example)

mmcteam.com.ua
A: 

Don't use any regex functions (preg_replace or mb_ereg_replace). They are way too heavy for this.

$string = str_replace("\xE2\x80\x99", "'", $string); // "\xE2\x80\x99" is U+2019 encoded as UTF-8

If your needle is a multibyte character, you may have better luck with this bespoke function:

<?php 
function mb_str_replace($needle, $replacement, $haystack) {
    $needle_len = mb_strlen($needle);
    $replacement_len = mb_strlen($replacement);
    $pos = mb_strpos($haystack, $needle);
    while ($pos !== false)
    {
        $haystack = mb_substr($haystack, 0, $pos) . $replacement
                . mb_substr($haystack, $pos + $needle_len);
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
    }
    return $haystack; 
} 
?>

credit for this last function: http://www.php.net/manual/en/ref.mbstring.php#86120
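
If the data is UTF-8, plain str_replace() is usually enough, because a complete UTF-8 byte sequence cannot match in the middle of another character. A small sketch, where `normalize_quotes` is a hypothetical helper name of mine and the list of characters is just an example:

```php
<?php
// Hypothetical helper: maps common typographic quotes (as UTF-8 bytes) to ASCII quotes.
function normalize_quotes(string $text): string
{
    $map = [
        "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
        "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    ];

    // str_replace() works on bytes, which is safe here because each needle
    // is a complete UTF-8 sequence.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo normalize_quotes("\xE2\x80\x9CIt\xE2\x80\x99s\xE2\x80\x9D"), "\n"; // "It's"
```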

kingjeffrey