I need to replace Microsoft Word's versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ") due to an encoding issue in my application. I do not need them to be HTML entities and I cannot change my database schema.

I have two options: use either a regular expression or an associative array.

Is there a better way to do this?

+7  A: 

Since you only want to replace a few specific, well-identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery that regex brings ;-)

And if you encounter some other special characters (damn copy-paste from Word...), you can just add them to that array as they are identified.


EDIT: the best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

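// Replaces Windows-1252 (CP1252) smart-quote bytes with plain ASCII.
// chr() yields single bytes, so this matches CP1252 input only, not UTF-8.
function convert_smart_quotes($string)
{
    $search = array(chr(145),   // left single quote
                    chr(146),   // right single quote
                    chr(147),   // left double quote
                    chr(148),   // right double quote
                    chr(151));  // em dash

    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');

    return str_replace($search, $replace, $string);
}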

(I don't have MS Word on this computer, so I can't test it myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...
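
One likely reason a chr()-based search array fails is that the input is UTF-8, where each smart quote is a three-byte sequence rather than a single Windows-1252 byte. Here is a minimal sketch that picks the byte patterns based on whether the string is valid UTF-8. The function name convert_smart_quotes_any is hypothetical (it is not from the linked page), and it assumes the input is either UTF-8 or Windows-1252 and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: choose the search patterns from the detected encoding.
// Assumption: the input is either valid UTF-8 or Windows-1252.
function convert_smart_quotes_any($string)
{
    if (mb_check_encoding($string, 'UTF-8')) {
        // UTF-8: each smart quote is a three-byte sequence.
        $map = array(
            "\xE2\x80\x98" => "'",  // U+2018 left single quote
            "\xE2\x80\x99" => "'",  // U+2019 right single quote
            "\xE2\x80\x9C" => '"',  // U+201C left double quote
            "\xE2\x80\x9D" => '"',  // U+201D right double quote
            "\xE2\x80\x94" => '-',  // U+2014 em dash
        );
    } else {
        // Windows-1252: the same characters are single bytes (chr(145) etc.).
        $map = array(
            "\x91" => "'",
            "\x92" => "'",
            "\x93" => '"',
            "\x94" => '"',
            "\x97" => '-',
        );
    }

    return strtr($string, $map);
}

// The same text in both encodings normalizes to the same ASCII result.
$fromCp1252 = convert_smart_quotes_any("\x93quoted\x94");
$fromUtf8   = convert_smart_quotes_any("\xE2\x80\x9Cquoted\xE2\x80\x9D");
```

Checking the encoding first matters because bytes such as 0x91 to 0x97 also occur inside unrelated UTF-8 characters, so running the single-byte replacements on UTF-8 text could corrupt it.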

Pascal MARTIN
How would you specify the MS characters?
Misha M
This is what I was looking for. Thanks. The search array did not work as is, I ended up using the Hex version that was provided in the comments from the link you gave above.
Misha M
OK :-) Thanks for the information!
Pascal MARTIN
dotty
+2  A: 

Your Microsoft-encoded quotes are probably the typographic quotation marks. You can simply replace them with str_replace if you know the encoding of the string in which you want to replace them.

Here’s an example for UTF-8, using a single mapping array with strtr instead:

$quotes = array(
    "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
    "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);

If you need another encoding, you can use mb_convert_encoding to convert the keys.
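
Converting the keys could look roughly like the sketch below. The helper name convert_quote_map and the shortened four-entry map are only for illustration, and Windows-1252 is assumed as the target encoding.

```php
<?php
// A shortened UTF-8 keyed map (subset of the full table above).
$quotesUtf8 = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);

// Hypothetical helper: re-encode every key from UTF-8 to the target encoding.
function convert_quote_map(array $map, $targetEncoding)
{
    $converted = array();
    foreach ($map as $utf8Key => $replacement) {
        $key = mb_convert_encoding($utf8Key, $targetEncoding, 'UTF-8');
        $converted[$key] = $replacement;
    }
    return $converted;
}

// Assumption: the data to clean is stored as Windows-1252.
$quotesCp1252 = convert_quote_map($quotesUtf8, 'Windows-1252');

$input  = "\x93Hello\x94 and \x91bye\x92"; // Windows-1252 bytes
$result = strtr($input, $quotesCp1252);
echo $result; // prints: "Hello" and 'bye'
```

This only makes sense for target encodings that can actually represent the quote characters. If the target cannot represent them, the converted keys will not be meaningful, so it is worth inspecting the resulting array before using it.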

Gumbo
Rather than the ugly `\x` escapes, couldn't you simply include the literal characters in your source file?
R..
@R..: That’s the problem: There are many that don’t know enough about character encodings and/or what character encoding they’re using.
Gumbo
+2  A: 

We used the following. Deals with a few more special characters.

$text = str_replace(chr(130), ',', $text);    // baseline single quote
$text = str_replace(chr(132), '"', $text);    // baseline double quote
$text = str_replace(chr(133), '...', $text);  // ellipsis
$text = str_replace(chr(145), "'", $text);    // left single quote
$text = str_replace(chr(146), "'", $text);    // right single quote
$text = str_replace(chr(147), '"', $text);    // left double quote
$text = str_replace(chr(148), '"', $text);    // right double quote

$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
ceejayoz