I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€

What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?

I am working in PHP.

Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html

Some notes on my environment:

The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update: by explicitly setting the meta content-type).

On those pages the smart quotes and emdashes appear as a diamond with question mark.

Solution:

Thanks again for the responses. The solution was twofold:

  1. Make sure the database and HTML files are explicitly set to use UTF-8 encoding.
  2. Use htmlspecialchars() instead of htmlentities().
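A minimal sketch of the difference between the two functions, assuming the input really is UTF-8. The sample string is made up for illustration. A likely reason for the original symptom is that htmlentities() defaulted to ISO-8859-1 before PHP 5.4, so it read each UTF-8 byte as a separate character unless the charset was passed explicitly.

```php
<?php
// Sample input with smart quotes and an em dash, written as UTF-8 byte escapes.
$text = "\xE2\x80\x9CHello\xE2\x80\x9D \xE2\x80\x94 <b>world</b>";

// htmlspecialchars() escapes only &, <, > and quotes.
// The curly quotes and the dash stay as raw UTF-8 bytes.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() converts every character that has a named entity,
// so it has to be told the real input charset to read the bytes correctly.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
echo $entities, "\n";
```

Passing the charset argument explicitly keeps the behaviour the same across PHP versions and configurations.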
+5  A: 

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

theraccoonbear
+1  A: 

We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your php file is saved in the right encoding format, etc.

mspmsp
+2  A: 

It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.

Here is some info on migrating your database to another character encoding, at least for a MySQL database.

Kip
A: 

In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"

Patrick McElhaney
A: 

You could try mb_convert_encoding from ISO-8859-1 to UTF-8.

$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, replace them yourself with str_replace or preg_replace (PHP has no built-in mb_str_replace).

Greg
A: 

This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.

What we do is force the text through iconv

// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);

The //IGNORE flag means that anything that can't be translated will be thrown away. As the PHP manual puts it:

If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
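One plausible way this conversion can make things worse is when the input is already UTF-8: converting it again from a single-byte charset double-encodes every multi-byte character. A hedged sketch of a guard follows. The helper name toUtf8 and the Windows-1252 default are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not already valid UTF-8.
function toUtf8(string $input, string $assumedCharset = 'Windows-1252'): string
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8 subjects,
    // so a result of 1 means the string is already valid UTF-8.
    if (preg_match('//u', $input) === 1) {
        return $input; // converting again would double-encode it
    }

    // Word's smart quotes live in Windows-1252, so that is the assumed source charset.
    $converted = iconv($assumedCharset, 'UTF-8', $input);

    // iconv() returns false on failure; fall back to the untouched input in that case.
    return $converted === false ? $input : $converted;
}

// Windows-1252 bytes 0x93 and 0x94 are the left and right double smart quotes.
echo toUtf8("\x93Hi\x94"), "\n";
```

The check is a heuristic: a short Windows-1252 string can happen to be valid UTF-8 too, so fixing the encoding at the source remains the more reliable route.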

ConroyP
This seems like such a perfect "quick fix" but sadly it wound up making my test case significantly worse by adding *more* invalid characters.
Nicholas Kreidberg
+2  A: 

The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.

The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML pages:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

That way, the content type of the data submitted to PHP will also be the same.

I had a similar issue and adding the <meta> tag worked for me.
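For pages generated by PHP, both suggestions can be combined. The sketch below is only an illustration: renderPage is a hypothetical helper and the page markup is invented.

```php
<?php
// Hypothetical helper that builds a minimal UTF-8 page around user-supplied text.
function renderPage(string $userText): string
{
    // Escape only the HTML-special characters; smart quotes pass through as UTF-8.
    $body = htmlspecialchars($userText, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\"/>\n"
        . "</head>\n<body><p>{$body}</p></body>\n</html>\n";
}

// Declare the charset in the HTTP header as well; it must be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html;charset=utf-8');
}

echo renderPage("\xE2\x80\x9Cquoted\xE2\x80\x9D text");
```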

Ates Goral
It worked for me as well, thank you very much :D
Bruno De Barros
A: 

This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "&#8220;"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.

The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
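If replacing the characters is still wanted once the data is consistently UTF-8, the str_replace approach could look roughly like this. The helper name and the chosen ASCII replacements are assumptions made for illustration.

```php
<?php
// Hypothetical helper: swap common Word punctuation (as UTF-8 bytes) for ASCII.
function asciiPunctuation(string $text): string
{
    $map = [
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '-',   // U+2014 em dash
    ];

    // str_replace() is byte-based, which is safe here because each key
    // is a complete UTF-8 sequence.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo asciiPunctuation("\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D \xE2\x80\x94 really"), "\n";
```

This only works if the string PHP receives is actually UTF-8. If it arrives in another encoding, the byte sequences in the map will not match.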

Domenic
A: 

You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).

In practice this means running a query SET NAMES 'utf8';

http://www.phpwact.org/php/i18n/utf-8/mysql

Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.

Joeri Sebrechts
A: 

The problem is with the MySQL charset. I fixed my issues with this line of code.

mysql_set_charset('utf8',$link); 
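Note that the mysql_* extension was removed in PHP 7. A rough equivalent with mysqli might look like the sketch below. The helper name and its parameters are hypothetical placeholders, and utf8mb4 is swapped in for utf8 because MySQL's utf8 covers only characters of up to three bytes.

```php
<?php
// Hypothetical helper; host, credentials and database name are placeholders.
function openUtf8Connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }

    // Tell the server which encoding the client sends and expects back.
    if (!$link->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }

    return $link;
}
```

Calling set_charset() is generally preferred over running SET NAMES by hand, because the client library then also knows the charset when escaping strings.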
hawshy