I'm writing a PHP program that pulls from a database. Some of the varchar values contain quotes that display as black diamonds with a question mark in them (unknown characters, I assume from Microsoft Word text).

How can I use PHP to strip these characters out?

+2  A: 

That can be caused by a Unicode or other charset mismatch. Try changing the charset in your browser; with one of the settings the text will look OK. Then it's a question of how to convert your database contents to the charset you use for display. (Which can actually be as simple as adding a UTF-8 charset declaration to your output.)

che
+3  A: 

If you see that character (� U+FFFD "REPLACEMENT CHARACTER") it usually means that the text itself is encoded in some single-byte encoding but is being interpreted as one of the Unicode encodings (UTF-8 or UTF-16).

If it were the other way around it would (usually) look something like this: ä.
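Both directions of the mismatch can be reproduced in a few lines. This is only a sketch: the helper name is_valid_utf8 and the sample strings are made up for illustration, and it assumes the iconv extension is available.

```php
<?php
// Returns true when $bytes is a well-formed UTF-8 sequence.
// preg_match() with the /u modifier returns false on malformed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1: é is the single byte 0xE9
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8: é is the two bytes 0xC3 0xA9

// Single-byte text read as UTF-8 is malformed, so a browser shows U+FFFD.
var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)

// The other way around: UTF-8 bytes treated as ISO-8859-1.
// Each byte becomes its own character, so é turns into two characters.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($misread), "\n"; // 636166c383c2a9
```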

Probably the original encoding is ISO-8859-1, also known as Latin-1. You can check this without having to change your script: Browsers give you the option to re-interpret a page in a different encoding -- in Firefox use "View" -> "Character Encoding".

To make the browser use the correct encoding, add an HTTP header like this:

header("Content-Type: text/plain; charset=ISO-8859-1");

or put the encoding in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Alternatively you could try to read from the database in another encoding (UTF-8, preferably) or convert the text with iconv().
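A minimal sketch of the iconv() route, assuming the stored bytes really are ISO-8859-1 (check that first; if they are not, the conversion will quietly produce the wrong characters). The variable $fromDb stands in for a value read from the database.

```php
<?php
// Hypothetical value as it might come back from an ISO-8859-1 column:
// "Müller" with ü stored as the single byte 0xFC.
$fromDb = "M\xFCller";

// iconv(from, to, string) returns the converted string, or false on failure.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $fromDb);

if ($utf8 === false) {
    // Conversion failed; fall back to the raw bytes rather than losing data.
    $utf8 = $fromDb;
}

// Tell the browser which encoding the output now uses.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo bin2hex($utf8), "\n"; // 4dc3bc6c6c6572
```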

hop
So far this is the closest solution. However, I now have a meta tag, <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, and I'm using iconv to convert from iso-8859-1 to utf-8. The characters now show as boxes containing 0096 and 0092 in place of the special characters (' or -). Any other thoughts?
Yes, I have another thought: do some homework... you probably used the wrong source encoding. 0x92 and 0x96 are "curved single quote" and "dash" in windows-1252. Could that be the right one? Have you tried the browser trick?
hop
+8  A: 

This is a charset issue. As such, it can have gone wrong on many different levels, but most likely, the strings in your database are utf-8 encoded, and you are presenting them as iso-8859-1. Or the other way around.

The proper way to fix this problem is to get your character sets straight. The simplest strategy, since you're using PHP, is to use iso-8859-1 throughout your application. To do this, you must ensure that:

  • All PHP source files are saved as iso-8859-1 (not to be confused with cp-1252).
  • Your web server is configured to serve files with charset=iso-8859-1.
  • Alternatively, you can override the web server's settings from within the PHP document, using header.
  • In addition, you may insert a meta tag in your HTML that specifies the same thing, but this isn't strictly needed.
  • You may also specify the accept-charset attribute on your <form> elements.
  • Database tables are defined with encoding as latin1.
  • The database connection between PHP and the database is set to latin1.
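The header and connection items of that checklist might look roughly like this in code. It is a sketch under assumptions the answer does not state: a MySQL database accessed through the mysqli extension, and the names APP_HTTP_CHARSET, APP_DB_CHARSET, content_type_header and use_app_charset are all invented for the example.

```php
<?php
// Keep the charset names in one place so every layer agrees.
define('APP_HTTP_CHARSET', 'iso-8859-1'); // name used in HTTP headers and meta tags
define('APP_DB_CHARSET', 'latin1');       // MySQL's charset name (assumes MySQL)

// Builds the Content-Type header value for an HTML page.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Sets the connection charset. $db is a hypothetical, already-open
// mysqli connection; this function is not called in the sketch itself.
function use_app_charset(mysqli $db): bool
{
    return $db->set_charset(APP_DB_CHARSET);
}

// Override the web server's default from within the PHP document.
if (!headers_sent()) {
    header(content_type_header(APP_HTTP_CHARSET));
}
```

Calling use_app_charset($db) right after connecting, before any query runs, keeps the connection in step with the tables and the page.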

If you already have data in your database, you should be aware that they are probably messed up already. If you are not already in production phase, just wipe it all and start over. Otherwise you'll have to do some data cleanup.

A note on meta-tags, since everybody misunderstands what they are:

When a web server serves a file (an HTML document), it sends some information that isn't presented directly in the browser. This is known as HTTP headers. One such header is the Content-Type header, which specifies the MIME type of the file (e.g. text/html) as well as the encoding (aka charset). While most web servers will send a Content-Type header with charset info, it's optional. If it isn't present, the browser will instead interpret any meta tags with http-equiv="Content-Type". It's important to realise that the meta tag is only interpreted if the web server doesn't send the header. In practice this means that it's only used if the page is saved to disk and then opened from there.

This page has a very good explanation of these things.

troelskn
Cheers, this worked for me.
Frederico
A: 

Based on your description of the problem, the data in your database is almost certainly encoded as Windows-1252, and your page is almost certainly being served as ISO-8859-1. These two character sets are equivalent except that Windows-1252 assigns 27 printable characters, including left and right curly quotes, to the byte range 0x80-0x9F, which ISO-8859-1 reserves for control characters.

Assuming my analysis is correct, the simplest solution is to serve your page as Windows-1252. This will work because all characters that are in ISO-8859-1 are also in Windows-1252. In PHP you can change the encoding as follows:

header('Content-Type: text/html; charset=Windows-1252');

However, you really should check what character encoding you are using in your HTML files and the contents of your database, and take care to be consistent, or convert properly where this is not possible.
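If converting turns out to be the better route, the difference between the two source encodings shows up on the bytes 0x92 and 0x96 mentioned in an earlier comment. A small sketch, assuming the iconv extension and that the local iconv implementation accepts the name CP1252 for Windows-1252:

```php
<?php
// The two bytes in question: a curly quote and a dash in Windows-1252.
$raw = "\x92\x96";

// Read as ISO-8859-1 they become the control characters U+0092 and U+0096,
// which browsers tend to draw as boxes.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Read as Windows-1252 they become U+2019 (right single quotation mark)
// and U+2013 (en dash).
$asCp1252 = iconv('CP1252', 'UTF-8', $raw);

echo bin2hex($asLatin1), "\n"; // c292c296
echo bin2hex($asCp1252), "\n"; // e28099e28093
```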

Daniel Cassidy
The problem with this suggestion is that most likely the data is a mix of different charsets at this point. If you don't know exactly what went wrong, it just becomes even messier if you throw some random fixes in here and there.
troelskn
I agree. I edited my post somewhat to reflect that this solution isn't a substitute for knowing what you're doing. However, I've come to the conclusion that most developers are either incapable of understanding this issue, or just don't care. It seems to come up at least once a month where I work.
Daniel Cassidy
That's pretty much my observation too. For what I care, they reap as they sow. But you're probably right; Chances are that his data is indeed cp-1252 .. At least some of it is.
troelskn
A: 

You can also change the character set in your browser, just for debugging purposes.

powtac