I get text as user input, and somewhere in the text there are non-standard characters, like this.

The text is stored in a database. Everything is in UTF-8 and it works well, except that strange signs are displayed for the non-standard characters.

How can I filter these characters in PHP?


Edit:

I discovered that the text with the wrong characters is "correctly" stored in the database. When the text is shown on a static UTF-8 encoded HTML page, the broken characters are displayed. But when the text is loaded via AJAX, it crashes and the loading operation fails. So I think this is still an AJAX encoding problem.

A: 

These “strange characters” may originate either from a wrong character encoding (the user input is not UTF-8 encoded) or from missing glyphs in the font used to represent those characters.

So you should first find the real cause of these “strange characters”.

Gumbo
There are functions that can determine if a byte sequence is valid UTF-8. See for example http://docs.php.net/mb_detect_encoding or the `is_utf8` function in the comments or the regular expression described in http://www.w3.org/International/questions/qa-forms-utf-8
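As a rough illustration of the validity check described in this comment, here is a minimal sketch using `mb_check_encoding` from PHP's mbstring extension. Note that this is a different function from the `mb_detect_encoding` linked above, and the helper name `isValidUtf8` is hypothetical, not something from the linked pages.

```php
<?php
// Hypothetical helper (the name is an assumption): returns true if $text
// is a well-formed UTF-8 byte sequence. Requires the mbstring extension.
function isValidUtf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

// "\u{151}" is the character ő, written as an escape so the example does
// not depend on the encoding of this source file (needs PHP 7.0+).
var_dump(isValidUtf8("Paul Erd\u{151}s"));

// A lone 0xE9 byte is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
var_dump(isValidUtf8("caf\xE9"));
```

The first call should report `true` and the second `false`, which is one way to tell a wrongly encoded input apart from a merely undisplayable one.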
Gumbo
I've edited my question; I think I have located the problem. Thanks.
+1  A: 

Unicode (and encodings of it like UTF-8) contains far more characters than most operating systems can display, simply because a typical user has no need for every possible character.

This probably means that one of your users has input characters that they have on their system, but you don't have on yours; UTF-8 doesn't care what you can see, merely what it needs to store. As an example, if someone has a Hungarian name like Paul Erdős (note the double acute accent over the o), that character might not be available on all systems.

So, as another answer says, you might need to track down where those symbols are coming from in order to decide whether your clients really need to display them or whether you need to translate them to something else.
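If translating such characters turns out to be necessary, one possible approach is a small replacement table applied with `strtr`. This is only a sketch: the table `ASCII_FALLBACKS` and the function `toAsciiFallback` are hypothetical names, and the table covers just a few characters for illustration.

```php
<?php
// Hypothetical fallback table (an assumption for illustration, far from
// complete). Keys are written as escapes so the example does not depend
// on the encoding of this source file (needs PHP 7.0+).
const ASCII_FALLBACKS = [
    "\u{151}" => 'o', // ő
    "\u{150}" => 'O', // Ő
    "\u{E9}"  => 'e', // é
    "\u{E4}"  => 'a', // ä
];

// Replaces every character listed in the table; all other characters are
// left untouched.
function toAsciiFallback(string $text): string
{
    return strtr($text, ASCII_FALLBACKS);
}

echo toAsciiFallback("Paul Erd\u{151}s"), "\n"; // prints "Paul Erdos"
```

Whether such a lossy replacement is acceptable depends on the application; for names it is usually better to keep the original characters and fix the display side.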

Jeremy Smyth
A: 

Use the function:

$htmlEntitiesString = htmlentities($inputString);

It turns characters such as é, í and ä into HTML entities, so you don't get problems like 'é' turning into 'Á@' or something similar.
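A slightly more explicit variant of that call might look like the sketch below. Passing the encoding is an addition not present in the answer: it assumes the input really is UTF-8, as the question states, and older PHP versions defaulted to ISO-8859-1 here. The wrapper name `toHtmlEntities` is hypothetical.

```php
<?php
// Hypothetical wrapper (the name is an assumption).
function toHtmlEntities(string $inputString): string
{
    // ENT_QUOTES also converts single and double quotes; 'UTF-8' tells PHP
    // how the input bytes are encoded.
    return htmlentities($inputString, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

// "\u{E9} \u{ED} \u{E4}" is "é í ä" written with escapes (needs PHP 7.0+).
echo toHtmlEntities("\u{E9} \u{ED} \u{E4}"), "\n"; // prints "&eacute; &iacute; &auml;"
```

Characters without a named HTML 4.01 entity, such as ő, pass through unchanged with these flags, so the page still has to be served as UTF-8.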

fmsf
A: 

I use a set of functions in PHP to check, convert and mangle characters to UTF-8. I got these functions from somewhere on the net a long time ago so sadly can't take any credit for them, but hope they help.

PHP functions for converting characters around about UTF-8

Gav
A: 

What "kind" of AJAX do you use, and with which library if any? Do you load XML files or HTML files for displaying or only simple strings for div.innerHTML = myRequestetContent?

If you use XML, then you might run into two different problems here: a missing charset in the XML declaration (hence the wrong characters), and unescaped XML special characters such as &, < or > that make your XML invalid and can therefore break the AJAX functions.

The former can be fixed by adding the right character encoding to the declaration in the XML file, like <?xml version="1.0" encoding="UTF-8"?>, the latter by using htmlspecialchars in PHP.
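Both fixes could be combined roughly as follows. This is a sketch under assumptions: the helper name `buildXmlResponse` and the element name `response` are made up for illustration, and the `ENT_XML1` flag requires PHP 5.4 or later.

```php
<?php
// Hypothetical helper: wraps $text in a small UTF-8 XML document.
// The element name "response" is an assumption for illustration.
function buildXmlResponse(string $text): string
{
    // ENT_XML1 escapes &, <, > and quotes according to XML rules.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<response>' . $escaped . '</response>';
}

// In a real AJAX endpoint you would likely also send a matching HTTP
// header before any output, for example:
// header('Content-Type: text/xml; charset=utf-8');
echo buildXmlResponse('Tom & Jerry <3'), "\n";
```

The payload in the output reads `Tom &amp; Jerry &lt;3`, which an XML parser on the client side turns back into the original text.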

Residuum
A: 

You should definitely consider changing your AJAX response page to return the data as an XML-formatted result, with the text wrapped in CDATA. Then I am pretty sure you are home safe.

If you are unsure about what CDATA is, then take a look here: http://en.wikipedia.org/wiki/Cdata

Take a look at this for examples using PHP's XMLWriter object: http://php.net/xmlwriter_write_cdata
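For orientation, a minimal sketch with the object-oriented `XMLWriter` API might look like this. The helper name `buildCdataResponse` and the element name `response` are assumptions, and the handling of `]]>` is an addition not mentioned in the answer: a CDATA section cannot contain that sequence, so it is split across two sections.

```php
<?php
// Hypothetical helper: returns an XML document whose <response> element
// (an assumed name) carries $text inside a CDATA section.
// Requires the xmlwriter extension.
function buildCdataResponse(string $text): string
{
    // "]]>" would end the CDATA section early, so split that sequence
    // across two adjacent sections before writing.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('response');
    $writer->writeCdata($safe);
    $writer->endElement();
    $writer->endDocument();

    return $writer->outputMemory();
}

echo buildCdataResponse('a < b & c');
```

Note that CDATA only avoids escaping markup characters; the bytes still have to be valid UTF-8 for the document to parse.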

Preben