EDIT:

OK, I have some data (a ton of data) being pulled from a MySQL table; there is nothing special about how the data is entered. When I parse the data and display it in Firefox, this symbol � shows up. When I compare it to the DB entry, it looks like a space (nothing special). I'm using all the default PHP/MySQL settings.

Doing a var_dump or print_r is no help either.

Any thoughts?

The Symbol: �

UPDATE:

OK, I did find the character that is causing the problem: the em dash (U+2014).

Not to be confused with

-

(the hyphen).

A: 

A really vague question. Check your website's encoding, your database's data encoding, and so on.

EDIT: It IS an answer, because the flaw is a mismatch between the DB data encoding (probably UTF-8) and the webapp encoding (probably ISO-8859-1). So the solution is either:

1.) Back up and wipe out the DB, and then reload it with the proper encoding.

2.) Change the webapp's encoding so the chars are properly displayed.

Regards,

Alfabravo
Not an answer. You should comment.
Laykes
@edit well over 100000 records, and less than 1% are displaying this way. I just want to be able to validate against the symbol and remove it from the string, but nothing is working so far
Phill Pafford
You probably won't find it. As stated by Gumbo, it is a replacement character used by your browser to point out a problem with a char that it was unable to understand. So identify a specific record with the error and look it up in the DB. Check the way it is saved
Alfabravo
+1  A: 

It means a character that isn't available in the character set of the current font. You'll need to encode it with an HTML entity, once you've found where it's coming from.

Skilldrick
+1  A: 

That character means there is a codepoint that your browser does not know how to display. Somewhere you're setting a character value to something outside the normal printable character range, and your browser is telling you by displaying the standard 'unknown' character.

The only way to tackle the problem is to find the bug that put the invalid character into your string in the first place.

Billy ONeal
A: 

What are you talking about? Where have you seen this? If it's on the rendered page in the browser, then you might have saved the file with an improper encoding. Use a Unicode encoding such as UTF-8 when saving the page/source file.

Kangkan
+10  A: 

The character is the REPLACEMENT CHARACTER (U+FFFD). It is used when there is an error in a UTF-encoded byte sequence:

FFFD � REPLACEMENT CHARACTER

  • used to replace an incoming character whose value is unknown or unrepresentable in Unicode

In most cases it means that some data is being interpreted with a UTF encoding although the data is actually encoded with a different one.
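
As a rough illustration of that mismatch, here is a minimal Python sketch (Python is used only for demonstration; the thread itself is about PHP). It assumes the em dash was stored as a single Windows-1252 byte, which is a guess based on the MS Office origin mentioned in this thread. The variable names are hypothetical.

```python
# Assumption: the text came from MS Office and was stored as Windows-1252.
em_dash = "\u2014"
stored_bytes = em_dash.encode("cp1252")   # the single byte 0x97

# Reading that byte as UTF-8 fails: 0x97 is not a valid start byte,
# so a lenient decoder substitutes U+FFFD (the replacement character).
shown_as_utf8 = stored_bytes.decode("utf-8", errors="replace")

# Reading it with the encoding it was written in recovers the em dash.
shown_as_cp1252 = stored_bytes.decode("cp1252")

print(repr(stored_bytes))      # b'\x97'
print(repr(shown_as_utf8))     # '\ufffd'
print(repr(shown_as_cp1252))   # '\u2014'
```

A browser that is told the page is UTF-8 does essentially what the lenient decoder does here.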

Gumbo
great and thanks for identifying it, but how can I remove it? str_replace and preg_match don't work
Phill Pafford
@Phill Pafford: This character means that you’re having a problem with your character encoding. Fix that and your characters should be displayed properly.
Gumbo
@gumbo well over 100000 records, and less than 1% are displaying this way. I just want to be able to validate against the symbol and remove it from the string, but nothing is working so far. Ideas?
Phill Pafford
@Phill Pafford: Why don’t you fix the encoding issue? It’s obviously the data or the presentation of that data that’s causing this behavior.
Gumbo
Phill, you don't seem to understand that this character DOESN'T EXIST in the database or in the output sent to the browser. It's a symbol the browser is putting in because the encoding is invalid. Check that your database and your output are in the same encoding.
TRiG
Thanks a ton. Looks like it's the em dash from a copy-paste from MS Office into the text field where the user inserts the text.
Phill Pafford
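
The comments above explain why searching for the symbol itself finds nothing: the stored data contains a byte that is invalid in UTF-8, not U+FFFD. A hedged Python sketch of how the affected records could be located instead is below. The `rows` list and the helper name `is_valid_utf8` are hypothetical stand-ins for real table data and code.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# UTF-8 bytes of the replacement character, i.e. what a naive search looks for.
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")

# Hypothetical raw column values as they might sit in the table.
rows = [b"plain text", b"cost \x97 benefit", "caf\u00e9".encode("utf-8")]

# Searching for the symbol finds nothing, because it was never stored.
hits_for_symbol = [r for r in rows if REPLACEMENT_UTF8 in r]

# Checking for invalid UTF-8 finds the record that renders badly.
bad_rows = [r for r in rows if not is_valid_utf8(r)]

print(hits_for_symbol)  # []
print(bad_rows)         # [b'cost \x97 benefit']
```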
+1  A: 

This is a common problem when pasting text from Microsoft Office products into HTML or into a database. The largest offenders seem to be the em dash (as you found) and smart quotes. One solution I have found, when users insist on using a non-compliant text editor like that, is to have them paste the text into something like Notepad first, to strip the proprietary symbols.

Obviously the best solution is to simply not use Word for textual data that is intended for web display.
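
If asking users to go through Notepad is not practical, the same stripping can be approximated in code once the text has been decoded correctly. The following is only a sketch in Python; the table `OFFICE_PUNCTUATION` and the function `plain_punctuation` are hypothetical names, and the list of characters is a small sample, not a complete inventory of what Office produces.

```python
# Map common "smart" punctuation to plain ASCII equivalents.
OFFICE_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})

def plain_punctuation(text: str) -> str:
    """Replace typographic punctuation with ASCII look-alikes."""
    return text.translate(OFFICE_PUNCTUATION)

sample = "\u201cHello\u201d \u2014 it\u2019s fine\u2026"
print(plain_punctuation(sample))  # "Hello" - it's fine...
```

This loses typographic detail on purpose; keeping the characters and serving the page in a matching encoding is the less destructive option.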

Added just to provide some info to future readers.

Regards, Jc

JC
Thanks, this explains a lot
Phill Pafford
+1  A: 

You can look into iconv() and mb_* functions if you're just trying to sanitize the data.

The most likely cause, as observed elsewhere, is that you've got a problem with character encodings. PHP does not handle character encodings natively: its strings are byte arrays, which leaves encoding issues more or less up to the developer. (Native Unicode support was planned for PHP 6, which was never released.)

Make sure you're displaying the page in the same character encoding as your database, and make sure that you convert all user input into that same character encoding (iconv() and mb_detect_encoding() will help) before sticking it in the database.
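
The "convert everything to one encoding before storing it" step can be sketched as follows. This is Python for illustration only; in PHP, `mb_check_encoding()` and `iconv()` would play the corresponding roles. The function name `to_utf8` is hypothetical, and the Windows-1252 fallback is an assumption based on the MS Office origin of the data in this thread, not something that can be detected reliably.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the input re-encoded as UTF-8.

    Input that is already valid UTF-8 is passed through unchanged.
    Anything else is ASSUMED to be in the fallback encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few bytes are undefined in cp1252; replace them instead of failing.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")

# A Windows-1252 em dash (0x97) becomes a proper UTF-8 em dash.
print(to_utf8(b"cost \x97 benefit"))
# Valid UTF-8 input is left alone.
print(to_utf8("caf\u00e9".encode("utf-8")))
```

Trying strict UTF-8 first matters: text that is already UTF-8 would be corrupted if it were blindly decoded as Windows-1252.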

MightyE
Thanks this is interesting and will have to try this
Phill Pafford
A: 

Why not try a regex in JavaScript against what Gumbo identified as "... character � ... the REPLACEMENT CHARACTER (U+FFFD)" after the webpage has rendered? This way you will not have to mess with the DB (which you seem very reluctant to do), and whatever minor performance penalty there is gets offloaded to the client side.

hjhndr