I have a character showing up occasionally that I can't find in the ASCII table. I'd like to filter the data before it's sent to the database, but I have to know what the character is first. Maybe someone can clue me in. It comes from a WYSIWYG editor I am using. The character appears very sporadically, but it seems to appear more often than not when I enter two \r or a backspace.

Here is the character

Â

OK, it was suggested that I change the content-type to UTF-8 in the head of the document, but I am still getting these characters in the database. Here is a test after I added the content-type:

adf af  aafd a a

aa a  afa a 

adf
+2  A: 

It is a "Latin Capital Letter A with Circumflex", HTML code &Acirc;, Unicode U+00C2

Wikipage: http://en.wikipedia.org/wiki/%C3%82

glasnt
Hi TomatoSandwich, thanks for this. It is a help. It looks as though I may have an encoding issue, though.
+9  A: 

It is highly likely that this character is related to UTF-8 encoding issues. Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is definitely recommended reading in this instance.

Filtering these characters out before sending to the database is almost certainly the wrong thing to do here.
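
A minimal sketch of why stripping is risky, assuming the stray characters arise because UTF-8 bytes were decoded as ISO-8859-1 somewhere on the way to the database (an assumption; the thread does not confirm where the mismatch happens). The helper names `strip_a_circumflex` and `undo_misdecoding` are hypothetical.

```python
def strip_a_circumflex(text: str) -> str:
    """Naive filter: delete every U+00C2 character."""
    return text.replace("\u00c2", "")


def undo_misdecoding(text: str) -> str:
    """Reverse a UTF-8-read-as-ISO-8859-1 mix-up.

    Re-encoding as ISO-8859-1 recovers the original bytes, which are then
    decoded as the UTF-8 they were all along. Raises UnicodeError if the
    text was not damaged in exactly this way.
    """
    return text.encode("iso-8859-1").decode("utf-8")


# "café", a no-break space, then "ok", as the text looks after the mix-up.
original = "caf\u00e9\u00a0ok"
damaged = original.encode("utf-8").decode("iso-8859-1")

filtered = strip_a_circumflex(damaged)  # still garbled: the é stays broken
repaired = undo_misdecoding(damaged)    # identical to the original text
```

The filter removes the visible Â but leaves the accented letter garbled, whereas reversing the mis-decoding restores both characters. This only applies if the data really was damaged in the assumed way.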

In the case that you mention, you are probably dealing with the character U+00A0, which is the Unicode no-break space. The bit pattern for this character is:

1010 0000

UTF-8 encodes code points from U+0080 to U+07FF as two bytes of the form

110x xxxx  10xx xxxx

where 'x' represents a bit of the Unicode character value, so U+00A0 is encoded as:

1100 0010  1010 0000

which is 0xC2 0xA0. Coincidentally, the second byte has the same value as the character you were encoding (U+00A0), while the first byte, 0xC2, is the Â you are seeing when the bytes are interpreted as ISO-8859-1.
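
The arithmetic above can be checked with a short sketch. `encode_two_byte` is a hypothetical helper that packs the bits by hand and only handles the two-byte range; the result is compared against Python's built-in encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack a code point in U+0080..U+07FF into the 110xxxxx 10xxxxxx form."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    first = 0b1100_0000 | (code_point >> 6)          # top 5 bits of the value
    second = 0b1000_0000 | (code_point & 0b11_1111)  # low 6 bits of the value
    return bytes([first, second])


nbsp = "\u00a0"
encoded = encode_two_byte(ord(nbsp))

# The hand-packed bytes match what the built-in UTF-8 encoder produces.
matches_builtin = encoded == nbsp.encode("utf-8")

# Reading those two bytes as ISO-8859-1 yields two characters instead of one:
# U+00C2 (the visible Â) followed by U+00A0 (the no-break space).
misread = encoded.decode("iso-8859-1")
```

Here `encoded` is the byte pair 0xC2 0xA0, and `misread` is a two-character string whose first character is the Â from the question.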

Greg Hewgill
You never know. I copied some code from some Stack Overflow answers only to find it laced with this character.
David Andres
Thanks Greg. I'm going to the link now.
Greg, these characters are peppered throughout my database. I need to clean them out if I can and then fix the issue.
Hey Greg, are you still here? I am looking over that document and am I correct in saying that all I need is the UTF content type in the head of the page?
That didn't work for me Greg. I'm still getting them in the database. Please have a look above in my OP.
A: 

I am the OP. I am not logged in anymore but I came back to share the solution. The issue was in fact an encoding problem. I added:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

After I did this, I noticed that I was still getting these funky characters in my database. I then changed the encoding on the database table and that did nothing either. That only left the browser... I checked the encoding in the browser and noticed that it was using ISO-8859-1. I changed the encoding on the browser to utf-8 and it is working fine now. :)

Thanks to everyone that contributed.

That is because an HTTP `Content-Type` header that names a charset takes precedence over `<meta>` tags. You need to send proper HTTP headers, using [`header`](http://www.php.net/header)
troelskn
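
The comment above refers to PHP's `header` function. As a rough illustration of the same idea in another language, here is a Python WSGI sketch (a swapped-in technique, not the PHP call itself) that declares the charset in the HTTP response header. The name `app` and the sample body are hypothetical.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares its charset in the HTTP header."""
    # The body is encoded with the same charset the header announces.
    body = "<html><body><p>caf\u00e9\u00a0ok</p></body></html>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For local experimentation this could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library.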
A: 

I think you are seeing a bug that I once experienced. ISO-8859-1 is, for printable characters, a subset of Windows-1252, the Western European Windows code page. The problem is that browsers gladly submit Windows-1252 characters when the web server accepts ISO-8859-1. That means the browser sends data that is invalid ISO-8859-1. That is what happened with my Windows installation, at least. I have seen this behaviour in both IE and Firefox.

I had the problem with a wysiwyg editor where the users would paste data in from a Word document. This document would contain both hyphens and dashes. One of the characters would get submitted fine. The other would be garbage because that character doesn't exist in ISO-8859-1 (I can never remember which is which).

The .NET Framework that we were using didn't help either, as it did not complain about an invalid ISO-8859-1 character when converting to Unicode.
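
A small sketch of the mismatch described above, using the en dash (U+2013) as an example of a character that exists in Windows-1252 but not in ISO-8859-1. Which Word character was actually affected is not stated, so the en dash is an assumption. The silent decode shown here is Python's behaviour and only resembles what is described for .NET.

```python
hyphen = "-"        # U+002D, present in both encodings
en_dash = "\u2013"  # present in Windows-1252 (byte 0x96), absent from ISO-8859-1

hyphen_bytes = hyphen.encode("cp1252")
dash_bytes = en_dash.encode("cp1252")

# ISO-8859-1 maps every byte to some code point, so decoding never fails;
# 0x96 silently becomes the invisible control character U+0096.
decoded_as_latin1 = dash_bytes.decode("iso-8859-1")

# Decoding with the encoding the browser really used gives the dash back.
decoded_as_cp1252 = dash_bytes.decode("cp1252")

# Encoding the en dash as ISO-8859-1, by contrast, is rejected outright.
try:
    en_dash.encode("iso-8859-1")
    dash_fits_latin1 = True
except UnicodeEncodeError:
    dash_fits_latin1 = False
```

The hyphen survives either way, while the dash turns into a control character without any error being raised, which matches the "one fine, one garbage" symptom.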

Pete