So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web server, and we use rawurlencode for this. This works fine with almost every character I've tried, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.

The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user fails to enter another bit of critical information (the checks are done on the backend, so the webpage refreshes and tries to refill the form boxes with what the user had entered), then when the %C2 is run through rawurldecode/rawurlencode again, it becomes Ã? - aka %C3?. And of course the "£" is also turned into another £!

So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?

Thanks for any help.

+1  A: 

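The pound sign is probably byte A3 in your native single-byte character set (ISO-8859-1 or Windows-1252, for instance) and is arriving as C2 A3, which is the valid UTF-8 encoding of that same character. Either consume the decoded URL as UTF-8, or convert the string to your single-byte encoding before passing it to rawurlencode; urlencode and rawurlencode themselves take no encoding argument and simply percent-encode whatever bytes they are given.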
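To make the byte-level difference concrete, here is a minimal sketch. The variable names are illustrative only, and the pound sign is written as hex escapes so that the encoding of the source file itself does not affect the result.

```php
<?php
// The pound sign as raw bytes in two encodings.
$poundLatin1 = "\xA3";      // ISO-8859-1 / Windows-1252: one byte
$poundUtf8   = "\xC2\xA3";  // UTF-8: two bytes

// rawurlencode() percent-encodes bytes; it has no notion of characters.
$encodedLatin1 = rawurlencode($poundLatin1);
$encodedUtf8   = rawurlencode($poundUtf8);

echo $encodedLatin1, "\n"; // %A3
echo $encodedUtf8, "\n";   // %C2%A3

// Decoding hands back exactly the bytes that were encoded.
$roundTrip = rawurldecode($encodedUtf8);
echo bin2hex($roundTrip), "\n"; // c2a3
```

So %C2%A3 is not a rawurlencode bug: it is what you get when the browser submitted the field as UTF-8.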

Artefacto's answer covers the case where you need to convert character encodings, for example when you are displaying a page whose encoding is set to Latin-1. (Raw)urlencode percent-encodes byte by byte, so a UTF-8 string produces escaped multi-byte sequences. (Raw)urldecode returns those same bytes, so a form submitted as UTF-8 decodes to a UTF-8 string in which £ is two bytes. If you display this string while claiming it is ISO-8859 encoded, it will appear as two characters.

A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/

Likely, between getting the string from rawurldecode and using it, the encoding is assumed to be ISO-8859, so two bytes that represent one character get interpreted as two characters.

Use mb_convert_encoding, naming UTF-8 explicitly as the source encoding, to convert the string to the encoding the rest of your code expects.
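As a rough sketch of both the fix and the failure mode, assuming the mbstring extension is loaded (variable names are illustrative):

```php
<?php
$utf8Pound = "\xC2\xA3"; // "£" as submitted by a UTF-8 form

// Intended conversion: state that the source is UTF-8.
$latin1Pound = mb_convert_encoding($utf8Pound, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1Pound), "\n"; // a3

// The failure mode: the same bytes are wrongly treated as ISO-8859-1
// and converted to UTF-8, so each byte becomes its own character.
$mojibake = mb_convert_encoding($utf8Pound, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mojibake), "\n";      // c382c2a3
echo rawurlencode($mojibake), "\n"; // %C3%82%C2%A3
```

The second half reproduces the kind of %C3... output described in the question: a string that was already UTF-8 got encoded to UTF-8 a second time somewhere in the round trip.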

maxwellb
Is there a way to tell PHP's urlencode (or far better, rawurlencode, since urlencode is outdated) to use a different ANSI encoding? I didn't see a way on the manual page for either function.
Stephen
+2  A: 

The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.

Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:

iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
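A slightly fuller sketch of that call follows. The helper name utf8_to_latin9 is hypothetical, and //TRANSLIT behaviour depends on the iconv implementation PHP was built against, so treat this as an illustration rather than a drop-in.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to ISO-8859-15.
function utf8_to_latin9(string $input): string
{
    // iconv() returns false (and raises a notice, silenced here)
    // when the input is not valid UTF-8.
    $converted = @iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $input);
    if ($converted === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $converted;
}

// Decode the submitted value first, then convert the resulting bytes.
$field = utf8_to_latin9(rawurldecode('%C2%A3100'));
echo bin2hex($field), "\n"; // a3313030
```

Converting once, right after decoding, keeps the rest of the code working in a single known encoding.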
Artefacto
So, which is the better method - to change the form's character set (note that the attribute I found on w3schools was accept-charset, not charset) - or to use iconv in the code? I read that IE apparently doesn't work properly with accept-charset, so is it better to convert server-side from UTF-8?
Stephen
@Stephen You're right, it's "accept-charset". I'd say it would be better to do it server-side, because the standard doesn't guarantee you won't get UTF-8 anyway. Better yet, use UTF-8 all the time, including to store data in the database. IMO, all new web applications ought to go in that direction.
Artefacto
@Stephen Note also that, despite this being the standard, there are some implementation issues. In particular, some browsers use the encoding of the page to determine the encoding of the submission, despite the presence of "accept-charset". See http://stackoverflow.com/questions/153527
Artefacto
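For the "use UTF-8 all the time" suggestion in the comments, refilling a form box might look roughly like the sketch below. The function name form_value is hypothetical, and it assumes the page itself is served as UTF-8, for example via a Content-Type header with charset=UTF-8 sent before any output.

```php
<?php
// Hypothetical helper: escape a submitted value for use inside value="...".
// The charset argument tells htmlspecialchars() the input is UTF-8, so the
// two-byte pound sign passes through untouched instead of being mangled.
function form_value(string $submitted): string
{
    return htmlspecialchars($submitted, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// A submitted field containing: £ "5"
$refill = form_value(rawurldecode('%C2%A3%20%225%22'));
echo '<input type="text" name="price" value="', $refill, '">', "\n";
```

With the page, the form submission, and the escaping all agreeing on UTF-8, no conversion step is needed and the value survives repeated refills unchanged.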