I have a form served in a non-UTF-8 encoding (it’s actually Windows-1251). People, of course, post any characters they like there. The browser helpfully converts the characters that cannot be represented in Windows-1251 to HTML entities so I can still recognise them. For example, if a user types a →, I receive &#8594;. That’s partially great: if I just echo it back, the browser will correctly display the → no matter what.

The problem is, I actually call htmlspecialchars() on the text before displaying it (it’s a PHP function that converts special characters to HTML entities, e.g. & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as the literal text &mdash; or &copy;, not as — and ©.

There’s no way for me to distinguish a typed &#8594; from a typed →, because I receive both as &#8594;. And since I run htmlspecialchars() on the text, and the browser also sends me &#8594; for a →, I echo back &amp;#8594;, which the browser displays as the text &#8594; instead of →. So the user’s input gets corrupted.

Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?

Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.

Thanks!

A: 

You check whether the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, handle them however you want. I would do this by reading the reference character by character (&, #, 8, 5, 9, 4) and parsing it into a value you can act on.

Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
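A minimal sketch of that parsing step, assuming PHP 7.2+ with the mbstring extension. The helper name decode_numeric_refs is made up for illustration. It converts decimal references such as &#8594; to UTF-8 characters and leaves named entities such as &copy; alone. Note that it cannot tell a reference the browser generated from one the user typed literally.

```php
<?php
// Hypothetical helper, not from the thread: decode decimal numeric
// character references into UTF-8 and leave everything else untouched.
function decode_numeric_refs(string $text): string {
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            // mb_chr() returns false for an invalid code point
            $char = mb_chr((int) $m[1], 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

// The numeric reference becomes an arrow; the named entity stays as text.
$result = decode_numeric_refs('a &#8594; b &copy;');
```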

I think this answers your question.

contagious
A: 

The html_entity_decode function is probably what you want.
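A short sketch of how that might look, assuming the decoded text is then escaped again and sent on a UTF-8 page (the variable names are illustrative only). Keep in mind that this also decodes references the user typed on purpose.

```php
<?php
// What the browser might have submitted from a Windows-1251 page
$received = '&#8594; &copy; &amp;';

// Decode numeric and named references into real UTF-8 characters
$decoded = html_entity_decode($received, ENT_QUOTES, 'UTF-8');

// Escape again before output; only the markup-significant "&" is re-encoded
$safe = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
```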

Ant P.
+1  A: 
<form action="action.php" method="get" accept-charset="UTF-8">
    <!-- some elements -->
</form>

All browsers should return the values in the encoding specified in accept-charset.

Georg
A: 

You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.

Or you first decode those existing character references.
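For illustration, here is the difference the double_encode parameter makes on a sample string (a sketch; the input is invented). With it set to false, existing references pass through unchanged, which also means a user who literally types &#8594; will see an arrow rather than the text they typed.

```php
<?php
// Sample input containing markup, an existing reference and a bare ampersand
$received = '<b>&#8594;</b> & done';

// Default behaviour: the existing reference is encoded a second time
$default = htmlspecialchars($received, ENT_QUOTES, 'Windows-1251');

// double_encode = false: the existing reference is left as it is
$noDouble = htmlspecialchars($received, ENT_QUOTES, 'Windows-1251', false);
```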

Gumbo
A: 

You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. In particular, mb_convert_encoding() can move the text from Windows-1251 to UTF-8, or to wherever you need.

I don't quite understand your question though, because if someone enters &amp; as a text string, when you do the htmlspecialchars() that should convert it to &amp;amp; ... which, when run back through html_entity_decode(), would come out as the text string the user entered.

This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()
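Both points can be sketched as follows, assuming the mbstring extension is loaded (the sample values are invented for illustration). The first part converts Windows-1251 bytes to UTF-8; the second shows the escape and decode round trip returning the text the user entered.

```php
<?php
// The word "Привет" encoded as Windows-1251 bytes
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Arguments are: string, target encoding, source encoding
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

// Round trip: escape what the user typed, then decode it again
$typed     = '&amp;';
$escaped   = htmlspecialchars($typed, ENT_QUOTES, 'UTF-8');
$roundTrip = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
```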

null
+3  A: 

The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities

Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#1041;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘Б’ character.

I actually do a htmlspecialchars () on the text before displaying it

Yes. You must do that, or else you've got a security problem.

Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself

Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.

I know that the good idea is to switch the whole software to UTF-8,

Yup. Well, at least the encoding of the page containing the form should be UTF-8.
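A rough sketch of what that could look like while the rest of the software stays in Windows-1251, assuming the mbstring extension. The helper names are hypothetical. Once the page itself is UTF-8, the browser submits real UTF-8 bytes, so an arrow character and a literally typed reference no longer look the same on the server.

```php
<?php
// Before any output, the page would declare its encoding, for example:
// header('Content-Type: text/html; charset=UTF-8');

// Hypothetical helper: text stored as Windows-1251, shown on a UTF-8 page
function escape_stored_text(string $stored1251): string {
    $utf8 = mb_convert_encoding($stored1251, 'UTF-8', 'Windows-1251');
    return htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: text submitted from the UTF-8 form, echoed back
function escape_submitted_text(string $submittedUtf8): string {
    return htmlspecialchars($submittedUtf8, ENT_QUOTES, 'UTF-8');
}

$arrow = escape_submitted_text("\u{2192}"); // user typed the arrow character
$typed = escape_submitted_text('&#8594;');  // user typed the reference as text
```

Here $arrow stays a real arrow and $typed becomes &amp;#8594;, so each is displayed the way the user entered it.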

bobince
Thank you very much!
Ilya Birman
Yep .. Even if you're using some inferior string representation internally, do present the page in UTF-8 and make the conversion.
troelskn
A: 
VolkerK
A: 

You won't be able to distinguish between the browser converting a code point to an entity and your users typing in an entity, because they look identical. The real solution is to give up on Windows-1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding, and all these problems should just go away.

staticsan