ansaurus

Question

Get non-UTF-8-form fields as UTF-8 in PHP?

Answer 1

A:

Check whether each character falls within a certain range. If it falls outside the range of standard UTF-8 characters, handle it however you need to. I would do this by reading the character reference one character at a time (&, #, 8, 5, 9, 4) and parsing it into a value you can work with.

Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
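For what it's worth, a roll-your-own pass over numeric references might look roughly like the sketch below. The function name decode_numeric_refs is made up for illustration, and it assumes the surrounding text is already UTF-8 (or plain ASCII) and that mb_chr() is available (PHP 7.2 or later with mbstring). Named references such as &copy; are deliberately left alone.

```php
<?php
// Hypothetical helper (not from the thread): convert numeric character
// references like &#8594; or &#x2192; into the UTF-8 characters they denote.
function decode_numeric_refs(string $text): string
{
    return preg_replace_callback(
        '/&#(?:x([0-9a-f]+)|([0-9]+));/i',
        function (array $m): string {
            // $m[1] holds hex digits when the reference is hexadecimal,
            // otherwise it is empty and $m[2] holds the decimal digits.
            $code = ($m[1] !== '') ? (int) hexdec($m[1]) : (int) $m[2];
            $char = mb_chr($code, 'UTF-8');
            // Leave the reference untouched if the code point is invalid.
            return ($char !== false) ? $char : $m[0];
        },
        $text
    );
}
```

Calling decode_numeric_refs('a &#8594; b') would turn the reference into the arrow character while leaving the rest of the string as it was.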

I think this answers your question.

contagious 2009-02-12 23:57:25

Answer 2

A:

The html_entity_decode function is probably what you want.
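A minimal sketch of that call, assuming the submitted value has already been converted to UTF-8 (the sample string is invented for illustration):

```php
<?php
// Assumption: $raw is the form field, already in UTF-8, with references
// the browser inserted for characters the page encoding could not hold.
$raw = 'caf&eacute; &#8594; &copy;';

// Decode named and numeric references into UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

Note that this also decodes any reference the user typed on purpose, so the result still has to go through htmlspecialchars() before it is displayed.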

Ant P. 2009-02-13 00:01:17

Answer 3

+1 A:

<form action="action.php" method="get" accept-charset="UTF-8">
    <!-- some elements -->
</form>

All browsers should return the values in the encoding specified in accept-charset.

Georg 2009-02-13 00:07:18

Answer 4

A:

You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.

Or you first decode those existing character references.
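As a rough illustration of the first option, here is how the two settings differ on an invented sample string:

```php
<?php
// Sample input that already contains character references.
$input = 'Tom &amp; Jerry &#8594; <b>';

// double_encode = false: existing references are left as they are.
$once = htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false);

// Default (double_encode = true): the & of each reference is escaped again.
$twice = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
```

Here $once keeps &amp; and &#8594; intact and only escapes the angle brackets, while $twice turns them into &amp;amp; and &amp;#8594;.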

Gumbo 2009-02-13 00:09:01

Answer 5

A:

You can convert the strings to UTF-8 using the PHP multibyte (mbstring) functions, in particular mb_convert_encoding(), which converts from Windows-1251 to UTF-8 or to any other supported encoding. From there you can do as you wish.
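Something along these lines, assuming the mbstring extension is loaded (the sample bytes are just the word "Привет" written out in Windows-1251):

```php
<?php
// "Привет" encoded as Windows-1251 bytes.
$legacy = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Arguments are: string, target encoding, source encoding.
$utf8 = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1251');
```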

I don't quite understand your question though, because if someone enters & as a text string, htmlspecialchars() should convert it to &amp;, which, when run back through html_entity_decode(), comes out as the text string the user entered.

This is, of course, assuming you haven't disabled the double_encode option when running your string through htmlspecialchars().

null 2009-02-13 01:01:41

Answer 6

+3 A:

The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities

Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.

I actually do a htmlspecialchars () on the text before displaying it

Yes. You must do that, or else you've got a security problem.

Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself

Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.

I know that the good idea is to switch the whole software to UTF-8,

Yup. Well, at least the encoding of the page containing the form should be UTF-8.
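If the rest of the software has to stay on Windows-1251 for now, one possible arrangement is to serve the page as UTF-8 and convert at the boundary. The sketch below assumes mbstring is available; the helper names utf8_to_internal and internal_to_html and the field name comment are invented for illustration.

```php
<?php
// Hypothetical helpers; the names are illustrative, not from the thread.

// Validate a submitted field and convert it from UTF-8 to the legacy
// internal encoding.
function utf8_to_internal(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return ''; // reject malformed input rather than guess its encoding
    }
    return mb_convert_encoding($value, 'Windows-1251', 'UTF-8');
}

// Convert stored legacy text back to UTF-8 and escape it for output
// in a page that is served as UTF-8.
function internal_to_html(string $value): string
{
    $utf8 = mb_convert_encoding($value, 'UTF-8', 'Windows-1251');
    return htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
}

// In the page script, before any output is sent:
// header('Content-Type: text/html; charset=UTF-8');
// $stored = utf8_to_internal($_POST['comment'] ?? '');
```

Characters that Windows-1251 cannot represent are still lost in utf8_to_internal() (mbstring replaces them with a substitute character, a question mark by default), which is one more reason to move the internal representation to UTF-8 eventually.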

bobince 2009-02-13 01:05:09

Thank you very much!

Ilya Birman 2009-02-13 08:22:54

Yep .. Even if you're using some inferior string representation internally, do present the page in UTF-8 and make the conversion.

troelskn 2009-02-13 11:58:36

Answer 7

A:

VolkerK 2009-02-13 01:31:20

Answer 8

A:

You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity, because they look identical. The real solution is to give up on Windows-1251. Instead, serve the webpage and form in UTF-8 and ask for UTF-8 encoding, and all these problems should just go away.

staticsan 2009-02-13 04:11:35
