I have a problem displaying the Unicode character U+009A.

It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.

Converting it to the entity "š" displays the character correctly, but I don't want to store entities in the database.

The encoding of the webpage is in UTF-8.

The character is URL-encoded as "%C2%9A".

Reproduce: # php -E 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test

This gives me <U+009A> in less or <9A> in vim.

A: 

If you’re using UTF-8 as your input encoding, you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes), which is independent of the input encoding. Or use utf8_encode("\x9A"), since the first 256 characters of Unicode and ISO 8859-1 are identical.

Gumbo
This gives me the same result as before: # php -E 'echo utf8_encode("\x9A"); echo "\n";' > /tmp/test ; less /tmp/test
hovenko
So that means what?
Gumbo
A: 

If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.

You get the same encoding from the output of utf8_encode("\x9A").

When I try to view Unicode char 0x9a, I get a square box too, so I suspect it's not the character you think it is. (Aha: as Azquelt has posted, the Unicode character "š" is U+0161, not U+009A.)
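
To take the terminal and the font out of the picture, the bytes can be printed as hex from PHP itself. This is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escape; the variable names are only illustrative.

```php
<?php
// Decode the URL-encoded value from the question and look at its raw bytes.
$decoded = urldecode("%C2%9A");
$hex = bin2hex($decoded);
echo $hex, "\n";                  // c29a: the UTF-8 encoding of U+009A

// For comparison, the UTF-8 bytes of U+0161 (the letter š).
$scaronHex = bin2hex("\u{0161}");
echo $scaronHex, "\n";            // c5a1
```

bin2hex prints the bytes in file order, so it also avoids the swapped pairs that od -t x2 shows on a little-endian machine.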

Paul Dixon
I get "9ac2", but I guess that has to do with big/little endian? I have tested this on Ubuntu i686 and on RedHat4 i686.
$ php -E 'echo urldecode("%C2%9A");' > /tmp/test ; od -t x2 /tmp/test
0000000 9ac2
0000002
hovenko
+1  A: 

The Unicode character "š" is U+0161, not U+009A.

I suspect that it's 0x9A in another character set.

The box with 009A is usually shown when you don't have a font installed with that character.
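
One way to check the guess about another character set is to treat 0x9A as a single byte in a candidate encoding and convert it to UTF-8. A small sketch, assuming the mbstring extension is available; Windows-1252 is only an assumed candidate here, chosen because it is a common source of such bytes.

```php
<?php
// Interpret the single byte 0x9A as Windows-1252 and convert it to UTF-8.
$utf8 = mb_convert_encoding("\x9A", 'UTF-8', 'Windows-1252');
$utf8Hex = bin2hex($utf8);
echo $utf8Hex, "\n";   // c5a1, the UTF-8 encoding of U+0161 (š)
```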

Azquelt
+1 for you sir, missed that
Paul Dixon
Maybe that makes more sense. The content was most likely copied from Word, with all the strange things that come with that... This gives me what I need: html_entity_decode("š", ENT_COMPAT, "UTF-8"); But there are more of these characters, so I would need a mapping or a way to convert between them.
hovenko
`š` is 0x9A in Windows-1252.
Gumbo
Thanks for the tip, Gumbo. The following code solved my problem (somehow I needed the utf8_decode first, but I guess the XML feed was UTF-8 encoded as well):
$_output = utf8_decode($_output);
$_output = mb_convert_encoding($_output, 'UTF-8', 'windows-1252');
hovenko
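
The two-step conversion from the last comment can be wrapped in a small helper. This is a sketch under assumptions: the function name fixMisdecodedCp1252 is made up for illustration, the mbstring extension is required, and utf8_decode is swapped for a close mb_convert_encoding equivalent because utf8_decode is deprecated as of PHP 8.2. It only makes sense for text known to have this problem, since characters outside ISO 8859-1 are lost in the first step.

```php
<?php
// Hypothetical helper: repairs text in which Windows-1252 bytes were
// misread as ISO 8859-1 and then encoded as UTF-8 (e.g. "\xC2\x9A" for š).
function fixMisdecodedCp1252(string $text): string
{
    // Step 1: undo the wrong UTF-8 encoding, giving back the original bytes.
    $bytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    // Step 2: read those bytes as Windows-1252 and encode them as UTF-8.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$fixed = fixMisdecodedCp1252(urldecode("%C2%9A"));
echo bin2hex($fixed), "\n";   // c5a1, the UTF-8 encoding of U+0161 (š)
```

Plain ASCII passes through unchanged, so the helper is harmless on text that contains none of the affected characters.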