I have a MySQL database table with a collation of 'utf8_general_ci' and the value in the field is:

x & #299; bán yá wén (without the spaces).

When this is converted (for example by StackOverflow's editor) it looks like this:

xī bán yá wén

where the second character looks like a lower case i with a bar over the top.

In PHP, what function converts the & #299 ; entity into the ī character?

I've tried using html_entity_decode($str,ENT_COMPAT,'UTF-8'), however I get characters like the following:

yÄ«n wén or zhÅ•ng wén

I'm pretty sure there's something I don't understand about the decoding, which is why I'm using the wrong function. Can anyone shed some light on how to get the single character glyph that's represented by the entity & #299 and similar high-number characters above 255?

Many thanks, AE

+1  A: 

UTF-8 is a multibyte encoding. As such if you look at it through a single-byte encoding such as Latin-1 you'll see something much like the results you're seeing. Set the document encoding to UTF-8 to see the actual character.
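
A minimal PHP sketch of this effect, assuming the mbstring extension is loaded: html_entity_decode() does produce the correct UTF-8 bytes for &#299;, and the garbled output only appears when those two bytes are read back one at a time as Latin-1. The variable names are illustrative only.

```php
<?php
// Decode the numeric character reference from the question into UTF-8.
$decoded = html_entity_decode('&#299;', ENT_COMPAT, 'UTF-8');

// U+012B (ī) is two bytes in UTF-8: 0xC4 0xAB.
echo bin2hex($decoded), "\n";

// Deliberately misread those two bytes as Latin-1 (one character per byte)
// and re-encode the result as UTF-8 so it can be displayed.
$misread = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";
```

This prints c4ab and then Ä«, the same pair of characters that shows up in "yÄ«n wén", which suggests the decoding step is working and the problem lies in how the bytes are interpreted afterwards.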

As for your first question, it's actually the browser that's decoding the character reference and printing the character, not PHP.

Ignacio Vazquez-Abrams
Hi Ignacio. Thank you for responding so quickly. The output is being loaded into a PDF. Up until this point, accented characters have been added to the database directly and have come out fine, but the entity above was added for Chinese. If I use mb_convert_encoding($str,"ISO-8859-1", "UTF-8"), the output and input are the same. I'm not sure if it's due to the conversion before going into the PDF or to how the PDF represents the characters. If you had that entity alone, how would you convert it into the character/glyph? Many thanks again =D
AE
@AE Sounds like the database encoding might have been switched (hopefully just locally) to a different encoding (possibly latin_1?) and you have lost the proper characters.
SeanJA
Unfortunately I have no experience working with character sets in PDF files, but the only way the input and output for that operation could be the same is if all characters were below 128, or if something went horribly, horribly wrong with charset declarations.
Ignacio Vazquez-Abrams
If the database is utf8, the tables are utf8, and the page is using the utf8 charset, you shouldn't have to convert the characters from whatever they are to utf8.
SeanJA
@SeanJA - I think the client actually added the characters in that way because the table and field are both set to 'utf8_general_ci' still? I think that the output would be the same because if I entered each character '' and use mb_convert_encoding() it's the same as running it on the string without spaces. I think all the actual characters in the field are less than 128 but the glyph we want it to be (ī) is above 128, which is what I'm not sure how to get. I think the PDF will output a character in the Font if it exists; we get the input str again though
AE
@SeanJA - The full example is "x&#299; bán yá wén" in the field, which is a mix of accented characters and encoded entities. If it was a utf8 encoded character, I could have a go at decoding it, but I'm trying to get back to the original character that this represents. html_entity_decode() would work for some, but this one's outside the range. I'm not sure how to get into the right charset to get a glyph for it, I think. @Ignacio - part of the previous comment was a reply to your comment - my apologies :)
AE
Example: $str="mèng jiā lāwén"; becomes "mng jiā lāwn" using mb_convert_encoding($str,"ISO-8859-1", "UTF-8")
AE
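
A hedged PHP sketch of what may be happening in that example. It assumes (the thread does not confirm this) that ā is stored in the field as the entity &#257; while è and é are stored as real UTF-8 characters, and that the mbstring extension is available.

```php
<?php
// Assumption: ā is stored as the entity &#257;, while è and é are
// stored as real UTF-8 characters. Save this file as UTF-8.
$str = 'mèng ji&#257; l&#257;wén';

$latin1 = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

// The entities are plain ASCII, so the conversion leaves them untouched.
var_dump(substr_count($latin1, '&#257;')); // int(2)

// è and é shrink from two UTF-8 bytes to one Latin-1 byte each.
var_dump(strlen($str) - strlen($latin1)); // int(2)

// The result is no longer valid UTF-8, so a UTF-8 viewer drops or
// garbles those two letters.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

Under that assumption, converting to ISO-8859-1 only changes the characters that were already fine and never touches the entities, which would match the "mng ... wn" result above.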
Doing more research I think I can sum up the problem with a better example. On http://www.public.asu.edu/~rjansen/glyph_encoding.html, the client inputs the character string on the left and we want to get out the character on the right for numbers above 128 and in a PDF generated through FPDF using the Arial font (if that makes any difference?).
AE
A: 

I suggest you read through this page: Unicode for the working PHP programmer. It is not long and it should get you over the hump and into confident Unicode and UTF-8.

Once you're OK with that stuff, check out the mbstring and intl PHP extensions, which are very handy. And know which string functions in PHP are and are not safe to use on multibyte strings. Here are the notes I made when I was transitioning a site to UTF-8, which include a list of naughty string functions.

fsb
Hi. Thank you for replying - the characters that I'm trying to convert are represented like &#x; where x is a number, so mbstring is taking in a non-mb string. Most of my Google searches give me results for the opposite: having a utf8 'character' and getting the code for it, or converting 'it' into latin1 or from latin1 to utf8. Given the 'code' for the character, how do we get the glyph? The site is already in utf8, so the glyphs show in a webpage because the browser converts it with the right charset, but for PDFs generated via fpdf, you can't send a UTF header, so we see the 'code' string
AE
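
One possible way to go from the 'code' to the character in PHP, as a sketch rather than a full fix. The helper name codePointToUtf8 is hypothetical (it is not part of PHP or FPDF), the int type hint needs PHP 7 or later, and mb_chr() needs mbstring on PHP 7.2 or later.

```php
<?php
// Hypothetical helper: code point number -> UTF-8 encoded string.
function codePointToUtf8(int $codePoint): string
{
    return html_entity_decode('&#' . $codePoint . ';', ENT_QUOTES, 'UTF-8');
}

$glyph = codePointToUtf8(299);
echo bin2hex($glyph), "\n"; // c4ab, the UTF-8 bytes of ī

// With mbstring on PHP 7.2 or later, mb_chr() does the same job directly.
var_dump(mb_chr(299, 'UTF-8') === $glyph); // bool(true)

// Decoding a whole field value in one go (file saved as UTF-8).
$field = 'x&#299; bán yá wén';
$text = html_entity_decode($field, ENT_QUOTES, 'UTF-8');
var_dump(mb_strlen($text, 'UTF-8')); // int(13)
```

This only yields the UTF-8 string. Whether the glyph then shows up in the PDF depends on the font and encoding used there: as far as I know, FPDF's standard fonts use the cp1252 encoding, which has no ī, so a font and encoding that actually contain the character would still be needed.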