Hello,

I am trying to convert an ISO-8859-1 string taken from a MySQL database to UTF-8 using PHP. However, when I use the utf8_encode function, it removes almost all of the apostrophes from the string (the exceptions seem to be within HTML fields).

Thanks

A: 

One possibility is to use iconv. I have used it before and it works well.

http://php.net/manual/en/function.iconv.php

It has a //TRANSLIT option, appended to the target encoding name, which can approximate a character that has no exact equivalent in the target encoding.
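
As a rough sketch of what that option looks like in use (the sample string is made up, and the source encoding is an assumption here, so adjust it to whatever your data really is): appending //TRANSLIT to the target encoding asks iconv to substitute a look-alike instead of failing. The exact substitution depends on the iconv implementation and locale on your system, so no output is shown.

```php
<?php
// Hypothetical sample: "It's", with the apostrophe stored as byte 0x92
// (a curly quote in Windows-1252). Not taken from the question's data.
$somestring = "It\x92s";

// Ask iconv to approximate anything plain ASCII cannot represent.
// The result is implementation-dependent; iconv() returns false on failure.
$approx = @iconv('CP1252', 'ASCII//TRANSLIT', $somestring);

if ($approx === false) {
    echo "conversion failed\n";
} else {
    echo $approx, "\n";
}
```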

動靜能量
+3  A: 

Your ‘ISO-8859-1’ content is probably not actually ISO-8859-1.

When you say Content-Type: text/html; charset=iso-8859-1, browsers don't actually use ISO-8859-1, for annoying historical reasons. They really use Windows code page 1252 (Western European), which is very similar to ISO-8859-1, but not the same.

In particular, the bytes in the range 0x80-0x9F represent invisible and seldom-used control codes in ISO-8859-1. But cp1252 adds some typographical niceties and other extensions in this range, including the ‘smart quotes’. When you write an apostrophe in MS Word, it changes it to a single left-facing smart-quote (’, byte 0x92 in cp1252), so it's common to have encoding problems with text that was originally typed in Word and other Office apps.

To convert cp1252 to UTF-8 you would have to use iconv('cp1252', 'utf-8', $somestring) rather than utf8_encode, which is tied to ‘real’ ISO-8859-1.
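
To make the difference concrete, here is a small sketch comparing the two interpretations of the same bytes. The sample string is invented for illustration, and iconv with 'ISO-8859-1' as the source stands in for utf8_encode, on the assumption that both do the same strict Latin-1 mapping (utf8_encode is deprecated as of PHP 8.2).

```php
<?php
// Hypothetical sample: "It's", with the apostrophe stored as byte 0x92,
// which is what a Word-style curly quote looks like in cp1252.
$somestring = "It\x92s";

// Strict ISO-8859-1 reading: 0x92 is the C1 control character U+0092,
// which becomes the two UTF-8 bytes C2 92 (invisible in a browser).
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $somestring);

// cp1252 reading: 0x92 is U+2019 (right single quotation mark),
// which becomes the three UTF-8 bytes E2 80 99.
$asCp1252 = iconv('cp1252', 'utf-8', $somestring);

echo bin2hex($asLatin1), "\n"; // 4974c29273
echo bin2hex($asCp1252), "\n"; // 4974e2809973
```

The first result is why the apostrophes seem to vanish: the byte survives the conversion, but as a control character that nothing renders.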

bobince
I think it's fairer to say that browsers don't always use ISO-8859-1 (aka Latin-1). And if not, they don't necessarily use Windows code pages, esp. on non-Windows platforms.
StaxMan
Thanks, this worked.
@StaxMan: In the early days of the web, you're right, there was a mixture of incompatible behaviour. But today, current browsers all use cp1252 when ISO-8859-1 is specified. HTML5 [standardises](http://dev.w3.org/html5/spec/Overview.html#character-encodings-0) this and other nasty encoding substitutions. It's a shame that this ugly behaviour has become standard, and there's no way to specify “ISO-8859-1 and I mean it!”... but then we're all using UTF-8 so who cares, right? :-)
bobince
Oh? Seems like I learnt something new today then... that is very interesting (and yes, very messy!). Agreed on UTF-8 (yeah I know, not everyone is using it), I actually like that JSON decided that it's UTF-xx and nothing else.
StaxMan