ansaurus

Question

Answer 1

+2 A:

I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.

You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.

David Dorward 2009-10-14 15:33:38

`htmlentities($post_title, ENT_COMPAT, 'utf-8');` should give you an equivalent HTML string that will be shown correctly in the browser. Because it may contain entities (and does in your example), you should not use that string anywhere HTML is not interpreted, such as plain-text emails.

Stefan Gehrig 2009-10-14 15:45:50
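A minimal sketch of the htmlentities() suggestion above, assuming the title arrives as UTF-8. The variable names are illustrative, and the em dash is written as raw bytes so the example does not depend on the encoding of the source file.

```php
<?php
// Assumption: $post_title holds valid UTF-8; "\xE2\x80\x94" is U+2014 EM DASH.
$post_title = "Blogging \xE2\x80\x94 does it pay the bills?";

// Tell htmlentities() the input encoding so it can recognise the em dash
// and replace it with its named HTML entity.
$html = htmlentities($post_title, ENT_COMPAT, 'UTF-8');

echo $html, "\n"; // Blogging &mdash; does it pay the bills?
```

The result here is plain ASCII, so it survives being served as ISO-8859-1. Characters without a named entity are left untouched by htmlentities(), so this alone is not a complete solution for arbitrary input.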

Gumbo 2009-10-14 17:36:17

@Gumbo That's my point.

David Dorward 2009-10-14 19:26:12

Answer 2

+1 A:

It's probably an em dash (U+2014). Turning it into a hyphen isn't an encoding conversion, because the hyphen is a different character. In other words, you have to search for such characters and replace them explicitly.

Better yet, just switch the website to UTF-8. It largely coincides with Latin-1 and is more appropriate for a website in 2009.

Reinis I. 2009-10-14 15:43:57
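The search-and-replace approach could look roughly like this. The function name and the replacement table are hypothetical and only cover a few common typographic characters; the sketch assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: swaps typographic punctuation for ASCII look-alikes,
// then re-encodes the rest of the string as ISO-8859-1.
function latin1_safe_title(string $utf8): string
{
    $replacements = [
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2026}" => '...', // horizontal ellipsis
    ];

    // strtr() works on bytes, which is safe here because a UTF-8 sequence
    // can never match in the middle of another character's sequence.
    $ascii_punctuation = strtr($utf8, $replacements);

    return mb_convert_encoding($ascii_punctuation, 'ISO-8859-1', 'UTF-8');
}

$title = latin1_safe_title("Blogging \u{2014} does it pay the bills?");
echo $title, "\n"; // Blogging - does it pay the bills?
```

Anything not listed in the table and not representable in ISO-8859-1 would still come out as a substitute character, so the table has to be extended to match the content actually being published.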

*sigh* I totally agree about switching to UTF-8, but the current encoding is a legacy that I've got to tolerate for now... :\

MatW 2009-10-14 15:56:33

To be pedantic, only half of Latin-1 has parity with UTF-8: the ASCII half, i.e., any character that fits in 7 bits. A lone Latin-1 byte in the range 0x80-0xFF is invalid in UTF-8, which encodes those characters as two-byte sequences. But I agree that converting to UTF-8 is a good idea if possible.

Peter Bailey 2009-10-14 17:00:29
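The difference between the two halves of Latin-1 can be checked directly. This is only an illustration with made-up variable names, and it assumes the mbstring extension is available.

```php
<?php
$ascii  = 'A';      // same single byte in ASCII, ISO-8859-1 and UTF-8
$latin1 = "\xE9";   // U+00E9 (e with acute) in ISO-8859-1: one byte
$utf8   = "\u{E9}"; // the same character in UTF-8: two bytes

echo bin2hex($ascii), "\n";  // 41
echo bin2hex($latin1), "\n"; // e9
echo bin2hex($utf8), "\n";   // c3a9

// A lone 0xE9 byte is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```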

Answer 3

+2 A:

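mb_convert_encoding does re-encode the string, but it cannot approximate characters that are missing from the target character set: anything ISO-8859-1 can't represent is replaced with a substitute character (? by default). To get a close equivalent instead, you need iconv with //TRANSLIT.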

mb_internal_encoding( 'UTF-8' );
ini_set( 'default_charset', 'ISO-8859-1' );

$post_title = 'Blogging — does it pay the bills?'; // an actual em dash, to best mimic your scenario; this file must be saved as UTF-8

echo iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $post_title );

Or, as others have said, just convert out-of-range characters to html entities.

Peter Bailey 2009-10-14 15:58:55
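One way to sketch the entity fallback: turn only the characters beyond U+00FF into numeric entities, then re-encode what is left as ISO-8859-1. The sample string and variable names are assumptions, and mbstring is required.

```php
<?php
// Assumption: the input is valid UTF-8.
$title = "Caf\u{E9} \u{2014} open";

// Conversion map: [first code point, last code point, offset, mask].
// Everything from U+0100 upwards has no ISO-8859-1 equivalent.
$convmap = [0x100, 0x10FFFF, 0, 0x1FFFFF];

// The em dash becomes &#8212;, while U+00E9 is left alone.
$with_entities = mb_encode_numericentity($title, $convmap, 'UTF-8');

// Every remaining character now fits in ISO-8859-1.
$latin1_html = mb_convert_encoding($with_entities, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1_html), "\n";
```

As with any entity-based output, the result is only meaningful where HTML is interpreted.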

Answer 4

+2 A:

I suspect the following:

Your file is actually encoded with UTF-8
Your editor interprets the file with Windows-1252

The reason is that your EM DASH character (U+2014) shows up as â€” (â, €, and a right double quotation mark). That’s exactly what you get when the UTF-8 byte sequence of that character (0xE2 0x80 0x94) is interpreted as Windows-1252 (0xE2=â, 0x80=€, 0x94=”). So you first need to fix your editor encoding.

And the reason for the ? in your output is that ISO 8859-1 doesn’t contain the EM DASH character.

Gumbo 2009-10-14 16:06:00
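Both effects described above can be reproduced in a few lines. This is a sketch with illustrative variable names, assuming the mbstring extension and its default substitute character.

```php
<?php
$em_dash_utf8 = "\xE2\x80\x94"; // U+2014 EM DASH encoded as UTF-8

// Treat the three bytes as Windows-1252 text and re-encode them as UTF-8,
// which is effectively what a misconfigured editor displays.
$misread = mb_convert_encoding($em_dash_utf8, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // U+00E2, U+20AC, U+201D

// ISO-8859-1 has no em dash, so the conversion falls back to the
// substitute character.
$as_latin1 = mb_convert_encoding($em_dash_utf8, 'ISO-8859-1', 'UTF-8');
echo $as_latin1, "\n"; // ?
```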

+1 for a great explanation of byte-sequences vs characters.

Peter Bailey 2009-10-14 17:15:45


utf-8 to iso-8859-1 encoding problem
