Hi Guys,

I am parsing text/html from web pages into an XML feed. The text/html is encoded as ISO-8859-1, while the XML feed must be UTF-8. I have used htmlentities(), but I am still having to replace loads of characters manually. Here is what I have so far (it still does not parse all text):

$desc = str_replace(array("\n", "\r", "\r\n"),"",$desc);
$desc = str_replace(array("’","‘","”","“"),"'",$desc);
$desc = str_replace("£","£",$desc);
$desc = str_replace("é","é",$desc);
$desc = str_replace("²","2",$desc);
$desc = str_replace(array("-","•"),"‐",$desc);
$desc = htmlentities($desc, ENT_QUOTES, "UTF-8");
+5  A: 

Use iconv(). It will allow you to use native characters in UTF-8 as well - no need for HTML entities.

$data = iconv("ISO-8859-1", "UTF-8", $text);
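As a minimal sketch of how this could fit the XML feed from the question: convert first, then escape only the characters that XML itself reserves. The sample bytes and the `<description>` element name are assumptions for illustration, not taken from the original feed.

```php
<?php
// Assumed sample input: "Café £5" as raw ISO-8859-1 bytes.
$text = "Caf\xE9 \xA35";

// Convert the bytes to UTF-8. iconv() returns false on failure.
$data = iconv("ISO-8859-1", "UTF-8", $text);
if ($data === false) {
    throw new RuntimeException("Conversion to UTF-8 failed");
}

// Escape only the XML-special characters (&, <, >, quotes),
// so that é and £ stay as native UTF-8 characters.
$escaped = htmlspecialchars($data, ENT_QUOTES | ENT_XML1, "UTF-8");

// Hypothetical element name, for illustration only.
$xml = "<description>" . $escaped . "</description>";
echo $xml, "\n";
```

The likely reason htmlentities() causes trouble here is that it emits named entities such as `&eacute;`, which plain XML does not define; htmlspecialchars() with ENT_XML1 avoids that.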

When converting from UTF-8 to another character set, append //IGNORE or //TRANSLIT to the target charset to drop or transliterate characters that cannot be represented in it.
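A rough sketch of the two suffixes, going in the opposite direction (UTF-8 to ISO-8859-1). The sample string is an assumption, and the `"\u{...}"` escapes need PHP 7 or later. The exact result depends on the iconv implementation PHP was built against, and some builds return false for //IGNORE, so no output is shown here.

```php
<?php
// Assumed sample input: UTF-8 text containing a euro sign and an en dash,
// neither of which exists in ISO-8859-1, plus an é, which does.
$utf8 = "Price: 5\u{20AC} \u{2013} caf\u{00E9}";

// //TRANSLIT asks iconv to substitute a similar-looking character where it can.
// The @ only silences the notice that some builds raise for unconvertible input.
$translit = @iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8);

// //IGNORE asks iconv to drop characters that have no equivalent.
$ignored = @iconv("UTF-8", "ISO-8859-1//IGNORE", $utf8);

// iconv() returns false on failure, so check before using either result.
foreach (array("TRANSLIT" => $translit, "IGNORE" => $ignored) as $mode => $result) {
    if ($result === false) {
        echo $mode, ": conversion failed on this platform\n";
    } else {
        // Print as hex because the result is ISO-8859-1, not UTF-8.
        echo $mode, ": ", bin2hex($result), "\n";
    }
}
```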

Alternatively, the mb_* functions shown by @Gumbo will work as well.

Pekka
+1, possibly add `//TRANSLIT` to prevent characters that can't be represented in ISO-8859-1 from breaking the string.
Wrikken
@Wrikken good point, added.
Pekka
Um, the character set of ISO 8859-1 is a subset of the Unicode character set. So there is no need to ignore or transliterate anything because there is no difference: charset(ISO 8859-1) \ charset(Unicode) = ∅.
Gumbo
@Gumbo of course, I wasn't thinking. Fixed, cheers
Pekka
Don't forget to *also* modify any `META` tag that gives the charset, since it will probably be inaccurate afterwards.
Ignacio Vazquez-Abrams
@Ignacio Vazquez-Abrams: An XML feed probably doesn’t have a `META` element – at least not those I know of.
Gumbo
Brilliant, thanks Pekka
Liam Bailey
+1  A: 

You can also use utf8_encode or mb_convert_encoding:

$desc = utf8_encode($desc);
// OR
$desc = mb_convert_encoding($desc, 'UTF-8', 'ISO-8859-1');

Both will convert the encoding from ISO 8859-1 to UTF-8.
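A small check, assuming both the mbstring and iconv extensions are installed, that mb_convert_encoding() and iconv() produce the same UTF-8 bytes for ISO-8859-1 input. The sample bytes are made up for illustration.

```php
<?php
// Assumed sample input: "é £ ²" as raw ISO-8859-1 bytes.
$desc = "\xE9 \xA3 \xB2";

// mb_convert_encoding() takes (string, target encoding, source encoding).
$viaMb = mb_convert_encoding($desc, 'UTF-8', 'ISO-8859-1');

// iconv() takes (source encoding, target encoding, string).
$viaIconv = iconv('ISO-8859-1', 'UTF-8', $desc);

// Both conversions should agree, and the result should be valid UTF-8.
var_dump($viaMb === $viaIconv);
var_dump(mb_check_encoding($viaMb, 'UTF-8'));
```

Note that utf8_encode() is deprecated as of PHP 8.2, so on current PHP versions mb_convert_encoding() or iconv() is probably the safer choice.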

Gumbo