+2  A: 

This is wrong:

    $current = ord($value{$i});
    if (($current == 0x9) ||
        ($current == 0xA) ||
        ($current == 0xD) ||
        (($current >= 0x20) && ($current <= 0xD7FF)) ||
        (($current >= 0xE000) && ($current <= 0xFFFD)) ||
        (($current >= 0x10000) && ($current <= 0x10FFFF)))
    {
        if($current != 0x1F)
            $ret .= chr($current);
    }

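ord() never returns anything bigger than 0xFF, since it works byte by byte. The range checks above 0xFF can therefore never match, and every multi-byte UTF-8 character is treated as a series of unrelated bytes.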
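
To make the byte versus code point distinction concrete, here is a minimal sketch. It assumes PHP 7 or later and input that is already valid UTF-8; `stripInvalidXmlChars` is a hypothetical helper name, not part of the original code. It applies the same ranges as the loop above, but to whole code points through the `/u` modifier rather than to single bytes.

```php
<?php
// Hypothetical helper: drops every character outside the ranges that
// XML 1.0 allows. Assumes $value is valid UTF-8; preg_replace() returns
// null on malformed input, in which case an empty string is returned.
function stripInvalidXmlChars(string $value): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
    return $clean ?? '';
}

$euro = "\u{20AC}";       // the euro sign, three bytes in UTF-8
var_dump(strlen($euro));  // int(3)
var_dump(ord($euro));     // int(226): only the first byte, 0xE2

echo stripInvalidXmlChars("a\x01b"), "\n"; // prints "ab"
```

If the input may not be valid UTF-8, it would need to be converted or validated first, because the `/u` modifier rejects malformed sequences.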

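I'm guessing your XML is invalid because the file contains an invalid UTF-8 sequence or a character that XML does not allow (indeed &#65535;, i.e., U+FFFF, is a Unicode noncharacter and lies outside the range XML 1.0 permits, which ends at 0xFFFD). This probably comes from copying and pasting between XML files that have different encodings.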

I suggest you use the DOM extension to do your XML mash-up instead. It handles different encodings automatically by converting them internally to UTF-8.
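
As a rough illustration of the DOM approach, the sketch below builds a small feed-like fragment. The element names (`channel`, `item`, `title`) and the sample text are assumptions chosen for the example, not taken from the original code. The point is that text added through `createTextNode()` is escaped by the library, so nothing has to be sanitised by hand.

```php
<?php
// Build the document with DOM instead of concatenating strings.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$channel = $doc->appendChild($doc->createElement('channel'));
$item    = $channel->appendChild($doc->createElement('item'));
$title   = $item->appendChild($doc->createElement('title'));

// Special characters in the text are escaped when the document is serialised.
$title->appendChild($doc->createTextNode("Fish & chips, 50m\u{B2}"));

$xml = $doc->saveXML();
echo $xml;
```

For very large documents, XMLWriter follows the same idea while writing the output as a stream.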

Artefacto
+1 for suggesting DOM
Gordon
Good suggestion. I have inherited some code which generates the XML as a string; DOM would be a far cleaner way of doing this.
Macros
DOM is maybe overkill for producing something like an RSS feed: he probably doesn't need all the manipulation/search facilities, and for big documents the memory footprint of a DOM structure might be excessive
Iacopo
@Iacopo Overkill? In what regard? For manipulating XML, DOM is the best lib PHP has. If memory is an issue, there is XMLWriter. In both cases, the result is more reliable than using string concatenation or reinventing everything those libs already do on their own.
Gordon
A: 

You are trying to perform character transcoding. Don't do it yourself; use the PHP library.

I found iconv quite useful:

    $cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

This code translates from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.

I'm just guessing that the source encoding is UTF-8. You have to discover which encoding the incoming data is using and translate it into the one you are declaring in the XML header.
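
One way to approach the "discover, then translate" step inside PHP is sketched below. It assumes the mbstring extension is available, and both the helper name `toUtf8` and the candidate list are hypothetical. Detection is only a heuristic, and because ISO-8859-1 accepts any byte sequence it acts as a catch-all here, so the result should be treated as a guess rather than a guarantee.

```php
<?php
// Hypothetical helper: guess the encoding from a short candidate list
// (an assumption, adjust to the sources you actually receive) and
// convert the text to UTF-8.
function toUtf8(string $text, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode: an encoding is only reported if the bytes are valid in it.
    $from = mb_detect_encoding($text, $candidates, true);
    if ($from === false) {
        throw new RuntimeException('Could not detect the source encoding');
    }
    return $from === 'UTF-8'
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $from);
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
$converted = toUtf8("caf\xE9");
var_dump($converted === "caf\u{E9}"); // bool(true)
```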

A Linux command-line tool that guesses a file's encoding is enca.

Iacopo
I tried iconv with all combinations of encoding for both input and output, and it didn't work with any of them.
Macros
A: 

I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

    $content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
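
To show what that pattern actually strips, here is a small sketch with a made-up input string. Note that it removes every entity, named or numeric, including harmless ones such as `&amp;`, which is acceptable here only because the text is used purely for search engine purposes.

```php
<?php
// Sample input is an assumption: it contains a named entity, an escaped
// ampersand, a decimal reference and a hexadecimal reference.
$content = "Area: 50m&sup2; &amp; more &#178; &#xB2;";

// Same pattern as above: "&", an optional "#", letters/digits, then ";".
$stripped = preg_replace("/&#?[a-z0-9]+;/i", "", $content);

var_dump($stripped); // string(18) "Area: 50m  more  "
```

The surrounding spaces are left in place, so a follow-up whitespace cleanup may be wanted.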
Macros