I have some HTML data (which I can only read and have no control over) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" characters are stored as numeric character references (&#230; = æ). I need to convert these to the corresponding actual characters in PHP (or JavaScript, but I guess PHP is better here...). It seems that html_entity_decode() only handles the "other" kind of entities, the named ones, where &aelig; = æ. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart... So, any ideas? ;)

Cheers, Christofer

+3  A: 
&#NUMBER;

refers to the Unicode code point of that character.

so you could use some regex like:

/&#(\d+);/g

to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode equivalent char.

Then simply replace your regex match with the char.
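The match-and-replace idea above can be sketched in PHP roughly as follows. This is only an illustration, assuming the mbstring extension and PHP 7.2+ (for mb_chr()); the function name decode_decimal_entities is made up for this example. Note that PHP's preg functions take no g flag, since they replace all matches by default.

```php
<?php
// Hypothetical helper (not from the original answer): replaces decimal
// numeric entities such as &#230; with the matching UTF-8 character.
// Assumes the mbstring extension and PHP 7.2+ for mb_chr().
function decode_decimal_entities(string $html): string
{
    $decoded = preg_replace_callback(
        '/&#(\d{1,7});/',
        function (array $m): string {
            $char = mb_chr((int) $m[1], 'UTF-8');
            // Leave the entity untouched if the code point is invalid.
            return $char === false ? $m[0] : $char;
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error.
    return $decoded ?? $html;
}

echo decode_decimal_entities('Sm&#248;rrebr&#248;d og &#230;bler'), "\n";
// prints "Smørrebrød og æbler"
```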

Edit: Actually it looks like you can use this:

mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
Andrew Bullock
+1 for mb_convert_encoding.
Artefacto
A: 

On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:

  $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
  $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);

As someone noted in the comments, you should probably replace chr() with unichr() to deal with non-ASCII characters.
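For anyone reading this on a current PHP version: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7, and PHP has no built-in unichr(). A rough equivalent of the two replacements above, using preg_replace_callback() and mb_chr(), might look like the sketch below. It assumes the mbstring extension and PHP 7.2+; the name decode_numeric_entities is hypothetical.

```php
<?php
// Hypothetical helper (not from the manual): decodes both hex (&#xE6;)
// and decimal (&#230;) numeric entities to UTF-8 with a callback
// instead of the removed /e modifier.
function decode_numeric_entities(string $string): string
{
    $decoded = preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m): string {
            // Group 1 holds hex digits; it is '' when the decimal branch matched.
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            $char = mb_chr($codePoint, 'UTF-8');
            // Keep the original entity if the code point is not valid.
            return $char === false ? $m[0] : $char;
        },
        $string
    );

    // preg_replace_callback() returns null on a PCRE error.
    return $decoded ?? $string;
}

echo decode_numeric_entities('&#xE6; and &#230;'), "\n";
// prints "æ and æ"
```

Unlike chr(), which only produces a single byte, mb_chr() returns the full multi-byte UTF-8 sequence, so code points above 255 work as well.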

However, it looks like html_entity_decode() really should deal with numeric as well as named entities. Are you specifying an appropriate charset (e.g., UTF-8)?

ngroot
Yup it works! Seems I didn't read the manual thoroughly enough :P thanks!
cpak
A: 

I think html_entity_decode() should work just fine. What happens when you try:

echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
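To illustrate the point a bit further, here is a small sketch showing html_entity_decode() handling decimal, hexadecimal, and named forms of the same character in one call. The sample string and variable names are made up for this example; the explicit 'UTF-8' argument matters mostly on older PHP versions, where (as far as I know) the default charset was ISO-8859-1 before PHP 5.4.

```php
<?php
// Illustrative input: the same character written three different ways.
$html = 'Decimal &#230;, hex &#xE6;, named &aelig;';

// ENT_QUOTES also decodes &#039; and &quot;; ENT_HTML5 selects the HTML5 entity table.
$decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
// prints "Decimal æ, hex æ, named æ"
```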
Matt Gibson
Yup it works! Seems I didn't read the manual thoroughly enough :P thanks!
cpak
A: 

If you haven't got the luxury of having multibyte string functions installed, you can use something like this:

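<?php

    $string = 'Here is a special char &#230;';

    // Replace each decimal numeric entity with its UTF-8 encoding.
    // A closure is used because create_function() was removed in PHP 8.
    $list = preg_replace_callback('/&#([0-9]{1,7});/', function ($matches) {
        return decode((int) $matches[1]);
    }, $string);

    echo '<p>', $string, '</p>';
    echo '<p>', $list, '</p>';

    // Encode a Unicode code point as UTF-8 by hand, without mbstring.
    // (utf8_encode(chr($value)) only covered code points below 256.)
    // Note: no validation of surrogates or values above 0x10FFFF is done.
    function decode($cp)
    {
        if ($cp < 0x80) {
            return chr($cp);
        }
        if ($cp < 0x800) {
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        }
        if ($cp < 0x10000) {
            return chr(0xE0 | ($cp >> 12))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        }
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }

?>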
Nev Stokes