I have a French site that I want to parse, but I am running into problems converting the (UTF-8) HTML to Latin-1.

The problem is shown in the following phpunit test case:

class Test extends PHPUnit_Framework_TestCase {

    private static function fromHTML($str){
     return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    }

    public function test1(){

     $strFrom  = 'Wanted&nbsp;: les Chasseurs de Tamriel';
     $strTo  = 'Wanted : les Chasseurs de Tamriel';
     $strFrom = self::fromHTML($strFrom);
     $this->assertEquals($strTo, $strFrom);
    }

    public function test2(){
     $strFrom  = 'Remplacement d’Almalexia';
     $strTo   = 'Remplacement d’Almalexia';
     $strFrom = self::fromHTML($strFrom);
     $this->assertEquals($strTo, $strFrom);
    }

    }

test2 completes fine. test1 seems to fail because the space isn't correct, so when converted to ASCII it ends up as an unknown character (�).

How would I ensure both tests pass?

+2  A: 

Just a small suggestion: make sure that your .php file encoding is set to UTF-8. You'd be surprised how many people miss that.

PERR0_HUNTER
+2  A: 

test1 is not failing by mistake; its result is correct, because the strings you compare are not the same. “&nbsp;” is not decoded to a space (0x20). It’s a non-breaking space character (U+00A0) and as such gets decoded to 0xA0 in Latin-1, or to the two bytes 0xC2 0xA0 in UTF-8. When you change $strTo to contain that character before the colon, assertEquals will pass. Of course, you have to make sure that your file is saved with the UTF-8 encoding, just as PERR0_HUNTER mentioned, but seeing that you use the “’” character you are probably already doing that. :)
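
As a rough sketch of the two ways to get test1 passing: either expect the non-breaking space explicitly, or replace it with an ordinary space before comparing. The standalone function `fromHTML` mirrors the method from the question, while `normalizeSpaces` is a hypothetical helper introduced only for this illustration. The `"\xC2\xA0"` escape is the UTF-8 byte sequence for U+00A0, so the snippet does not depend on how the file itself is saved.

```php
<?php
// Decode HTML entities into a UTF-8 string, as in the question.
function fromHTML($str) {
    return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper (not part of the question's code):
// replace UTF-8 non-breaking spaces with ordinary spaces.
function normalizeSpaces($str) {
    return str_replace("\xC2\xA0", ' ', $str);
}

$decoded = fromHTML('Wanted&nbsp;: les Chasseurs de Tamriel');

// Option 1: the expected string contains the non-breaking space itself.
var_dump($decoded === "Wanted\xC2\xA0: les Chasseurs de Tamriel"); // bool(true)

// Option 2: normalize first, then compare against a plain space (0x20).
var_dump(normalizeSpaces($decoded) === 'Wanted : les Chasseurs de Tamriel'); // bool(true)
```

Which option fits depends on whether the non-breaking space matters for the parsed data; if it is only typographic, normalizing it is probably the simpler choice.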

Bombe