I've got a strange problem in PHP land. Here's a stripped-down example:

    $handle = fopen("file.txt", "r");
    while (($line = fgets($handle)) !== FALSE) {
        echo $line;
    }
    fclose($handle);

As an example, if I have a file that looks like this:

Lucien Frégis

Then the above code, run from the command line, outputs the same name, but instead of an e acute I get:

Lucien FrÚgis

Looking at a hex dump of the file, I see that the byte in question is E9, which is what I would expect for e acute in PHP's default encoding (ISO-8859-1), confirmed by outputting the current value of default_charset.

Any thoughts?

EDIT:

As suggested, I've checked the Windows code page, and apparently it's 850, which is obsolete (but does explain why 0xE9 is being displayed the way it is...)

A: 

The accent might be considered Unicode data, and you will have to store it as such. Take a look at the utf8_decode, utf8_encode, and iconv functions.

No wait, it is in the ISO 8859-1 charset. I don't know. Have you tried reading in binary mode or using file_get_contents?

St. John Johnson
+2  A: 

0xE9 is the encoding for é in ISO-8859-1. It's also the Unicode code point for the same character (U+00E9). If your console interprets output in a different encoding (such as CP850), then the same byte will translate to a different code point, thus displaying a different character on screen. If you look at the code page for CP850, you can see that the byte 0xE9 translates to Ú (Unicode code point U+00DA). So basically your console interprets the bytes wrongly. I'm not sure how, but you should change the charset of your console to ISO-8859-1.
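To make the point above concrete, here is a minimal sketch that decodes the same byte under both code pages and prints the resulting UTF-8 bytes in hex. It assumes PHP's iconv extension is available and that the underlying iconv library knows the name `CP850`; the variable names are only illustrative.

```php
<?php
// The single byte in question, as stored in file.txt.
$byte = "\xE9";

// Decode the same byte under two different code pages, producing UTF-8.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // é (U+00E9)
$asCp850  = iconv('CP850', 'UTF-8', $byte);      // Ú (U+00DA)

// Print the UTF-8 bytes in hex so the result does not depend on the console.
echo bin2hex($asLatin1), "\n"; // c3a9
echo bin2hex($asCp850), "\n";  // c39a
```

The file's bytes are identical in both cases; only the table used to interpret them differs, which is why the console shows Ú.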

troelskn
A: 

Before running your PHP script on the command line, try executing the command:

    chcp 1252

This will change the console code page to Windows-1252, in which the byte 0xE9 is é, so the accented characters are displayed as you expect.

See the following links for the difference between the 850 and 1252 codepages:

http://en.wikipedia.org/wiki/Code_page_850

http://en.wikipedia.org/wiki/Windows-1252
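If changing the console code page is not convenient, a possible alternative is to re-encode each line for the console before echoing it. The sketch below is only an illustration: `toConsole` is a hypothetical helper, and it assumes the file is ISO-8859-1, the console is using code page 850, and the iconv extension recognises the name `CP850`.

```php
<?php
// Hypothetical helper: re-encode ISO-8859-1 text for the console's code page.
function toConsole(string $line, string $consoleCodePage = 'CP850'): string
{
    $converted = iconv('ISO-8859-1', $consoleCodePage, $line);
    // Fall back to the original bytes if the conversion fails.
    return $converted === false ? $line : $converted;
}

// Same loop as in the question, with each line converted before output.
$path = 'file.txt';
if (is_readable($path)) {
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        echo toConsole($line);
    }
    fclose($handle);
}
```

In code page 850 the character é is the byte 0x82, so the helper turns 0xE9 into 0x82 and the console then shows the expected letter.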

PHPexperts.ca