I've got a CSV file that I'm parsing with PHP. (Actually, it's tab-separated.) In a text editor, the file looks like this:

Object Id   Page/Master Id Page/Master Name ...

Using this code:

$f = file_get_contents($filepath);
echo $f;

I get this in the browser:

��O�b�j�e�c�t� �I�d� �P�a�g�e�/�M�a�s�t�e�r� �I�d� �P�a�g�e�/�M�a�s�t�e�r� �N�a�m�e� ...

with all those question mark characters. If I use strlen() to count the characters, it reports twice as many as there should be. I suspect it has something to do with Unicode, but I'm not sure how to handle it.

Any ideas?

+5  A: 

I may be wrong, but this smells like a UTF-16 encoded file. Can you try

$f = iconv("utf-16", "utf-8", $f);

?

Pekka
The character spacing almost certainly indicates it is a Unicode file. UTF-16 is a very likely guess too.
Goyuix
In particular, it is UTF-16LE (little-endian) encoding, the UTF-16 variant Windows misleadingly describes as just “Unicode”. The two bytes at the start are a Byte Order Mark that will allow `utf-16`-with-unspecified-endianness to work by automatically detecting the little-endianness.
bobince
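
To make the BOM point concrete, here is a minimal sketch that checks the first two bytes and picks the endianness explicitly before converting. The helper name `utf16_to_utf8` and the in-memory sample string are made up for illustration, and falling back to little-endian when there is no BOM is only an assumption (reasonable for files exported on Windows); in the real script `$raw` would come from `file_get_contents($filepath)`.

```php
<?php
// Hypothetical helper (not from the thread): convert UTF-16 text to UTF-8,
// using the Byte Order Mark, if present, to decide the endianness.
function utf16_to_utf8(string $raw): string
{
    $bom = substr($raw, 0, 2);

    if ($bom === "\xFF\xFE") {
        // FF FE: little-endian BOM; strip it and decode as UTF-16LE.
        $encoding = 'UTF-16LE';
        $raw = substr($raw, 2);
    } elseif ($bom === "\xFE\xFF") {
        // FE FF: big-endian BOM; strip it and decode as UTF-16BE.
        $encoding = 'UTF-16BE';
        $raw = substr($raw, 2);
    } else {
        // Assumption: with no BOM, treat the data as little-endian.
        $encoding = 'UTF-16LE';
    }

    $utf8 = iconv($encoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Could not convert from $encoding to UTF-8");
    }
    return $utf8;
}

// Stand-in for the file contents: a little-endian BOM followed by a
// tab-separated header row encoded as UTF-16LE.
$raw = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', "Object Id\tPage/Master Id");

$text   = utf16_to_utf8($raw);
$fields = explode("\t", $text);
```

After the conversion, `strlen($text)` counts one byte per ASCII character instead of two, and splitting on tabs gives the expected columns.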