My application has a form where users can upload a CSV file. Its 5 internal users have always uploaded a valid file: comma-delimited, quoted, with records terminated by LF. The file is then imported into a database using PHP:

$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}

For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset. The problem: Windows-1250 is a single-byte, 8-bit character encoding, and some (not all!) characters with byte values above 127 are dropped. Example data:

"15","Ústav"
"420","Špičák"
"7","Tmaň"

becomes

Array (
  0 => 15
  1 => "stav"
)
Array (
  0 => 420
  1 => "pičák"
)
Array (
  0 => 7
  1 => "Tma"
)

Now the documentation for fgetcsv says that, as of 4.3.5, "fgetcsv() is now binary safe", but apparently it isn't. Am I doing something wrong, or is this function broken, so that I should look for a different way to parse CSV?

+2  A: 

It turns out that I didn't read the documentation well enough: fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (byte values 0 to 127), but the documentation also says:

Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function

In other words, fgetcsv() is binary-safe only in a limited sense: it also interprets the bytes according to the current locale's character set, so when that locale does not match the file's encoding, it will probably mangle the data it reads. Note that this setting is not configured in php.ini; it is read from the $LANG environment variable.
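
To see which locale the PHP process is actually running under, a small diagnostic along these lines may help. It is only a sketch: the variable names are made up for illustration, and the regular expression is a rough heuristic (an assumption, not something the documentation prescribes) for spotting a UTF-8 locale.

```php
<?php
// Diagnostic sketch: report the locale inputs that fgetcsv() is documented to consult.

// Passing '0' as the locale only queries the current setting; nothing is changed.
$ctype = setlocale(LC_CTYPE, '0');

// LANG as seen by the PHP process; getenv() returns false when it is unset.
$lang = getenv('LANG');

// Heuristic (assumption): treat any locale name mentioning UTF-8 as multi-byte.
$looksUtf8 = is_string($ctype) && preg_match('/utf-?8/i', $ctype) === 1;

printf("LC_CTYPE: %s\n", var_export($ctype, true));
printf("LANG:     %s\n", var_export($lang, true));
printf("UTF-8 locale active: %s\n", $looksUtf8 ? 'yes' : 'no');
```

If this reports a UTF-8 locale while the uploaded files are in a single-byte encoding, that is the mismatch the documentation note warns about.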

I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from a user comment in the docs to parse each line into an array:

$fhandle = fopen($uploaded_file,'r');
while (($raw_row = fgets($fhandle)) !== false) { // fgets is actually binary safe
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}
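
The csvstring_to_array() function itself is not reproduced here. As a rough illustration of what byte-wise parsing of one line looks like, here is a minimal sketch. The name parse_csv_line_bytes() and its parameters are hypothetical, it is not the function from the docs comment, and it assumes PHP 7+ and that no field contains a line break (reading line by line with fgets would split such a field).

```php
<?php
// Hypothetical helper (not from the PHP docs): split one CSV line into fields,
// treating the input strictly as bytes so the locale plays no role.
function parse_csv_line_bytes(string $line, string $sep = ',', string $quote = '"'): array
{
    $line = rtrim($line, "\r\n");   // drop the record terminator
    $fields = [];
    $field = '';
    $inQuotes = false;
    $len = strlen($line);           // length in bytes

    for ($i = 0; $i < $len; $i++) {
        $byte = $line[$i];
        if ($inQuotes) {
            if ($byte !== $quote) {
                $field .= $byte;    // ordinary byte inside a quoted field
            } elseif ($i + 1 < $len && $line[$i + 1] === $quote) {
                $field .= $quote;   // doubled quote stands for a literal quote
                $i++;
            } else {
                $inQuotes = false;  // closing quote
            }
        } elseif ($byte === $quote) {
            $inQuotes = true;       // opening quote
        } elseif ($byte === $sep) {
            $fields[] = $field;     // separator ends the current field
            $field = '';
        } else {
            $field .= $byte;        // ordinary byte in an unquoted field
        }
    }
    $fields[] = $field;             // last field has no trailing separator
    return $fields;
}

// Usage: "\xDA" is the Windows-1250 byte for "Ú"; it should survive untouched.
$row = parse_csv_line_bytes("\"15\",\"\xDAstav\"\n");
```

The fields come back still encoded in Windows-1250, so any conversion to the database's charset remains a separate step.
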
Piskvor
+1 nice catch !
Mike B