I want to make sure, in PHP, that a CSV file uploaded by one of our clients is really a CSV file. I'm handling the upload itself just fine. I'm not worried about malicious users, but I am worried about the ones who will try to upload Excel workbooks instead. Unless I'm mistaken, an Excel workbook and a CSV can still be reported with the same MIME type, so checking that isn't good enough.

Is there one regular expression that can handle verifying a CSV file is really a CSV file? (I don't need parsing... that's what PHP's fgetcsv() is for.) I've seen several, but they are usually followed by comments like "it didn't work for case X."

Is there some other better way of handling this?

(I expect the CSV to hold first/last names, department names... nothing fancy.)

+2  A: 

You can write a RE that will give you a guess if the file is valid CSV or not - but perhaps a better approach would be to try and parse the file as if it was CSV (with your fgetcsv() call), and assume it's NOT a valid one if the call fails?

In other words, the best way to see if the file is a valid CSV file is to try and parse it as such, and assume that if you failed to parse, it wasn't a CSV!

zigdon
I am checking to see if fgetcsv() will return false on an Excel workbook when read in, but it won't -- not until EOF at least.
Guttsy
Writing a RE that rigorously handles even a single line of CSV is tricky, doubly so if fields are allowed to spill over multiple lines. The conclusion is correct, though - parse the CSV to ensure it is CSV.
Jonathan Leffler
+1  A: 

The easiest way is to try parsing the CSV and then attempt to read values from it. Parse it using str_getcsv; if you are able to read and validate at least a couple of values, then the CSV is valid.
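A minimal sketch of that idea, assuming each line should hold a first name, a last name, and a department. The function name `looks_like_name_row` and the three-column layout are hypothetical, chosen only to match the data described in the question:

```php
<?php
// Hypothetical helper: parse one line and check that it yields the
// expected number of non-empty, printable text fields.
function looks_like_name_row($line, $expectedFields = 3) {
    $fields = str_getcsv($line, ",", '"', "\\");
    if (count($fields) !== $expectedFields) {
        return false;
    }
    foreach ($fields as $field) {
        // Reject empty values and values containing control characters,
        // which plain names and department names should not contain.
        if ($field === null || trim($field) === ''
            || preg_match('/[\x00-\x08\x0E-\x1F]/', $field)) {
            return false;
        }
    }
    return true;
}
```

Calling it on the first few lines of the upload and rejecting the file if any of them fails is usually enough to catch a workbook that was uploaded by mistake.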

EDIT

If you don't have access to str_getcsv, use this, a drop-in replacement for str_getcsv from http://www.electrictoolbox.com/php-str-getcsv-function/:

if (!function_exists('str_getcsv')) {
    function str_getcsv($input, $delimiter = ",", $enclosure = '"', $escape = "\\") {
        $fp = fopen("php://memory", 'r+');
        fputs($fp, $input);
        rewind($fp);
        $data = fgetcsv($fp, null, $delimiter, $enclosure); // $escape only got added in 5.3.0
        fclose($fp);
        return $data;
    }
}
SimpleCoder
I'm fortunate enough to run PHP on IIS using the Web Platform Installer and only have version 5.2.something, not 5.3. I failed to mention that. That doesn't stop me from using fgetcsv() with a file handle one line at a time, though.
Guttsy
I ran into this same problem; see my edited post for the solution.
SimpleCoder
+2  A: 

Unlike other file formats, CSV has no tell-tale bytes in the file header. It starts straight away with the actual data.

I don't see any way except to actually parse it, and to count whether there is the expected number of columns in the result.

It may be enough to read as many characters as are needed to determine the first line (= until the first line break).
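That first-line check might be sketched as follows, assuming the expected column count is known in advance. The function name `first_line_has_columns` and its parameters are illustrative, and the fifth argument to fgetcsv() requires PHP 5.3 or later, so drop it on older versions:

```php
<?php
// Hypothetical helper: read only the first record of the file and
// compare its field count with the number of columns we expect.
function first_line_has_columns($path, $expectedColumns, $delimiter = ",") {
    $handle = fopen($path, "r");
    if ($handle === false) {
        return false;
    }
    // Length 0 means "no line-length limit".
    $fields = fgetcsv($handle, 0, $delimiter, '"', "\\");
    fclose($handle);
    // fgetcsv() returns false at EOF or on error, and an array holding
    // a single null for a blank line, so both cases fail the count check.
    return is_array($fields) && count($fields) === $expectedColumns;
}
```

An Excel workbook will normally fail this test because its first "line" is binary data that does not split into the expected number of fields.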

Pekka
This. Nearly all text is "valid CSV" of some form or another. To tell whether it's *meaningful* the best you can do is look for the right number of fields, correct headers, etc. Which means parsing.
hobbs
Erm, I guess when I said "parsing" I was thinking I wasn't going to rely on the regex to place anything that matched into variables.
Guttsy
@Guttsy no regular expression, anywhere, at all.
hobbs
A: 

Technically speaking, almost any text file could be a CSV file (barring quotes that don't match, etc.). You can try to guess whether it's a binary file, but there isn't a reliable way to do that unless your data only contains ASCII or something of the sort. If all you care about is that people don't upload Excel files by mistake, check the file extension.
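As a rough sketch of both ideas (the extension check plus a guess at whether the content is binary): .xlsx files are ZIP archives and begin with the bytes `PK\x03\x04`, while legacy .xls files begin with the OLE2 signature `D0 CF 11 E0 A1 B1 1A E1`. The function name `is_probably_not_csv` and its parameters are hypothetical, and the NUL-byte test assumes the CSV is ASCII or UTF-8 rather than UTF-16:

```php
<?php
// Hypothetical helper: returns true when an upload does not look like
// a CSV, judging by its original name and its first few bytes.
function is_probably_not_csv($originalName, $tmpPath) {
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    if ($extension !== 'csv') {
        return true; // Anything not named *.csv is treated as suspect.
    }
    // Read only the first 8 bytes of the uploaded file.
    $head = (string) file_get_contents($tmpPath, false, null, 0, 8);
    // .xlsx is a ZIP archive; legacy .xls is an OLE2 compound file.
    $isZip = strncmp($head, "PK\x03\x04", 4) === 0;
    $isOle = strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0;
    // A NUL byte this early is another hint that the content is binary
    // (assumes ASCII or UTF-8 text, not UTF-16).
    $hasNul = strpos($head, "\0") !== false;
    return $isZip || $isOle || $hasNul;
}
```

With a standard upload you would pass the `name` and `tmp_name` entries of the relevant `$_FILES` element.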

Nelson
Obvious solution is obvious... I like it. I'll just check the extension since I doubt any of our clients will bother cheating.
Guttsy
A: 

Any text file is a valid CSV file, so it is impossible to come up with a standard way of verifying its correctness: it depends on what you really expect it to contain.

Before you even start, you have to know what delimiter is used in that CSV file. After that, the easiest way to verify is to use the fgetcsv function. For example:

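<?php
$row = 0;         // Current row number, useful when reporting errors.
$priceColumn = 4; // Assumption: the price is the last of the five columns (the original index was undefined).
if (($handle = fopen("test.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $row++;
        $num = count($data); // Number of fields in a row.
        if ($num !== 5)
        {
            // OMG! Column count is not five!
        }
        else if (intval($data[$priceColumn]) == 0)
        {
            // OMG! Customer thinks we sold a car for $0!
        }
    }
    fclose($handle);
}
?>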
Vlad Lazarenko