How can we validate a CSV file?

I have a CSV file with this structure:

Date;Id;Shown
15-Mar-10;231;345
15-Mar-10;232;346
and so on, for approximately 80,000 rows.

How can I validate this CSV file before I start parsing it with fgetcsv?

A: 

You don't need to validate the input beforehand; `fgetcsv` will return FALSE if an error occurs.

Matthias Vance
+1  A: 

You don't want to be validating it if you're going to be reading it right after. Just read it in and catch any errors as you read.
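
A minimal sketch of that read-and-check loop, assuming the semicolon-separated, three-field layout from the question. The helper name `findCsvErrors` and its message format are made up for illustration; only `fopen`, `fgetcsv`, `count` and `fclose` are PHP built-ins.

```php
<?php
// Hypothetical helper: returns a list of "line N: problem" strings.
function findCsvErrors(string $path, int $expectedFields = 3): array
{
    $errors = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return ["cannot open $path"];
    }
    $lineNo = 0;
    // fgetcsv() returns false on error and at end of file.
    while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
        $lineNo++;
        // A blank line comes back as an array holding a single null.
        if ($row === [null]) {
            continue;
        }
        if (count($row) !== $expectedFields) {
            $errors[] = "line $lineNo: expected $expectedFields fields, got " . count($row);
        }
    }
    fclose($handle);
    return $errors;
}
```

An empty result means every line had the expected number of fields; it says nothing yet about what the fields contain.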

Ignacio Vazquez-Abrams
So what would the validation logic be to check whether there are any errors in the file?
Rachel
If `fgetcsv()` returns something you don't expect then the file has an error.
Ignacio Vazquez-Abrams
A: 

You could use a regular expression to find rows that match (and therefore flag the ones that don't). Have a look at this link. That being said, you'll need to read through the whole file in order to validate it, so you're probably better off just trying to parse it the first time through and catching any errors.
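
As a rough illustration of the regular expression idea, here is a sketch that assumes every data row looks like `DD-Mon-YY;digits;digits`. The constant `ROW_PATTERN` and the function `rowMatches` are names invented for this example. The pattern only checks the shape of a row, so something like `99-Xyz-10;1;2` would still pass.

```php
<?php
// Assumed row shape: DD-Mon-YY;digits;digits (e.g. 15-Mar-10;231;345).
const ROW_PATTERN = '/^\d{2}-[A-Z][a-z]{2}-\d{2};\d+;\d+$/';

function rowMatches(string $line): bool
{
    // rtrim() drops the line ending that fgets() and file() leave in place.
    return preg_match(ROW_PATTERN, rtrim($line, "\r\n")) === 1;
}
```

The header line `Date;Id;Shown` does not match this pattern, so it has to be skipped before the check.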

TLiebe
+1  A: 

I would not try to validate the file beforehand: I would rather go through it line by line, dealing with each line separately:

  • Reading one line
  • Verifying it's OK
  • using the data
  • and going to next line.


Now, what could "verifying it's OK" mean?

  • At least : make sure I can read the line as CSV, with my normal set of functions (maybe fgetcsv, maybe some other function specific to my project -- anyway, if I cannot read one line with my function that reads hundreds, it's probably because there's a problem on that line)
  • Then, check for the number of fields
  • then, for each field, check if it contains "valid" data
    • mandatory? optional?
    • numeric ?
    • string ?
    • date ?
    • and so on
  • then, for each field, some more careful checks
    • for instance, for a "code" field : does it correspond to a value that's legal for my application ?

If all that goes OK -- well, not much more to do, except use the data ;-)
And when you're done with one line, just repeat for the next one.
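
The per-line checks listed above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `checkRow`, the error messages, and the idea that the legal ids are passed in as an array (in practice they might come from a database query).

```php
<?php
// Hypothetical per-row check; returns a list of problems (empty = row is OK).
function checkRow(array $row, array $knownIds): array
{
    // Number of fields first; the other checks make no sense without it.
    if (count($row) !== 3) {
        return ['expected 3 fields, got ' . count($row)];
    }
    [$date, $id, $shown] = $row;
    $problems = [];
    if ($date === '') {                                  // mandatory field
        $problems[] = 'Date is empty';
    }
    if (!ctype_digit($id)) {                             // numeric?
        $problems[] = 'Id is not a non-negative integer';
    } elseif (!in_array((int) $id, $knownIds, true)) {   // legal for the application?
        $problems[] = "Id $id is unknown";
    }
    if (!ctype_digit($shown)) {
        $problems[] = 'Shown is not a non-negative integer';
    }
    return $problems;
}
```

Called once per row returned by `fgetcsv`, it lets the caller decide whether to use the row, skip it, or stop.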


Of course, if you want to either accept or reject a whole file before writing anything to a database (or anything like that), you'll have to:

  • parse the file, line by line, applying the "verifying" ideas
  • store the data of each line in memory
  • and, when the whole file has been read to memory,
    • either start using the data
    • or, if there's been an error on one line, reject everything.
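
A sketch of that all-or-nothing approach, assuming the file has one header line and uses `;` as the delimiter. The function name `loadAllOrNothing` and the caller-supplied `$isValid` callback are hypothetical; the callback stands in for whatever per-row verification you settle on.

```php
<?php
// Returns every data row if the whole file passes, or null if any line fails.
// $isValid is a caller-supplied callback that judges one row.
function loadAllOrNothing(string $path, callable $isValid): ?array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return null;
    }
    fgetcsv($handle, 0, ';', '"', '\\');      // skip the header line
    $rows = [];
    while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
        if (!$isValid($row)) {
            fclose($handle);
            return null;                      // one bad line rejects everything
        }
        $rows[] = $row;                       // keep in memory, write nothing yet
    }
    fclose($handle);
    return $rows;
}
```

Only when the function returns an array would you start writing to the database.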


In your specific case, you have three kinds of fields:

Date;Id;Shown
15-Mar-10;231;345
15-Mar-10;232;346

From what I can guess :

  • The first one must be a date
    • Using some regex to validate that will not be easy: months don't all have the same number of days, there are twelve months to handle, and the number of days in February depends on the year, ...
    • In such a case, I would probably try to parse the date with something like strtotime (not sure it's ok for the format you're using, though)
    • Or I would just explode the string
      • making sure there are three parts
      • that the third one is 2 digits
      • that the second one is one of Jan, Feb, Mar, ...
      • That the first one corresponds to the correct number of days, depending on the two others
  • The second one :
    • must be an integer
    • must be a valid value, that exists in your database ?
      • If so, a simple SQL query will allow you to check that
  • For the third one, not really sure...
    • I'm guessing it has to be an integer ?
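
The "explode the string" route for the date field might look like the sketch below. The function name `isValidDate` is invented for this example, and it assumes that a two-digit year means 20YY, which the question does not confirm. `checkdate` is a PHP built-in that takes month, day and year.

```php
<?php
// Assumption: two-digit years mean 20YY; adjust if the data says otherwise.
function isValidDate(string $value): bool
{
    $months = ['Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4, 'May' => 5, 'Jun' => 6,
               'Jul' => 7, 'Aug' => 8, 'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12];
    $parts = explode('-', $value);
    if (count($parts) !== 3) {                // three parts
        return false;
    }
    [$day, $month, $year] = $parts;
    if (!isset($months[$month])) {            // Jan, Feb, Mar, ...
        return false;
    }
    if (!ctype_digit($day) || !ctype_digit($year) || strlen($year) !== 2) {
        return false;                         // day is digits, year is 2 digits
    }
    // checkdate() knows month lengths and leap years.
    return checkdate($months[$month], (int) $day, 2000 + (int) $year);
}
```

With this check, `15-Mar-10` passes while `31-Feb-10` and `15-Mar-2010` are rejected.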
Pascal MARTIN
I am not sure how to use regular expressions to match the patterns. I assume that by verifying you mean that I need to check each value in the CSV file against a pattern: if it matches, it has been validated, and if not, it has not.
Rachel
What would the regex be for my example data above, and how can I learn about regex patterns for my other data samples? I am not sure about using regular expressions.
Rachel
Also, how can I verify that the data in the CSV file is OK?
Rachel
I've edited my answer to provide some additional information; hope this helps :-)
Pascal MARTIN
I am not sure which regex I need to use to parse through my file and validate the content, as mentioned in the question's comments. Can you provide some guidance on that?
Rachel
Not sure I would use a regular expression here; and, as I said, I would not try to validate the content of the file before actually reading it with `fgetcsv`: instead, I would validate each line on the fly, while going through the file.
Pascal MARTIN
Is it possible to validate a complete row of data instead of going line by line and doing validation?
Rachel
A: 

Expect the data you are reading is valid, and simply ignore any lines that seem invalid or are of an unexpected format.

CSV is used for data exchange or as data storage. So it's very likely that it was already “valid” when the file was generated. If you – for whatever reason – have a CSV file as user input (the only real source where invalid or unexpected data can come from), there is no problem with ignoring that data and telling the user about the invalid lines.
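
One way to sketch the "ignore and report" idea is shown below. It assumes three fields per line with integer second and third fields. The function name `partitionLines` and the `$firstLineNo` parameter are invented for this example; `str_getcsv` is the PHP built-in that parses a single CSV line.

```php
<?php
// Splits lines into usable rows and the line numbers that were skipped.
// $firstLineNo is the file line number of $lines[0] (2 if the header was dropped).
function partitionLines(array $lines, int $firstLineNo = 1): array
{
    $rows = [];
    $skipped = [];
    foreach (array_values($lines) as $index => $line) {
        $row = str_getcsv($line, ';', '"', '\\');
        if (count($row) === 3 && ctype_digit($row[1]) && ctype_digit($row[2])) {
            $rows[] = $row;
        } else {
            $skipped[] = $firstLineNo + $index;   // remembered for the user message
        }
    }
    return [$rows, $skipped];
}
```

The second element of the result can be turned into a message such as "lines 3 and 17 were ignored".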

poke
A: 

I wrote an open source Python tool, available from http://pypi.python.org/pypi/cutplace/, to simplify validation of such files.

The basic idea is that you describe the data format in a structured interface specification using OpenOffice.org, Excel or plain CSV. This is done in a few minutes and is legible enough to serve as documentation too. We use it to validate files with about 200,000 rows on a daily basis.

You can validate a CSV file using the command line:

cutplace specification.csv data.csv

In case invalid data rows are found, the exit code is 1. If you need more control, you can write a little Python script that imports the cutplace module and adds a listener for validation events.

As an example, here's a specification that would validate the sample data you provided, filling the gaps in your short description by making a few assumptions. (I'm writing the specification in CSV to inline it in this post. In practice I prefer OpenOffice.org's Calc and ODS because I can use more formatting and make it easier to read and maintain.)

,"Interface: Show statistics"
,
,"Data format"
"D","Format","CSV"
"D","Item delimiter",";"
"D","Header","1"
"D","Encoding","ASCII"
,
,"Fields"
,"Name","Example","Empty","Length","Type","Rule"
"F","date","15-Mar-10",,,"RegEx","\d\d-[A-Z][a-z][a-z]-\d\d"
"F","id","231",,,"Integer","0:"
"F","shown","345",,,"Integer","0:"
,
,"Checks"
,"Description","Type","Rule"
"C","id per date must be unique","IsUnique","date, id"

Lines starting with "D" describe the basic data format. In this case it is a CSV file using ";" as delimiter with 1 header line in ASCII encoding.

Lines starting with "F" describe the various fields. For example,

,"Name","Example","Empty","Length","Type","Rule"
"F","id","231",,,"Integer","0:"

defines a mandatory field "id" of type Integer with a value of 0 or greater. To allow the field to be empty, specify an "X" in the "Empty" column:

,"Name","Example","Empty","Length","Type","Rule"
"F","id","231","X",,"Integer","0:"

Finally, there is an optional section for more advanced checks spanning the whole file, not only single rows. For example, if each date in your file must provide data for an id only once, you can state this using:

,"Description","Type","Rule"
"C","id per date must be unique","IsUnique","date, id"
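
For readers doing the parsing in PHP rather than with cutplace, the same whole-file rule could be approximated by hand. This sketch is not part of cutplace; the function name `duplicateKeys` is invented, and it assumes the rows have already been read into arrays with the date in position 0 and the id in position 1.

```php
<?php
// Returns the "date;id" pairs that occur more than once.
function duplicateKeys(array $rows): array
{
    $seen = [];
    $duplicates = [];
    foreach ($rows as $row) {
        $key = $row[0] . ';' . $row[1];
        if (isset($seen[$key])) {
            $duplicates[$key] = true;   // array key, so each pair is listed once
        }
        $seen[$key] = true;
    }
    return array_keys($duplicates);
}
```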

Any row that starts with an empty column can contain any text you like and will not be processed during validation. This is useful for headings, comments and so on.