ansaurus

Question

Answer 1

A:

Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.

Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.

Something like...

/"[a-zA-Z' -]+"/

...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.

Amber 2009-09-07 05:28:10

This won't work for people with non-ASCII characters in their names, like my friend Zoë.

Dominic Rodger 2009-09-07 05:45:02

Correct, which is why you have to consider very carefully what characters you allow based on your target data.

Amber 2009-09-07 06:30:59

@Dav - fair enough :-)

Dominic Rodger 2009-09-07 14:59:56

Answer 2

+2 A:

I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).

Python would be an excellent tool for the job.

I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.

import csv
import math

dataChecker = csv.reader(open('data.csv'))
for row in dataChecker:
    if len(row) != 4:
        print 'Invalid row length.'
        return

    my_int = int(row[2])
    my_float = float(row[3])

    if math.isnan(my_int):
        print 'Bad int found'
        return

    if math.isnan(my_float):
        print 'Bad float found'
        return

print 'All good!'

Dominic Rodger 2009-09-07 05:48:01

Better code than mine! +1 from me

Santi 2009-09-07 05:56:19

Answer 3

+1 A:

Here's a small snippet I made:

import csv 

f = csv.reader(open("test.csv"))

for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])

If you run that with a file that doesn't have the format your specified, you'll get an exception:

$ python valid.py 
Traceback (most recent call last):
  File "valid.py", line 8, in <module>
    i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'

You can then make a try-except ValueError to catch it and let the users know what they did wrong.

Santi 2009-09-07 05:53:40

Answer 4

+1 A:

There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.

Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.

You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.

Garen 2009-09-07 06:05:08

Answer 5

A:

Try Cutplace. It verifies that tabluar data conforms to an interface control document.

wierob 2009-09-07 14:05:57

Answer 6

+1 A:

Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).

from pyparsing import *

integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
        integer + COMMA + floatnum + LineEnd()

tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()

for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print

Prints

"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]

"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]

bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes

"bad","good2",100,3
"bad","good2",100,?3
Expected float

"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","

You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:

dblQuotedString.setParseAction(removeQuotes)

If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:

comment = '#' + restOfline
validLine.ignore(comment)

You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):

validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
        integer("qty") + COMMA + floatnum("price") + LineEnd()

And your post-processing code can then do this:

data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price

Paul McGuire 2009-09-07 15:46:20

ansaurus

tags:

views:

answers:

Verify CSV against given format

related questions