views:

63

answers:

3

I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:

  1. integer
  2. floating point
  3. date
  4. boolean
  5. string

My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.

I am particularly concerned about the date data type - is their a python module for parsing common date idioms (obviously I will not be able to detect them all)?

What about integers and floats?

+2  A: 

Dateutil comes to mind for parsing dates.

For integers and floats you could always try a cast in a try/except section

>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): dsa
>>> try:
...   cg = float(g)
... except:
...   print "g is not a float"
...
g is not a float
>>>
Vinko Vrsalovic
`dateutil.parser` looks promising, thank you.
fmark
+2  A: 

ast.literal_eval() can get the easy ones.

Ignacio Vazquez-Abrams
Would *never* have found this one on my own! Thanks.
fmark
-1 ast.literal_eval() as one might imagine is more suitable for CODE than data ... for example, ast.literal_eval(" 123 ") raises IndentationError, while int(" 123 ") isn't spooked by whitespace. Look at the result of `field = "some text"; ast.literal_eval(field)` ... `ValueError: malformed string`. No use on booleans: works with True, but TRUE (usual Excel CSV output) -> malformed string. Useless with dates.
John Machin
+1  A: 

The data types I am currently interested in are...

These do not exist in a CSV file. The data is only strings. Only. Nothing more.

test a sample of rows

Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.

The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.

If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything. A number, a typical Gregorian date in US or European format (there's not way to know whether 1/1/10 is US or European).

try:
    x= datetime.datetime.strptime( row[0], some format )
except ValueError:
    # column is not valid.

If column 2 is supposed to be a float, you can only do this.

try:
    y= float( row[1] )
except ValueError:
    # column is not valid.

If column 3 is supposed to be an int, you can only do this.

try:
    z= int( row[2] )
except ValueError:
    # column is not valid.

There is no way to "discover" if the CSV has floating-point digit strings except by doing float on each row. If a row fails, then someone prepared the file improperly.

Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.

Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.

S.Lott