I am building some Postgres tables from Python dictionaries, where each {'key': 'value'} pair corresponds to the column 'key' and the field value 'value'. The dictionaries are generated from .dbf files -- I currently pipe the contents of the .dbf files into a script that returns a list of dicts like:

{'Warngentyp': '', 'Lon': '-81.67170', 'Zwatch_war': '0', 'State':...

Currently I am putting these into a sqlite database with no type declarations, then dumping it to a .sql file, manually editing the schema, and importing to Postgres.

I would love to be able to infer the correct type declarations -- basically, iterate over a list of strings like ['0', '3', '5'], ['ga', 'ca', 'tn'], or ['-81.009', '135.444', '-80.000'] and generate 'int', 'varchar(2)', or 'float' respectively. (I would be equally happy with a Python, Postgres, or SQLite tool.)

Is there a package that does this, or a straightforward way to implement it?

+1  A: 

You can determine integers and floats unsafely with type(eval(elem)), where elem is an element of the list (but then you need to check elem for possible malicious code).
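
For example, a minimal sketch of that eval-based check (the function name is just for illustration, and again this is only suitable for trusted input):

# Unsafe sketch: eval executes arbitrary expressions, so only use it on trusted data.
# Non-numeric strings such as 'ga' raise NameError or SyntaxError here.
def guess_type_unsafe(elem):
    return type(eval(elem))

guess_type_unsafe('3')    # int
guess_type_unsafe('3.5')  # float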

A safer way could be to do the following:

a = ['24.2', '.2', '2']
try:
    if all(elem.isdigit() for elem in a):
        print("int")
    elif all(float(elem) for elem in a):
        print("float")
except ValueError:
    i = len(a[0])
    if all(len(elem) == i for elem in a):
        print("varchar(%s)" % i)
    else:
        print("n/a")
jacob
I also don't think the eval was unsafe (in my case), and it was a good answer. I may have admins uploading data, so in some hypothetical case there could be a risk, but I did not stipulate unreliable data.
bvmou
+5  A: 

Don't use eval. If someone inserts bad code, it can hose your database or server.

Instead, use these:

def isFloat(s):
    try:
        float(s)
        return True
    except (ValueError, TypeError):
        return False


str.isdigit()

And everything else can be a varchar
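
For example, the two checks might be combined into a per-column guess along these lines (choose_type and the varchar(255) default are just illustrative, and values is assumed to be a non-empty list of strings):

# Illustrative helper combining str.isdigit() and isFloat() from above.
def choose_type(values):
    if all(v.isdigit() for v in values):
        return 'int'
    if all(isFloat(v) for v in values):
        return 'float'
    return 'varchar(255)'  # everything else can be a varchar

choose_type(['0', '3', '5'])                    # 'int'
choose_type(['-81.009', '135.444', '-80.000'])  # 'float'
choose_type(['ga', 'ca', 'tn'])                 # 'varchar(255)'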

Unknown
"If someone inserts bad code, it can hose your database or server" What? How is that going to happen? Who is this "someone" that could insert the bad code? What "bad code" will hose the server? 'import sys; sys.crash_server(True)' as a column value?
S.Lott
+1: avoids explicit type comparisons.
S.Lott
I forgot about the unsafety of eval. I've updated it below now :)
jacob
@jacob: eval is far from unsafe. However, this solution makes pythonic use of built-in conversions and exceptions. Very nice.
S.Lott
A: 

Thanks for the help -- this is a little long for an update, so here is how I combined the answers. I am starting with a list of dicts like this, generated from a dbf file:

dbf_list = [{'Warngentyp': '', 'Lon': '-81.67170', 'Zwatch_war': '0', 'State':...

Then a function that returns up to 1000 values per column to test for the best db type declaration, producing something like {'column_name': ['list', 'of', 'sample', 'values'], 'col2': ['1', '2', '3', '4', ...:

def sample_fields(dicts_, number=1000):  # dicts_ would be dbf_list from above
    sample = dict([(key, []) for key in dicts_[0]])  # one empty list per column name
    for dict_ in dicts_[:number]:
        for col_ in dict_:
            sample[col_].append(dict_[col_])
    return sample

Then combine the approaches from Unknown and jacob: varchar is a good default, and floats and ints are basically enough for everything else; it is all clear and fast:

def find_typedefs(sample_dict): #arg is output of previous function
    defs_ = {}
    for key in sample_dict:
        defs_[key] = 'varchar(255)'
        try:
            if all([int(value) for value in sample_dict[key]]):
                defs_[key] = 'int'
        except:
            try:
                if all([float(value) for value in sample_dict[key]]):
                    defs_[key] = 'float'
            except:
                continue
    return defs_

Then format the returned dict into a CREATE TABLE statement, iterate over the values in the original big list, and feed them into the database. It works great -- I am now skipping the intermediate SQLite step. Thanks again.
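
That last step looks roughly like the sketch below. The table name, helper names, and quoting are assumptions; a real script would use parameterised queries (e.g. through psycopg2) rather than string formatting:

# Sketch: turn the inferred types into SQL statements (hypothetical helpers).
def make_create_table(defs_, table_name='dbf_import'):
    cols = ', '.join('%s %s' % (name, type_) for name, type_ in defs_.items())
    return 'CREATE TABLE %s (%s);' % (table_name, cols)

def make_insert(row, table_name='dbf_import'):
    cols = ', '.join(row)
    vals = ', '.join("'%s'" % row[col].replace("'", "''") for col in row)
    return 'INSERT INTO %s (%s) VALUES (%s);' % (table_name, cols, vals)

defs_ = find_typedefs(sample_fields(dbf_list))
statements = [make_create_table(defs_)] + [make_insert(row) for row in dbf_list]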

Update for John Machin: I am using the shp2pgsql tool distributed with PostGIS. With a source like this one, it creates a schema like the one below:

   Column   |         Type          | 
------------+-----------------------+-
 gid        | integer               |
 st_fips    | character varying(7)  | 
 sfips      | character varying(5)  | 
 county_fip | character varying(12) | 
 cfips      | character varying(6)  | 
 pl_fips    | character varying(7)  | 
 id         | character varying(7)  | 
 elevation  | character varying(11) | 
 pop_1990   | integer               | 
 population | character varying(12) | 
 name       | character varying(32) | 
 st         | character varying(12) | 
 state      | character varying(16) | 
 warngenlev | character varying(13) | 
 warngentyp | character varying(13) | 
 watch_warn | character varying(14) | 
 zwatch_war | bigint                | 
 prog_disc  | bigint                | 
 zprog_disc | bigint                | 
 comboflag  | bigint                | 
 land_water | character varying(13) | 
 recnum     | integer               | 
 lon        | numeric               | 
 lat        | numeric               | 
 the_geom   | geometry              |

There is stuff there that has to be wrong -- FIPS is the Federal Information Processing Standard, and those codes should be integers between 0 and something like 100,000; the same goes for population, elevation, etc. Maybe I have more of a Postgres-specific question: I wouldn't mind losing a small amount of data, or pushing it into an errors table, while trying to change the type on, say, the population field. How strict is the dbf type checking? For example, I see that population per shp2pgsql is varchar(12). Is it possible that some small percentage of population fields contain something like '2,445 Est.'? If I take the approach I set out in this question, with the first thousand records, I get a schema like this:

   Column   |          Type          |
------------+------------------------+-
 warngentyp | character varying(255) | 
 lon        | double precision       | 
 zwatch_war | character varying(255) | 
 state      | character varying(255) | 
 recnum     | character varying(255) | 
 pop_1990   | integer                | 
 land_water | character varying(255) | 
 elevation  | integer                | 
 prog_disc  | integer                | 
 comboflag  | character varying(255) | 
 sfips      | integer                | 
 zprog_disc | integer                | 
 pl_fips    | integer                | 
 county_fip | integer                | 
 population | integer                | 
 watch_warn | integer                | 
 name       | character varying(255) | 
 st         | character varying(255) | 
 lat        | double precision       | 
 st_fips    | integer                | 
 cfips      | integer                | 
 id         | integer                | 
 warngenlev | integer                |

On the other hand, if I check every value with the all(['list', 'of', 'everything'...]) test instead of a sample, I get a schema more like the first one. I can tolerate a bit of data loss here -- if the entry for some town is wrong and it doesn't significantly affect the population figures, etc.

I am only using an old package called dbview to pipe the dbf files into these scripts -- I am not trying to map any of the format's native capability. I assumed that shp2pgsql would have picked the low-hanging fruit in that regard. Any suggestions for either dbview or another package are welcome -- although there are other cases where I may not be working with dbf files and would need to find the best types anyway. I am also going to ask a question about postgresql to see if I can find a solution at that level.

bvmou
+1  A: 

YOU DON'T NEED TO INFER THE TYPE DECLARATIONS!!!

You can derive what you want directly from the .dbf files. Each column has a name, a type code (C=Character, N=Number, D=Date (yyyymmdd), L=Logical (T/F), plus more types if the files are from Foxpro), a length (where relevant), and a number of decimal places (for type N).

Whatever software that you used to dig the data out of the .dbf files needed to use that information to convert each piece of data to the appropriate Python data type.

Dictionaries? Why? With a minor amount of work, that software could be modified to produce a CREATE TABLE statement based on those column definitions, plus an INSERT statement for each row of data.
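
A rough sketch of that idea, assuming you already have (name, type_code, length, decimal_count) tuples from whatever DBF reader you use (the mapping below only covers the common type codes and is not meant to be complete):

# Sketch: map DBF field descriptors straight to Postgres column types.
def dbf_field_to_sql(name, type_code, length, decimals):
    if type_code == 'N':
        if decimals:
            return '%s numeric(%d, %d)' % (name, length, decimals)
        return '%s integer' % name
    if type_code == 'D':
        return '%s date' % name
    if type_code == 'L':
        return '%s boolean' % name
    return '%s varchar(%d)' % (name, length)  # 'C' and anything unrecognised

def create_table_from_dbf(table_name, fields):
    cols = ', '.join(dbf_field_to_sql(*f) for f in fields)
    return 'CREATE TABLE %s (%s);' % (table_name, cols)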

I presume you are using one of the several published Python DBF-reading modules. Any one of them should have the facilities that you need: open a .dbf file, get the column names, get the column type etc info, get each row of data. If you are unhappy with the module that you are using, talk to me; I have an unpublished one that as far as reading DBFs goes, combines the better features of the others, avoids the worst features, is as fast as you'll get with a pure Python implementation, handles all the Visual Foxpro datatypes and the _NullFlags pseudo-column, handles memoes, etc etc.

HTH

========= Addendum: When I said you didn't need to infer types, you hadn't made it plain that you had a bunch of fields of type C which contained numbers.

FIPS fields: some have leading zeroes and some don't. If you are going to use them, you face the '012' != '12' != 12 problem. I'd suggest stripping off the leading zeroes and keeping them in integer columns, restoring the leading zeroes in reports or whatever only if you really need to. Why are there 2 each of state fips and county fips?
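
For instance, stripping and restoring is a one-liner each way (the five-digit width here is just an example; use whatever width that particular FIPS field calls for):

code = int('012')      # -> 12, so '012', '12' and 12 all compare equal in the column
print('%05d' % code)   # -> '00012', leading zeroes restored for reports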

Population: in the sample file, almost all are integer. Four are like 40552.0000, and a reasonable number are blank/empty. You seem to regard population as important, and asked "Is it possible that some small percentage of population fields contain .... ?" Anything is possible in data. Don't wonder and speculate, investigate! I'd strongly advise you to sort your data in population order and eyeball it; you'll find that multiple places in the same state share the same population count. E.g. There are 35 places in New York state whose pop'n is stated as 8,008,278; they are spread over 6 counties. 29 of them have a PL_FIPS value of 51000; 5 have 5100 -- looks like a trailing zero problem :-(

Tip for deciding between float and int: try anum = float(chars) first; if that succeeds, check if int(anum) == anum.
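
In code, that tip looks something like this (numeric_kind is just an illustrative name):

# Decide between int and float for a single string, per the tip above.
def numeric_kind(chars):
    try:
        anum = float(chars)
    except (ValueError, TypeError):
        return None  # not numeric at all
    return 'int' if int(anum) == anum else 'float'

numeric_kind('40552.0000')  # 'int' -- integral even though it has a decimal point
numeric_kind('-81.67170')   # 'float'
numeric_kind('2,445 Est.')  # None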

ID: wonderful "unique ID"; 59 cases where it's not an int -- several in Canada (the website said "US cities"; is this an artifact of some unresolved border dispute?), some containing the word 'Number', and some empty.

Low-hanging fruit: I would have thought that deducing that population was in fact integer was 0.1 inches above the ground :-)

There's a serious flaw in that if all([int(value) ... logic:

>>> all([int(value) for value in "0 1 2 3 4 5 6 7 8 9".split()])
False
>>> all([int(value) for value in "1 2 3 4 5 6 7 8 9".split()])
True
>>>

You evidently think that you are testing that all the strings can be converted to int, but you're adding the rider "and are all non-zero". Ditto float a few lines later.

IOW if there's just one zero value, you declare that the column is not integer. Even after fixing that, if there's just one empty value, you call it varchar. What I suggest is: count how many are empty (after normalising whitespace, which should include NBSP), how many qualify as integer, how many non-integer non-empty ones qualify as float, and how many are "other". Check the "other" ones; decide whether to reject or fix; repeat until happy :-)
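
A sketch of that counting approach for a single column's sample values (the category names are illustrative; '\xa0' is NBSP):

# Tally empty / int / float / other for one column, instead of all()-style tests.
def classify_column(values):
    counts = {'empty': 0, 'int': 0, 'float': 0, 'other': []}
    for v in values:
        stripped = v.replace('\xa0', ' ').strip()  # normalise whitespace, incl. NBSP
        if not stripped:
            counts['empty'] += 1
            continue
        try:
            int(stripped)
            counts['int'] += 1
            continue
        except ValueError:
            pass
        try:
            float(stripped)
            counts['float'] += 1
        except ValueError:
            counts['other'].append(v)  # inspect these by hand; fix or reject, repeat
    return counts

classify_column(['0', '  ', '40552.0000', '2,445 Est.'])
# -> 1 empty, 1 int, 1 float, and ['2,445 Est.'] left to check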

I hope some of this helps.

John Machin
+1, I would love to try your module -- if you don't mind sharing, I am my username at yahoo. It may be that I am using a library that is really designed for something else; I expanded the self-answer per your comments, and will check back here if you want to add to your post.
bvmou
I'll send it to you later.
John Machin
Alas this is the best dataset of its kind I have been able to find. No zeroes so far in the test fields, or in the error log I am using for failed inserts. I have dug into the data and there are basically 88 incorrigible rows, places in Colorado and Canada and NJ with things like arithmetic operators for names. I am down to .2% error rate. 88 out of 40k+ is OK, and Canada should be excluded anyway. "Credit Island, Iowa" would be nice to have, though. My goal was to automate a class of imports, and I'm getting there -- I am studying and using your module, btw, you will get lots of ...
bvmou
... downloads if you put it on google code. These dbf files are popping all over now, in a variety of vintages, maybe related to government geodata disclosures. Thanks again for the help.
bvmou