This is sort of a general question that has come up in several contexts; the example below is representative but not exhaustive. I am interested in any techniques for working with Postgres on imperfect (but close enough) data sources.

The specific case -- I am using Postgres with PostGIS to work with government data published as shapefiles and XML. Using the shp2pgsql module distributed with PostGIS (for example, on this dataset) I often get schemas like this:

   Column   |         Type
------------+-----------------------
 gid        | integer
 st_fips    | character varying(7)
 sfips      | character varying(5)
 county_fip | character varying(12)
 cfips      | character varying(6)
 pl_fips    | character varying(7)
 id         | character varying(7)
 elevation  | character varying(11)
 pop_1990   | integer
 population | character varying(12)
 name       | character varying(32)
 st         | character varying(12)
 state      | character varying(16)
 warngenlev | character varying(13)
 warngentyp | character varying(13)
 watch_warn | character varying(14)
 zwatch_war | bigint
 prog_disc  | bigint
 zprog_disc | bigint
 comboflag  | bigint
 land_water | character varying(13)
 recnum     | integer
 lon        | numeric
 lat        | numeric
 the_geom   | geometry

I know that at least ten of those varchars -- the FIPS codes, elevation, population, etc. -- should be ints, but when I try to cast them I get errors. In general, I think I could solve most of my problems if I could get Postgres to accept a default value -- say 0 or -1 for an int type -- wherever it finds an empty string while altering a column and changing its type. Is this possible?
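For example, a straight cast during the type change blows up on the first empty string it hits -- something along these lines (made-up table name, real column from the schema above):

ALTER TABLE places ALTER COLUMN elevation TYPE integer USING elevation::integer;
ERROR:  invalid input syntax for integer: ""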

If I create the table before importing, with type declarations generated from the original data source, I get better types than shp2pgsql infers, and I can iterate over the source entries, feeding them to the database and discarding any failed inserts. The fundamental problem is that if 1% of fields are bad, spread evenly over 25 columns, I will lose close to a quarter of my rows (1 - 0.99^25 is about 22%), since a given insert fails if any one of its fields is bad. I would love to be able to make a best-effort insert and fix any problems later, rather than lose that many rows.

Any input from people who have dealt with similar problems is welcome -- I am not a MySQL guy trying to batter PostgreSQL into making all the same mistakes I am used to; I am just dealing with data I don't have full control over.

+1  A: 

Could you produce a SQL file from shp2pgsql and do some massaging of the data before executing it? If the data is in COPY format, it should be easy to parse and to change "" to "\N" (insert as NULL) in the affected columns.
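For instance, the data section of a COPY-format dump is tab-separated, and an unquoted \N is read as NULL, so after the substitution the file would look something like this (made-up table name and values, column names from the schema above):

COPY placenames (gid, st_fips, elevation, pop_1990) FROM stdin;
1	0100124	146	1147
2	0100460	\N	\N
\.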

Another possibility would be to use shp2pgsql to load the data into a staging table where all the fields are defined as plain 'text', and then use an INSERT...SELECT statement to copy the data to your final table, massaging it in the SELECT as you go -- converting blank strings to NULL and so on.
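As a rough sketch of that (assuming a staging table called places_staging where every column is text, and only a handful of columns for brevity), nullif() turns blank strings into NULLs before the cast:

CREATE TABLE places (
    gid        integer,
    st_fips    varchar(7),
    elevation  integer,
    pop_1990   integer,
    population integer
);

INSERT INTO places (gid, st_fips, elevation, pop_1990, population)
SELECT gid::integer,
       st_fips,
       nullif(trim(elevation), '')::integer,
       pop_1990::integer,
       nullif(trim(population), '')::integer
FROM places_staging;

If you want 0 or -1 rather than NULL, wrap the expression in coalesce(), e.g. coalesce(nullif(trim(elevation), '')::integer, 0).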

I don't think there's a way to override the behaviour of how strings are converted to ints and so on: possibly you could create your own type or domain, and define an implicit cast that was more lenient... but this sounds pretty nasty, since the types are really just artifacts of how your data arrives in the system and not something you want to keep around after that.
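If you do want something reusable without touching the cast machinery, an ordinary conversion function is a tamer option than an implicit cast -- a sketch (the function name is made up):

CREATE FUNCTION int_or_null(text) RETURNS integer AS $$
    SELECT CASE
        WHEN trim($1) ~ '^-?[0-9]+$' THEN trim($1)::integer
        ELSE NULL
    END;
$$ LANGUAGE sql IMMUTABLE;

You could then write int_or_null(elevation) and so on in the INSERT...SELECT above.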

You asked about fixing it up when changing the column type: you can do that too, for example:

steve@steve@[local] =# create table test_table(id serial primary key, testvalue text not null);
NOTICE:  CREATE TABLE will create implicit sequence "test_table_id_seq" for serial column "test_table.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_table_pkey" for table "test_table"
CREATE TABLE
steve@steve@[local] =# insert into test_table(testvalue) values('1'),('0'),('');
INSERT 0 3
steve@steve@[local] =# alter table test_table alter column testvalue type int using case testvalue when '' then 0 else testvalue::int end;
ALTER TABLE
steve@steve@[local] =# select * from test_table;
 id | testvalue
----+-----------
  1 |         1
  2 |         0
  3 |         0
(3 rows)

That is almost equivalent to the "staging table" idea I suggested above, except that now the staging table is your final table. Altering a column type like this requires rewriting the entire table anyway, so using a staging table and reformatting multiple columns at once is likely to be more efficient.
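If you do stick with altering in place, note that one ALTER TABLE statement can carry several ALTER COLUMN ... USING clauses, so multiple columns can be converted in a single rewrite -- for example (assuming the columns from the question live in a table called places):

ALTER TABLE places
    ALTER COLUMN elevation  TYPE integer USING nullif(elevation, '')::integer,
    ALTER COLUMN population TYPE integer USING nullif(population, '')::integer;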

araqnid
My gods, replacing the empty strings with "\N" fixed practically all the errors. I am going to figure out how to get shp2pgsql to do this on the original conversion.
bvmou
This gets me down to about a 0.2% error rate -- it was actually more successful with an unquoted `Null` than with '\N'.
bvmou