I am about to embark on a PostgreSQL project for a client. They want to develop a huge professional database with many complex joins, so after consideration I have chosen to go with PostgreSQL over MySQL.

An important consideration is how to interface with the database effectively through scripts. Currently the client uses about a million scripts to import and reshape data to their needs, but uses no database (unless you consider CSV files to be a database). With the arrival of a database structure with queries and views, the need for scripts will diminish, but importing will still need to be done often, and so will exporting/reporting. For me the ideal end result would be a series of standardized scripts, preferably with a web interface, so that the client can perform regular tasks quickly and error-free at the click of a button.

My question is which scripting approach would be most appropriate. Probably any scripting language with a Postgres or ODBC driver would suffice, but I am looking to make a smart choice for the long term. Does anybody have experience with this? Does Postgres offer an internal scripting language, and is it easy to build a GUI on top of it? Are there standardized tools available for importing/exporting, and are they customizable enough to reduce routine tasks to a single click? How about PHP or Perl?

Thanks in advance. Any tips, resources, puzzled looks or pitiful gestures will be truly appreciated ;-)

+1  A: 

Since you are talking about scripts whose express purpose is to manipulate the database, I would start with the most native tools.

  • SQL and PL/pgSQL stored functions for manipulating and processing data
  • COPY FROM and COPY TO for importing from and exporting to flat files (both are sketched below)
  • An ETL tool for any reshaping that can't be handled with the above
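
For illustration, here is a minimal sketch of the first two bullets. All table, column and file names are hypothetical, and it assumes a target table `orders` already exists: load a flat file into a staging table with COPY, then reshape it with a PL/pgSQL function.

    -- Hypothetical staging table matching the CSV layout
    CREATE TABLE staging_orders (
        order_ref  text,
        ordered_on text,    -- kept as text; validated while reshaping
        amount     text
    );

    -- Server-side bulk load (CSV HEADER needs PostgreSQL 8.1+;
    -- from psql, \copy does the same from the client side)
    COPY staging_orders FROM '/tmp/orders.csv' WITH CSV HEADER;

    -- Reshape staged rows into the real table, inside the database
    CREATE OR REPLACE FUNCTION import_orders() RETURNS integer AS $$
    DECLARE
        n integer;
    BEGIN
        INSERT INTO orders (order_ref, ordered_on, amount)
        SELECT order_ref, ordered_on::date, amount::numeric
        FROM staging_orders;
        GET DIAGNOSTICS n = ROW_COUNT;
        TRUNCATE staging_orders;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;

    -- Exporting is the mirror image (COPY from a query needs 8.2+):
    -- COPY (SELECT * FROM orders) TO '/tmp/orders.csv' WITH CSV HEADER;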

Now, you want to provide an easy web interface for invoking these scripts. Here the best language is probably the one you or your team already knows. All major languages have Postgres drivers, and the language you choose will matter very little if you keep the data manipulation tasks at the database layer.

One thing to consider is how long the typical script will take to execute. If it is more than a few minutes, I suggest decoupling it from the web interface. In that case, the web interface should let the user queue the script so that the server can run it independently of the web request cycle.
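
A rough sketch of that queueing idea, assuming a single worker process (table and column names are hypothetical; multiple concurrent workers would need more careful locking): the web request only inserts a row, and a cron job or daemon picks jobs up later.

    -- Hypothetical job queue: the web request only INSERTs here
    CREATE TABLE script_jobs (
        id          serial PRIMARY KEY,
        script_name text NOT NULL,
        status      text NOT NULL DEFAULT 'queued',  -- queued/running/done/failed
        queued_at   timestamp NOT NULL DEFAULT now(),
        finished_at timestamp
    );

    -- Web interface: enqueue and return to the user immediately
    INSERT INTO script_jobs (script_name) VALUES ('monthly_import');

    -- Worker: claim the oldest queued job (RETURNING needs 8.2+)
    UPDATE script_jobs
    SET status = 'running'
    WHERE id = (SELECT id FROM script_jobs
                WHERE status = 'queued'
                ORDER BY queued_at
                LIMIT 1)
    RETURNING id, script_name;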

cope360
This is solid advice. Thank you. Pending other insights, I'm inclined to go with PHP in a framework such as Symfony, using AJAX to issue the database calls.
littlegreen
Update: I've now been using MS SQL Server for a while instead of Postgres, and I've found that internal T-SQL functions and SSIS packages (ETL) are far faster than any external scripting language connecting over ODBC. I am inclined to think the same holds for Postgres. So I'd advise anyone: start with internal stored procedures, and only then look at other scripting possibilities.
littlegreen
Small addition: you are right about the web interface too. It is far faster to let the web interface call native stored procedures than to do the scripting or manipulation in your PHP/ASP/whatever web script itself.
littlegreen
+1  A: 

I use Perl, and I would suggest it too. As for the database scripting languages Postgres offers: pl/perl is far more developed than pl/php, and pl/perlu brings CPAN to Postgres.

You also have plain SQL functions (great for small stuff) and pl/pgsql (which has its applications too).

Unfortunately, CSV support in Postgres is kind of crufty. I'm going to start a Perl project to fix this very soon (probably within the next month). Currently you almost have to use pgloader, which (imho) has a kludgey syntax. Perl has Text::CSV_XS, which is a godsend for processing CSVs.

Perl has historically done database access and manipulation much better than PHP, and I still believe it is a good way ahead. PHP lacks an asynchronous framework and is rather limited to the web. PHP's strong points tend to center on non-technical features of the language, namely the learning curve and shared-hosting portability.
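
To make the pl/perl vs. pl/perlu distinction concrete, a minimal sketch (the function name is mine, for illustration only): a set-returning plperlu function that splits one CSV line with Text::CSV_XS. Plain pl/perl would reject the `use` of an external module; the untrusted plperlu allows it. Set-returning PL/Perl functions via return_next need PostgreSQL 8.2+.

    -- plperlu, not plperl: loading Text::CSV_XS requires the untrusted variant
    CREATE OR REPLACE FUNCTION csv_fields(line text)
    RETURNS SETOF text AS $$
        use Text::CSV_XS;
        my $csv = Text::CSV_XS->new({ binary => 1 });
        $csv->parse($_[0]) or elog(ERROR, 'cannot parse CSV line');
        return_next($_) for $csv->fields;  -- one result row per CSV field
        return;
    $$ LANGUAGE plperlu;

    -- SELECT * FROM csv_fields('a,"b,c",d');  yields three rows: a | b,c | d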

Evan Carroll
Why pgloader? What about `COPY <tablename> FROM <filename|STDIN> WITH CSV`? I never have any issues with that, either from psql or from Perl/DBD::Pg.
Matthew Wood
Thanks. This is all very useful info for me. Perhaps a silly question, but is `pl/perl` a full-fledged Perl in which I can use `Text::CSV_XS`, or do I need `pl/perlu` for that?
littlegreen
@matthew wood: In the real world COPY, as currently implemented, is never enough. I'm the author of http://wiki.postgresql.org/wiki/COPY if you want to see the negative aspects of COPY and the problems with it. @littlegreen: `pl/perlu` just means you can pull in external modules; there is nothing special about it per se, but using `Text::CSV_XS` isn't scripting in the DB. I use Text::CSV_XS to transform a CSV for import into psql because of the lack of native CSV functionality in psql. I use `pl/perlu`, for instance, to create domains for email addresses and VIN numbers.
Evan Carroll
While I can understand your frustration when you receive CSV files from a source you can't contact/control/whatever, five of the six issues there are either not relevant to CSV import (NULL AS...) or are due to improperly formatted CSV files. The ONLY one that's really COPY's fault is not coercing locale-formatted datatypes, but I'd rather it not do that, since I exchange data with European providers and I'm in the US (15.000 would silently convert into 15 on my side if you didn't have both locales defined; better to reject and intervene manually).
Matthew Wood
I'm not sure how you've managed to compartmentalize in such a fashion. All of these problems are caused by bad exports, which is a real-life commonality when dealing with CSVs. All of them are manageable in MSSQL and MySQL through their CSV loading interfaces, yet they aren't manageable with just \COPY. I refuse to divide the blame. Many of Pg's problems are acknowledged upstream, and they're simply waiting for someone with sufficient know-how to submit an acceptable patch. Pg didn't even have the ability to process CSVs with headers (discard the first row) until 8.1.
Evan Carroll
COPY FROM is a good option, but not very flexible... if you want to do any preprocessing, it is better to use an ETL/loader tool.
littlegreen
+3  A: 

I use Python/Jython to connect to PostgreSQL and do various things.

Pluses:

  • there is pl/python, so you can use Python from within PostgreSQL (see the sketch after this list)
  • you can create "standalone" programs using the Python DB API; there is documentation on database programming with Python, and on PostgreSQL in particular
  • you can use Jython if you like the JVM environment, especially its JDBC drivers
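
As a small illustration of the first bullet (the function name is mine, for illustration only): a plpythonu function that normalizes a messy text field, the kind of string handling where Python shines.

    -- Collapse runs of whitespace and lowercase a field during import
    CREATE OR REPLACE FUNCTION clean_field(raw text)
    RETURNS text AS $$
        if raw is None:
            return None          # NULL in, NULL out
        return " ".join(raw.split()).lower()
    $$ LANGUAGE plpythonu;

    -- SELECT clean_field('  Foo   BAR ');  yields 'foo bar'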

Examples of usage:

  • converting PDF, MS Word and OpenOffice documents saved in BLOBs to text in order to index them
  • importing data from various sources, not only CSV; Python is really strong at converting text data
  • testing drivers: some of our apps are native and use ODBC or JDBC drivers, and Jython can work with both (for ODBC there is the JDBC-ODBC bridge)

In my other posts on SO you can see that I use Python/Jython with other databases, such as Oracle and Informix. For example, I created tools that dump some info from a database schema so I can easily compare databases in test and production environments.

Michał Niklas
Thanks. I'll consider using Python as well as Perl.
littlegreen
+1  A: 

SchemaCrawler for PostgreSQL allows you to script against a database using JavaScript.

Sualeh Fatehi
JavaScript, yet another possibility :-) This is getting interesting. I'll put some time into investigating all options and see which is most powerful.
littlegreen