views:

151

answers:

3

We've put a lot of work into a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries built up by string manipulation.

Is there any best/standard practice for interfacing between python and a complex database layout?

I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for small web-style (OLTP) transactions, whereas I'm doing high-volume analytical (OLAP) work.

Some of my requirements, that may be somewhat different than usual:

  1. load large amounts of data relatively quickly
  2. update/insert small amounts of data quickly and easily
  3. handle large numbers of rows easily (300 entries per minute over 5 years)
  4. allow for modifications in the schema, for future requirements

Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?

+2  A: 

SQLAlchemy, definitely. Compared to SQLAlchemy, every other ORM looks like a child's toy. Especially the Django-ORM. What Hibernate is to Java, SQLAlchemy is to Python.

dekomote
I've tried to learn the SA ORM three times now, and each time I come up against completely bizarre errors that make me not want to touch it ever again. How did you learn it?
bukzor
documentation and coding.
dekomote
"completely bizarre errors" or "assumptions of mine that turned out to be untrue out SA"? Often, we make assumptions, find that the software doesn't match our assumptions, and call the whole thing "bizarre". Without details, it's impossible to evaluate your assertion. If you have problems, please post them as questions.
S.Lott
@S.Lott: You're probably right, but after repeated attempts, I'm tempted to think that if my assumptions are repeatedly wrong, then probably it's not the tool for me. I'm interested in @user184757's approach; the SA core expression language seems more sane.
bukzor
+5  A: 

Don't get confused by your requirements. One size does not fit all.

load large amounts of data relatively quickly

Why not use the database's native loaders for this? Use Python to prepare files, but use database tools to load. You'll find that this is amazingly fast.
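As an illustration of that split -- a minimal sketch, assuming PostgreSQL and psycopg2, with a made-up fact_measurements table (any bulk loader works the same way):

    import csv
    import psycopg2

    def prepare_fact_file(raw_rows, path="facts.csv"):
        """Python's job: turn the raw source into a clean, delimiter-safe file."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for row in raw_rows:
                writer.writerow(row)  # apply any cleansing/transformation here
        return path

    def bulk_load(path, dsn="dbname=warehouse"):
        """The database's job: bulk-load the file with its native COPY machinery."""
        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cur, open(path) as f:
            cur.copy_expert("COPY fact_measurements FROM STDIN WITH CSV", f)
        conn.close()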

update/insert small amounts of data quickly and easily

That starts to bend the rules of a data warehouse, unless you're talking about Master Data Management to update the reporting attributes of a dimension.

That's what ORMs and web frameworks are for.
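For those small, occasional dimension updates, an ORM sketch along these lines is plenty (recent SQLAlchemy ORM style; DimMachine and its columns are hypothetical):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class DimMachine(Base):              # hypothetical dimension table
        __tablename__ = "dim_machine"
        machine_key = Column(Integer, primary_key=True)
        name = Column(String)
        rack = Column(String)

    engine = create_engine("postgresql:///warehouse")

    # Type 1 style change: overwrite a reporting attribute in place.
    with Session(engine) as session:
        machine = session.get(DimMachine, 42)
        machine.rack = "B-07"
        session.commit()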

handle large numbers of rows easily (300 entries per minute over 5 years)

Again, that's why you use Python as a front-end processing pipeline, but have the actual INSERTs done by database tools. Not Python.
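The same division of labor works when the rows stream through Python for validation first; a rough sketch, again assuming PostgreSQL/psycopg2 and invented names:

    import io
    import psycopg2

    def clean(raw_lines):
        """Python front-end: normalize each record, drop the ones that fail validation."""
        for line in raw_lines:
            fields = line.rstrip("\n").split(",")
            if len(fields) == 3:                 # whatever validation applies
                yield "\t".join(fields) + "\n"

    def load(raw_lines, dsn="dbname=warehouse"):
        buf = io.StringIO("".join(clean(raw_lines)))
        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cur:
            # The database does the loading; no row-by-row INSERTs from Python.
            cur.copy_from(buf, "fact_measurements", sep="\t")
        conn.close()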

alter schema (along with python interface) easily, for future requirements

You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.

BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard for the RDBMS parser to handle -- they're slower than using queries that have bind variables inserted.

S.Lott
Thanks. What I meant by "alter" was "don't lock down the schema". Often, when interacting with a database, the amount of code that would have to change crosses a threshold where editing the schema is no longer worth it. I'd like to avoid that situation entirely if possible.
bukzor
When loading into the database, how do you handle new data that overlaps the old? Or do you simply load all the data each time? That seems prohibitively inefficient.
bukzor
@bukzor: Data Warehouse (almost always) means insert-only. New data --- by definition -- is *new* and does not *overlap* the old. It may share some common keys, but it's *new*. Please find a copy of Kimball's The Data Warehouse Toolkit before proceeding.
S.Lott
@S.Lott: I got it last week but I'm still on chapter 2... Does that mean you always start with a blank database? Beyond that, at least some of my data will "overlap": jobs end, machines are upgraded, etc. I note that nearly all of the slowly-changing-dimension types require updating the old data to expire it.
bukzor
@bukzor: "update reporting attributes of a dimension". I believe that's the use case for proper SQL. "you start with a blank database always?" Kind of. You often know the dimensions in advance. You rarely know the facts in advance. "jobs end, machines are upgraded" sounds like dimension updates to me. These are small, isolated, specialized, rare things. Not like fact loading, which is 80-90% of the database change.
S.Lott
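For what it's worth, the "update the old data to expire it" step from the slowly-changing-dimension discussion above is a small, two-statement affair. A hedged sketch in plain SQL via SQLAlchemy, with an invented dim_machine table (machine_id as the natural key; the surrogate key is omitted for brevity):

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql:///warehouse")

    # Type 2 change for one machine: close out the current row, insert the new one.
    with engine.begin() as conn:
        conn.execute(
            text("UPDATE dim_machine SET valid_to = :today "
                 "WHERE machine_id = :id AND valid_to IS NULL"),
            {"today": "2010-06-01", "id": 42},
        )
        conn.execute(
            text("INSERT INTO dim_machine (machine_id, rack, valid_from, valid_to) "
                 "VALUES (:id, :rack, :today, NULL)"),
            {"id": 42, "rack": "B-07", "today": "2010-06-01"},
        )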
+3  A: 

I'm using SQLAlchemy with a pretty big data warehouse, and I'm using it for the full ETL process with success, especially for sources with complex transformation rules or heterogeneous sources (such as web services). I'm not using the SQLAlchemy ORM but rather its SQL Expression Language, because I don't really need to map anything to objects in the ETL process. Worth noticing that when I'm bringing in a verbatim copy of one of the sources, I'd rather use the database tools for that (such as the PostgreSQL dump utility) -- you can't beat them.

The SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM, for that matter) to hand-written SQL, but since you can programmatically generate the SQL from Python, you will save time, especially if you have some really complex transformation rules to follow.
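A rough illustration of that style (recent SQLAlchemy, 1.4+ syntax): reflect the target table and build the statement programmatically; the connection string, table, and column names are placeholders:

    from sqlalchemy import MetaData, Table, create_engine, func, select

    engine = create_engine("postgresql:///warehouse")
    metadata = MetaData()

    # Reflect the table from the database instead of mapping it to a class.
    fact_jobs = Table("fact_jobs", metadata, autoload_with=engine)

    # Build the statement in Python; SQLAlchemy emits the SQL and the bind parameters.
    stmt = (
        select(fact_jobs.c.machine_key, func.count().label("job_count"))
        .where(fact_jobs.c.status == "failed")
        .group_by(fact_jobs.c.machine_key)
    )

    with engine.connect() as conn:
        for machine_key, job_count in conn.execute(stmt):
            print(machine_key, job_count)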

One thing, though: I'd rather modify my schema by hand. I don't trust any tool for that job.

+1: "modify my schema by hand". It's too hard to write a tool for this -- too many special cases.
S.Lott