What are the required functionnalities of ETL frameworks ?

views:

382

answers:

What are the required functionnalities of ETL frameworks ?

I am writing an ETL (in python with a mongodb backend) and was wondering : what kind of standard functions and tools an ETL should have to be called an ETL ?

This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (xml and csv) I don't need any multidimensional tools, but it is a possibility that it'll needed later.

+3 A:

Here's a random list, in no particular order:

Connect to a wide range of sources, including all the major relational databases.
Handle non-relational data sources like text files, Excel, XML, etc.
Allow multiple sources to be mapped into a single target.
Provide a tool to help map from source to target fields.
Offer a framework for injecting transformations at will.
Programmable API for writing complex transformations.
Optimize load process for speed.

duffymo 2009-08-24 09:49:33

I'd like to see "graphical" removed from #4. While the pictures are cute, we always wind up reading code to see what's *really* happening.

S.Lott 2009-08-24 10:18:31

@S. Lott - Done.

duffymo 2009-08-24 22:26:08

+3 A:

Let's think of the ETL use cases for a moment.

Extract.
- Read databases through a generic DB-API adapter.
- Read flat files through a similar adapter.
- Read spreadsheets through a similar adapter.
Cleanse.
- Arbitrary rules
- Filter and reject
- Replace
- Add columns of data
Profile Data.
- Statistical frequency tables.
Transform (see cleanse, they're two use cases with the same implementation)
Do dimensional conformance lookups.
- Replace values, or add values.
Aggregate.
- At any point in the pipeline
Load.
- Or prepare a flat-file and run the DB product's loader.

Further, there are some additional requirements that aren't single use cases.

Each individual operation has to be a separate process that can be connected in a Unix pipeline, with individual records flowing from process to process. This uses all the CPU resources.
You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.
You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.

Note. Since ETL is I/O bound, multiple threads does you little good. Since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.

S.Lott 2009-08-24 10:26:51

Automatic / heuristic mapping of column names. E.g simple string mappings:

DB1: customerId

DB2: customer_id

I find a lot of the work I (have) done in DTS / SSIS could've been automatically generated.

not necessarily "required functionality", but would keep a lot of your users very happy indeed.

geejay 2009-08-24 11:38:42

ansaurus

tags:

views:

answers:

What are the required functionnalities of ETL frameworks ?

related questions