views:

382

answers:

3

I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?

This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but it is possible they'll be needed later.

+3  A: 

Here's a random list, in no particular order:

  1. Connect to a wide range of sources, including all the major relational databases.
  2. Handle non-relational data sources like text files, Excel, XML, etc.
  3. Allow multiple sources to be mapped into a single target.
  4. Provide a tool to help map from source to target fields.
  5. Offer a framework for injecting transformations at will.
  6. Programmable API for writing complex transformations.
  7. Optimize load process for speed.
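Items 5 and 6 above (a framework for injecting transformations, plus a programmable API) can be sketched as a small pipeline class. This is a minimal, hypothetical design, not a reference implementation; the `Pipeline` and `rename` names are made up for illustration:

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Transform = Callable[[Row], Row]

class Pipeline:
    """Chain row-level transformations and apply them to a record stream."""

    def __init__(self) -> None:
        self._steps: List[Transform] = []

    def add(self, step: Transform) -> "Pipeline":
        """Inject an arbitrary transformation; returns self for chaining."""
        self._steps.append(step)
        return self

    def run(self, rows: Iterable[Row]) -> Iterator[Row]:
        """Lazily push each row through every registered step."""
        for row in rows:
            for step in self._steps:
                row = step(row)
            yield row

def rename(old: str, new: str) -> Transform:
    """A reusable mapping step: copy the row, moving one field to a new name."""
    def step(row: Row) -> Row:
        row = dict(row)
        row[new] = row.pop(old)
        return row
    return step

# Usage: map a source field to a target field, then normalize a value.
pipeline = (Pipeline()
            .add(rename("customerId", "customer_id"))
            .add(lambda r: {**r, "name": str(r["name"]).strip().title()}))

for row in pipeline.run([{"customerId": 1, "name": "  alice  "}]):
    print(row)
```

Because `run` is a generator, rows stream through the steps one at a time instead of being materialized all at once, which matters when the source is large.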
duffymo
I'd like to see "graphical" removed from #4. While the pictures are cute, we always wind up reading code to see what's *really* happening.
S.Lott
@S. Lott - Done.
duffymo
+3  A: 

Let's think of the ETL use cases for a moment.

  1. Extract.
    • Read databases through a generic DB-API adapter.
    • Read flat files through a similar adapter.
    • Read spreadsheets through a similar adapter.
  2. Cleanse.
    • Arbitrary rules
    • Filter and reject
    • Replace
    • Add columns of data
  3. Profile Data.
    • Statistical frequency tables.
  4. Transform (see Cleanse; they're two use cases with the same implementation).
  5. Do dimensional conformance lookups.
    • Replace values, or add values.
  6. Aggregate.
    • At any point in the pipeline
  7. Load.
    • Or prepare a flat-file and run the DB product's loader.
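The use cases above compose naturally as a chain of Python generators, one per stage. A minimal sketch, with a CSV flat-file source and a print-based loader standing in for a real DB adapter (the field names and cleansing rule are invented for the example):

```python
import csv
import io
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]

def extract(fileobj) -> Iterator[Row]:
    """Extract: read a flat file through a generic adapter (CSV here)."""
    yield from csv.DictReader(fileobj)

def cleanse(rows: Iterable[Row]) -> Iterator[Row]:
    """Cleanse: filter and reject rows that break an arbitrary rule."""
    for row in rows:
        if row["amount"].strip():   # reject rows with no amount
            yield row

def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Transform: replace values or add columns of data."""
    for row in rows:
        row["amount_cents"] = str(int(float(row["amount"]) * 100))
        yield row

def load(rows: Iterable[Row]) -> None:
    """Load: print here; a real loader would write to the target DB."""
    for row in rows:
        print(row)

data = io.StringIO("id,amount\n1,2.50\n2,\n3,10.00\n")
load(transform(cleanse(extract(data))))
```

Each stage only ever holds one row at a time, so the same pipeline shape works for a ten-row test file or a multi-million-row extract.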

Further, there are some additional requirements that aren't single use cases.

  • Each individual operation has to be a separate process that can be connected in a Unix pipeline, with individual records flowing from process to process. This keeps all the CPU resources busy.

  • You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.

  • You need an event-based scheduler for places that can figure out the preconditions for ETL processing steps.
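The pipeline-of-processes point above can be sketched as a stage that reads records from stdin and writes them to stdout, so the shell runs the stages concurrently (e.g. `extract.py | cleanse.py | load.py`). This is one hypothetical stage, using JSON lines as the record format purely as an illustration:

```python
import io
import json
import sys

def cleanse_stream(inp, out) -> None:
    """One pipeline stage: pass through records that satisfy an arbitrary rule."""
    for line in inp:
        record = json.loads(line)
        # Cleansing rule (invented for the example): reject records
        # that have no customer_id.
        if record.get("customer_id") is not None:
            out.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # When run as   ... | python cleanse.py | ...   the stage is its own
    # process, so the OS overlaps its work with the other stages.
    cleanse_stream(sys.stdin, sys.stdout)
```

Keeping the record-processing logic in a function that takes file-like objects also makes each stage testable without a shell pipeline.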

Note: since ETL is I/O bound, multiple threads do you little good. And since each process runs for a long time -- especially if you have thousands of rows of data to process -- the overhead of "heavyweight" processes doesn't hurt.

S.Lott
A: 

Automatic/heuristic mapping of column names, e.g. simple string mappings:

DB1: customerId

DB2: customer_id

I find that a lot of the work I have done in DTS/SSIS could have been automatically generated.

  • This is not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
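A heuristic like this can be sketched by canonicalizing both sides' names (split camelCase, drop separators, lowercase) and matching on the canonical form. The function names below are invented for the example:

```python
import re
from typing import Dict, List, Optional

def canonical(name: str) -> str:
    """Normalize a column name: split camelCase, lowercase, drop separators."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return re.sub(r"[^a-z0-9]", "", name.lower())

def auto_map(source_cols: List[str],
             target_cols: List[str]) -> Dict[str, Optional[str]]:
    """Map each source column to the target column with the same canonical
    form, or None when no target matches."""
    targets = {canonical(c): c for c in target_cols}
    return {c: targets.get(canonical(c)) for c in source_cols}

print(auto_map(["customerId", "OrderDate"], ["customer_id", "order_date"]))
# → {'customerId': 'customer_id', 'OrderDate': 'order_date'}
```

Unmatched columns come back as `None`, so the tool can fall back to asking the user for just those mappings instead of all of them.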
geejay