views:

256

answers:

3

Hey guys,

I have recently been exposed to some ETL tools such as Talend and Apatar, and I was wondering what exactly the purpose/main goal of these tools is, in layman's terms. Who primarily uses them? And if you use them, how are they (from my understanding) better than just writing some type of script?

+4  A: 

ETL stands for "Extract/Transform/Load". These tools take data from one source and move it into another. You can map schemas from the source to the destination in custom ways, transform and cleanse data before it reaches the destination, and load the destination efficiently. You can schedule ETL jobs as batch processes.

Those data sources can be relational databases, spreadsheets, XML files, etc.


Who "uses" them? Depends on what you mean by "uses". They're just code, and most of the time they're scheduled as part of regular operations. There are no end-user features. They're strictly for programmers to create and for operations staff to run.

Advantage over scripts? None. They are scripts written in a domain-specific language (DSL) focused entirely on "extract" from source, "transform" and "load" to destination. Most of the interesting part of the script is the field-by-field mappings at each stage.
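
To make the "field-by-field mappings" point concrete, here is a minimal sketch of the same three stages written as a plain Python script. Everything in it is an assumption made for illustration: the CSV columns, the `customers` table, and the `FIELD_MAP` name are hypothetical, and a real ETL tool would express the mapping in its own DSL rather than in a dictionary.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a database.
SOURCE_CSV = """id,full_name,signup
1, Ada Lovelace ,2009-03-01
2,grace hopper,2009-03-02
"""

# Field-by-field mapping: destination column -> (source column, cleanup function).
FIELD_MAP = {
    "customer_id": ("id", int),
    "name": ("full_name", lambda s: s.strip().title()),
    "signup_date": ("signup", str.strip),
}

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: rename and cleanse each field according to FIELD_MAP."""
    return [
        {dest: clean(row[src]) for dest, (src, clean) in FIELD_MAP.items()}
        for row in rows
    ]

def load(rows, conn):
    """Load: bulk-insert the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:customer_id, :name, :signup_date)", rows
    )
    conn.commit()

def run_etl(conn):
    """Run the three stages in order and return what landed in the destination."""
    load(transform(extract(SOURCE_CSV)), conn)
    return conn.execute("SELECT * FROM customers ORDER BY customer_id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole pipeline against a throwaway in-memory database. Note that most of the interesting decisions live in `FIELD_MAP`, which is roughly the part an ETL tool lets you draw or declare instead of code.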

duffymo
@duffymo: I just had to jump in and add to an excellent foundation.
S.Lott
@S.Lott: I'm flattered that someone like you would think anything that I wrote was 'excellent'. Thanks for the improvement and continuing education. Been reading your blog - pretty awesome. If I could ever climb the Python learning curve fast enough I'd love to work with somebody like you.
duffymo
+1  A: 

ETL is commonly used in data warehousing applications.

For example, you might have an Oracle or SQL Server order processing system. This might keep all the data until the order is shipped, but you wouldn't want years' worth of old orders clogging up the system.

Additionally, you might have several systems like this in your company, all developed independently of each other.

So, to consolidate the historical data, you might set up a data warehouse where the data from all of these disparate systems ends up, giving you a nice place to do reporting, planning, data mining, etc.

Since all the data sources are different, and the kinds of data you want to store long-term might differ from the data you have in the smaller databases, you set up an ETL system to convert and manage the data flow.
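
As a rough illustration of that consolidation step, the sketch below merges rows from two order systems with different layouts into one warehouse table. The two source layouts, the `order_history` table, and the assumption that the second system writes US-style `MM/DD/YYYY` dates are all hypothetical; real systems would be read over database connections rather than from in-memory lists.

```python
import sqlite3

# Hypothetical rows from two independently developed order systems.
SYSTEM_A = [{"order_no": "A-100", "total_cents": 1999, "shipped": "2008-11-03"}]
SYSTEM_B = [{"OrderId": 7, "Amount": "24.50", "ShipDate": "03/12/2008"}]

def from_system_a(row):
    """System A already uses cents and ISO dates, so this is mostly renaming."""
    return ("A", row["order_no"], row["total_cents"], row["shipped"])

def from_system_b(row):
    """System B needs real conversion: decimal string to cents, US date to ISO."""
    month, day, year = row["ShipDate"].split("/")  # assumes MM/DD/YYYY
    cents = round(float(row["Amount"]) * 100)
    return ("B", str(row["OrderId"]), cents, f"{year}-{month}-{day}")

def build_warehouse(conn):
    """Create the shared warehouse table and load both sources into it."""
    conn.execute(
        "CREATE TABLE order_history "
        "(source TEXT, order_ref TEXT, amount_cents INTEGER, ship_date TEXT)"
    )
    rows = [from_system_a(r) for r in SYSTEM_A] + [from_system_b(r) for r in SYSTEM_B]
    conn.executemany("INSERT INTO order_history VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute(
        "SELECT source, order_ref, amount_cents, ship_date "
        "FROM order_history ORDER BY source"
    ).fetchall()
```

Each source gets its own small converter, and the warehouse schema is the one place where the formats agree. Keeping a `source` column makes it possible to trace a historical row back to the system it came from.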

Mark Harrison
+1  A: 

Let me point you to my answer to a related question.

runrig