etl

How to map between two code sets (enumerations) using Talend

Hi, suppose I have the following source table (called S):

Table S:
name   Gender_Code
Bob    0
Nancy  1
Ruth   1
David  0

And let's assume I also have a lookup values table (called S_gender_values):

Gender_Code  Gender_value
0            Male
1            Female

My goal is to create a target tabl...
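In Talend this kind of mapping is typically done with a tMap component using the lookup table as a lookup flow, but the underlying logic is just a key-based join. A minimal sketch in plain Python, using the exact rows and column names from the question:

```python
# Lookup-based code-set mapping, mirroring what a Talend tMap
# with S_gender_values as a lookup flow would do.
source = [("Bob", 0), ("Nancy", 1), ("Ruth", 1), ("David", 0)]  # Table S
gender_values = {0: "Male", 1: "Female"}                        # S_gender_values

# Replace each Gender_Code with its Gender_value to build the target rows.
target = [(name, gender_values[code]) for name, code in source]
print(target)
# [('Bob', 'Male'), ('Nancy', 'Female'), ('Ruth', 'Female'), ('David', 'Male')]
```

In SQL the same thing would be an inner join from S to S_gender_values on Gender_Code.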

SQL Server Integration Services - Incremental data load hash comparison

Using SQL Server Integration Services (SSIS) to perform incremental data load, comparing a hash of to-be-imported and existing row data. I am using this: http://ssismhash.codeplex.com/ to create the SHA512 hash for comparison. When trying to compare data import hash and existing hash from database using a Conditional Split task (expres...
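Independent of the ssismhash component, the general incremental-load technique is: hash the concatenated column values of each incoming row, compare against the hash stored for the existing row, and only update when the hashes differ. A minimal sketch (the key and row values here are made up for illustration):

```python
import hashlib

def row_hash(row):
    """SHA-512 over a delimited concatenation of the row's column values.
    The delimiter matters: without it ('ab','c') and ('a','bc') would collide."""
    joined = "\x1f".join(str(v) for v in row)
    return hashlib.sha512(joined.encode("utf-8")).hexdigest()

existing = {"key1": row_hash(("Bob", 0))}   # hashes already stored in the target
incoming = {"key1": ("Bob", 1),             # changed row -> update
            "key2": ("Nancy", 1)}           # new row     -> insert

for key, row in incoming.items():
    if key not in existing:
        print(key, "-> insert")
    elif existing[key] != row_hash(row):
        print(key, "-> update")             # hash differs: row data changed
    else:
        print(key, "-> skip")               # unchanged, no load needed
```

In SSIS the same three-way split is what the Conditional Split task expresses; the comparison in its expression must be between values of the same type (e.g. both byte arrays or both hex strings), which is a common source of errors there.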

What is Extract/Transform/Load (ETL)?

I've tried reading the Wikipedia article for "extract, transform, load", but that just leaves me more confused... Can someone explain what ETL is, and how it is actually done? ...
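In practice ETL is exactly the three steps in the name: read data out of a source system, reshape or clean it, and write it into a target (usually a warehouse). A toy end-to-end sketch, with in-memory CSVs standing in for the source and target and made-up column names:

```python
import csv
import io

# Extract: read raw rows from the source (an in-memory CSV here).
raw = io.StringIO("name,amount\nbob,10\nnancy,20\n")
rows = list(csv.DictReader(raw))

# Transform: clean and reshape to match the target schema.
transformed = [{"NAME": r["name"].title(), "AMOUNT": int(r["amount"])} for r in rows]

# Load: write the reshaped rows to the target (another in-memory CSV).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["NAME", "AMOUNT"])
writer.writeheader()
writer.writerows(transformed)
print(out.getvalue())
```

Tools like SSIS, Talend, or Informatica wrap the same three steps in graphical components, plus scheduling, error handling, and restartability.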

What is the required language knowledge to use Informatica effectively?

In the next few weeks, my company will be engaging multiple vendors to settle on a common global ETL tool - not necessarily one we could never move away from, but the one where our license investment will go, to consolidate those costs. Two of the major players are Talend and Informatica, with others that are unimportant for the sak...

ETL Performance Problem

I have an important problem running an ETL process in the production environment. While my ETL is running, the OLAP server becomes extremely slow; I think this is because the ETL is updating several existing rows in the fact table and adding new ones. I tried to avoid this problem by replicating the whole database, so that the ETL writes in DB1 and O...

Extracting Data Client Side

I need to be able to extract and transform data from a data source on a client machine and ship it off via a web service call to be loaded into our data store. I would love to be able to leverage SSIS, but the SQL Server licensing agreement prevents me from installing Integration Services on a client machine. Can I just provide the clie...

ETL - Checking for updated dimension data

I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings. In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form). There are curr...

How to create a view splitting one column into 2 or more using a regular expression?

We have a non-normalized table that contains foreign key information as free text inside a column. I would like to create a view that will transform and normalize that table. E.g. a column that contains the following text: "REFID:12345, REFID2:67890" I want to create a view that will have REFID1 and REFID2 as 2 separate integer colu...
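How to expose this as a database view depends on the RDBMS, but the parsing step itself is a simple regular expression over label:number pairs. A sketch using the exact text from the question (the function name is made up):

```python
import re

def split_refids(text):
    """Parse free text like 'REFID:12345, REFID2:67890' into integer
    fields keyed by their labels."""
    return {label: int(value) for label, value in re.findall(r"(\w+):(\d+)", text)}

print(split_refids("REFID:12345, REFID2:67890"))
# {'REFID': 12345, 'REFID2': 67890}
```

In a view you would apply the equivalent pattern per column, e.g. via the database's regex or string functions, casting each captured group to an integer.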

Best practice for operating on large amounts of data

I need to do a lot of processing on a table that has 26+ million rows:

- Determine the correct size of each column based on said column's data
- Identify and remove duplicate rows
- Create a primary key (auto-incrementing id)
- Create a natural key (unique constraint)
- Add and remove columns

Please list your tips on how to speed this process up ...

Nightly database restores - SSIS package - SQL Server 2005

We have an SSIS package that runs nightly: it takes backups of a couple of production databases, restores them to a staging database, removes sensitive information, and then restores the backup of this staging database on another server so that the Hyperion guys can run their jobs. The whole process used to take around 4 and a half ho...

Spring-Batch for a massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform a nightly / hourly data summary and statistics gathering on a massive amount of data. What I'd like to achieve is Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead. The framework must be ab...

MDF file size much larger than actual data

For some reason my MDF file is 154 GB; however, I only loaded 7 GB worth of data from flat files. Why is the MDF file so much larger than the actual source data? More info: Only a few tables with ~25 million rows. No large varchar fields (biggest is varchar(300); most are less than varchar(50)). Tables are not very wide (< 20 columns). Also, no...

Fastest technique for deleting duplicate data

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed. In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking ...
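For speed, the usual approach is a single pass that keeps one row per duplicate group rather than repeated DELETE scans; in SQL Server this is often done with ROW_NUMBER() OVER (PARTITION BY ...) and deleting where the row number exceeds 1. The idea, sketched in Python (column indices and sample rows are made up):

```python
def dedupe(rows, key_columns):
    """Keep the first row seen for each combination of the key columns.
    key_columns: indices of the columns that define a duplicate
    (e.g. 9 of the 10 columns in the question's second case)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        if key not in seen:       # first occurrence of this key -> keep it
            seen.add(key)
            kept.append(row)
    return kept

rows = [("a", 1, "x"), ("a", 1, "y"), ("b", 2, "z")]
print(dedupe(rows, key_columns=[0, 1]))
# [('a', 1, 'x'), ('b', 2, 'z')]
```

The set lookup makes this O(n) over the table, which is why partition-and-keep-one strategies tend to beat self-join deletes at the multi-million-row scale described.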

AS400 to Oracle 10g via XML with Informatica PowerCenter

Is the following workflow possible with Informatica PowerCenter? AS400 -> XML (in memory) -> Oracle 10g stored procedure (pass XML as param) Specifically, I need to take a result set, e.g. 100 rows, convert those rows into a single XML document as a string in memory, and then pass that as a parameter to an Oracle stored procedure that is ...

Switch from Web programming to data warehousing? Should I?

I was looking at a report on the internet saying that data warehousing is a lucrative and highly paid IT career. I am talking about technologies like Ab Initio, ETL, DataStage, Teradata. I work in ASP.NET and SQL Server 2005. Is it a good idea to move from web programming to data warehousing technologies? Since I would have no experience with data war...

Can I unit test Informatica PowerCenter workflows?

Can I unit test Informatica PowerCenter workflows? EDIT: More specifically, can I mock sources and targets and test the steps in between? E.g. if I have a workflow with an Oracle source and a text file target, can I test it without Oracle and a text file? ...

How to use EzAPI FlatFile Source in SSIS?

I am using the EzAPI to create a dataflow with FlatFile Source public class EzOleDbToFilePackage : EzSrcDestPackage<EzFlatFileSource, EzFlatFileCM, EzOleDbDestination, EzSqlOleDbCM> Using the example from http://blogs.msdn.com/b/mattm/archive/2008/12/30/ezapi-alternative-package-creation-api.aspx I am trying to use a flat file source....

SQL Server Management Studio: Import quietly ignoring 99.9% of data

The problem: I'm trying to import data into a table using SQL Server Management Studio's Import Data task. It only brings in 26 rows out of the original 49,325. (Edit: that's where 99.9% comes from: (1 - 26/49325) * 100 = 99.9%.) Using DTS in Enterprise Manager correctly brings in all 49,325 rows. Why is SSMS not importing all rows, reporting ...

Is this a proper idea of a BI workflow?

Hi all, I am new to Business Intelligence. I just got hired by a company to complete their web solution by implementing a BI module. After a lot of reading, I think I have an idea of what a BI process looks like; enclosed you'll find my idea of a BI process. Can you please tell me if this is a correct vision of the whole workflo...

Does anybody know the list of Pentaho Data Integration (Kettle) connectors?

Hi all, I am doing a comparison between three open-source ETL tools: Talend, Kettle, and CloverETL. I could find Talend's and CloverETL's connector lists with no problem, but I cannot find the one for Kettle. Does someone know them, or where can I find them? Thanks a lot, ...