I am supposed to perform ETL where the source is a large and badly designed SQL Server 2000 database and the target is a better-designed SQL Server 2005 database. I think SSIS is the way to go. Can anyone suggest a to-do list or checklist of things to watch out for so that I don't forget anything? How should I approach this so that it does not bite me later on?

+1  A: 

We're doing a huge ETL (moving a client from legacy AS400 apps to Oracle EBS), and we actually have a process that (with modifications) I can recommend:

  1. Identify the critical target tables/fields.
  2. Identify the critical source tables/fields.
  3. Work with the business users to map source to target.
  4. Analyze the source data for quality issues.
  5. Determine who's responsible for data quality issues identified.
  6. Have responsible parties clean up the data in the source.
  7. Develop the actual ETL based on the information from steps 1 - 3.

The trickiest steps are 2 & 3 in my experience - it's sometimes difficult to get the business users to correctly identify all the bits they need in one pass, and it can be even harder to properly identify exactly where the data is coming from (though that may have something to do with the cryptic file and field names I'm seeing!). However, this process should help you avoid major misses.

Harper Shelby
+3  A: 

I have experience with ETL processes pulling data from 200+ distributed databases to a central database on a daily, weekly, monthly and yearly basis. It is a massive amount of data and there are many issues we have had specific to our situation. But as I see it, there are several items to think about regardless of the situation:

  • Make sure that you take file locks into consideration, both on the source and destination side. Making sure that other processes do not have the files locked (and removing those locks if necessary and it makes sense) is important.

  • Lock the files for yourself. Make sure, especially on the source side, that you lock the files while pulling out the data so that you do not get half-updated data.

  • If at all possible, pull deltas, not all of the data. Take a copy of the data and then pull only rows that have changed instead of everything. The larger your data set, the more important this becomes. Look at journals and triggers if you have to. As it becomes more important to have this data on a regular basis, this is probably the number one piece of advice I would give you, even if it adds a significant amount of time to the project (see the sketch after this list).

  • Keep an execution log. Make sure you know when it worked and when it didn't; throwing specific errors in the process can really help in debugging.

  • Document, document, document. If you build this right, you are going to build it and then not think about it for a long time. But you can be guaranteed that you or someone else will need to come back to it at some point to enhance it or fix a bug. Documentation is key in these situations.
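
On the delta point above: a minimal sketch of a watermark-driven delta pull, assuming the source table has a reliable last-modified column (all table, schema and column names here are hypothetical):

    -- Hypothetical watermark-driven delta extract; names are illustrative only.
    -- etl.ExtractWatermark stores the last successfully extracted timestamp per table.
    DECLARE @LastExtract DATETIME, @CurrentExtract DATETIME;

    SELECT @LastExtract = LastExtractedAt
    FROM etl.ExtractWatermark
    WHERE TableName = 'dbo.Orders';

    SET @CurrentExtract = GETDATE();

    -- Pull only the rows that changed since the previous run into staging.
    INSERT INTO staging.Orders (OrderID, CustomerID, OrderDate, Amount)
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM SourceDb.dbo.Orders
    WHERE ModifiedAt > @LastExtract
      AND ModifiedAt <= @CurrentExtract;

    -- Advance the watermark only after the extract has succeeded.
    UPDATE etl.ExtractWatermark
    SET LastExtractedAt = @CurrentExtract
    WHERE TableName = 'dbo.Orders';

If the source has no reliable timestamp, the same pattern can be driven from a journal or trigger-maintained change table instead.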

HTH, I'll update this if I think of anything else.

Ryan Guill
+12  A: 

Some general ETL tips

  1. Consider organising it by destination (for example, all the code to produce the Customer dimension lives in the same module, regardless of source). This is sometimes known as Subject-oriented ETL. It makes finding stuff much easier and will increase the maintainability of your code.

  2. If the SQL2000 database is a mess, you will probably find that SSIS data flows are a clumsy way to deal with the data. As a rule, ETL tools scale poorly with complexity; something like half of all data warehouse projects in finance companies are done with stored procedure code as an explicit architectural decision - for precisely this reason. If you have to put a large amount of code in sprocs, consider putting all of the code in sprocs.

    For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.

  3. The sweet spot of ETL tools is on systems where you have a larger number of data sources with relatively simple transformations.

  4. Make the code testable, so you can pick apart the components and test in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.

  5. Make the data extract dumb, with no business logic, and copy it into a staging area. If you have business logic spread across the extract and transform layers, you will have transformations that cannot be tested in isolation, and it becomes hard to track down bugs. If the transform runs from a staging area, you reduce the hard dependency on the source system, again enhancing testability. This is a particular win on sproc-based architectures as it allows an almost completely homogeneous code base (see the sketch after this list).

  6. Build a generic slowly-changing dimension handler or use one off the shelf if available. This makes it easier to unit test this functionality. If this can be unit tested, the system testing does not have to test all of the corner cases, merely whether the data presented to it is correct. This is not as complex as it sounds - The last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.

  7. Load incrementally if possible.

  8. Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Also, assertion checking is a good way to think of error handling for this (does the row count in A equal the row count in B; is the A:B relationship really 1:1).

  9. Use synthetic keys. Using natural keys from the source systems ties your system to the data sources and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors or 'not recorded' values, make specific 'Error' or 'Not recorded' entries in the dimension table and match to them.

  10. If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.
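
To make point 5 concrete, a deliberately dumb extract into staging might look something like this (names are hypothetical; the only job of this step is to copy data, with every transformation deferred to later layers):

    -- Hypothetical staging load: a straight 1:1 copy with no business logic.
    TRUNCATE TABLE staging.Customer;

    INSERT INTO staging.Customer (CustomerNo, CustomerName, Region, ModifiedAt)
    SELECT CustomerNo, CustomerName, Region, ModifiedAt
    FROM SourceDb.dbo.Customer;
    -- All scrubbing, de-duplication and business rules happen downstream,
    -- reading only from the staging schema.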

Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.

Point 3 is a counterpoint to point 2. Basically the choice between SQL and ETL tooling is a function of transformation complexity and number of source systems. The simpler the data and larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other but not both.

Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.

Point 6 makes your system easier to test. Testing SCDs or any change-based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
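
For reference, the core of a Type 2 handler of the sort point 6 describes can be quite small. A rough sketch in SQL 2005-era T-SQL (no MERGE), with hypothetical table and column names:

    -- Hypothetical Type 2 slowly-changing-dimension step (SQL 2005 syntax, no MERGE).
    -- 1. Expire the current version of any customer whose tracked attributes changed.
    UPDATE d
    SET d.EffectiveTo = GETDATE(), d.IsCurrent = 0
    FROM dim.Customer AS d
    JOIN staging.Customer AS s ON s.CustomerNo = d.CustomerNo
    WHERE d.IsCurrent = 1
      AND (d.CustomerName <> s.CustomerName OR d.Region <> s.Region);

    -- 2. Insert a new current version for changed customers and for new ones.
    INSERT INTO dim.Customer (CustomerNo, CustomerName, Region, EffectiveFrom, EffectiveTo, IsCurrent)
    SELECT s.CustomerNo, s.CustomerName, s.Region, GETDATE(), NULL, 1
    FROM staging.Customer AS s
    LEFT JOIN dim.Customer AS d
           ON d.CustomerNo = s.CustomerNo AND d.IsCurrent = 1
    WHERE d.CustomerNo IS NULL;

A production version would also handle NULL-safe comparisons and cover more columns, which is where the extra few hundred lines go.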

Point 8 is germane to any headless process. If it goes tits up during the night, you want some fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
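
A small run-log table plus a row-count assertion is usually enough to give you that fighting chance (again, all names here are hypothetical):

    -- Hypothetical instrumentation: record counts and fail loudly when an assertion breaks.
    DECLARE @BatchID INT, @SourceRows INT, @LoadedRows INT;
    SET @BatchID = 42;  -- in practice this comes from the job that drives the load

    SELECT @SourceRows = COUNT(*) FROM staging.Orders;
    SELECT @LoadedRows = COUNT(*) FROM fact.Orders WHERE LoadBatchID = @BatchID;

    INSERT INTO etl.RunLog (BatchID, StepName, SourceRows, LoadedRows, LoggedAt)
    VALUES (@BatchID, 'Load fact.Orders', @SourceRows, @LoadedRows, GETDATE());

    -- Assertion: every staged row should have landed in the fact table.
    IF @SourceRows <> @LoadedRows
        RAISERROR('Row count mismatch loading fact.Orders: staged %d, loaded %d.', 16, 1,
                  @SourceRows, @LoadedRows);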

Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
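
A sketch of what point 9 looks like in the schema (hypothetical names): the dimension carries its own surrogate key, and explicit 'Not recorded' and 'Error' members mean fact rows never need a NULL key:

    -- Hypothetical dimension with a synthetic (surrogate) warehouse key.
    CREATE TABLE dim.Product (
        ProductKey   INT IDENTITY(1,1) PRIMARY KEY,  -- synthetic key owned by the warehouse
        ProductNo    VARCHAR(20) NOT NULL,           -- natural key from the source system
        ProductName  VARCHAR(100) NOT NULL,
        SourceSystem VARCHAR(20) NOT NULL
    );

    -- Explicit members for missing or erroneous references.
    SET IDENTITY_INSERT dim.Product ON;
    INSERT INTO dim.Product (ProductKey, ProductNo, ProductName, SourceSystem)
    SELECT -1, 'N/A', 'Not recorded', 'N/A';
    INSERT INTO dim.Product (ProductKey, ProductNo, ProductName, SourceSystem)
    SELECT -2, 'N/A', 'Error', 'N/A';
    SET IDENTITY_INSERT dim.Product OFF;

    -- In the fact load, unmatched source rows map to the 'Not recorded' member:
    --   COALESCE(p.ProductKey, -1) AS ProductKey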

Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.

ConcernedOfTunbridgeWells
This is incredibly good advice.
Peter
+2  A: 

Well, I'm developing an ETL process for the company where I work.

We are working with SSIS, using the API to generate and build our own .dtsx packages.

SSIS is not friendly for managing errors. Sometimes you get an "OleDb Error" that could have a lot of different meanings depending on the context.

Read the API documentation (it doesn't say much).

Some links to help you get started:

http://technet.microsoft.com/de-de/library/ms135932(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345167.aspx

http://msdn.microsoft.com/en-us/library/ms403356.aspx

http://www.codeproject.com/KB/database/SSISProgramming.aspx?display=PrintAll&fid=382208&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=26&select=2551674

http://www.codeproject.com/KB/database/foreachadossis.aspx

http://wiki.sqlis.com/default.aspx/SQLISWiki/ComponentErrorCodes.html

http://technet.microsoft.com/en-us/library/ms187670.aspx

http://msdn.microsoft.com/ja-jp/library/microsoft.sqlserver.dts.runtime.foreachloop.foreachenumerator.aspx

http://www.sqlis.com/post/Handling-different-row-types-in-the-same-file.aspx

http://technet.microsoft.com/en-us/library/ms135967(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms137709(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms345164(SQL.90).aspx

http://msdn.microsoft.com/en-us/library/ms141232.aspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx

http://www.ivolva.com/ssis_code_generator.html

http://www.ivolva.com/ssis_wizards.html

http://www.codeplex.com/MSFTISProdSamples

http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/Q_23972361.html

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&PostID=1404157

http://msdn.microsoft.com/en-us/library/aa719592(VS.71).aspx

http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&ForumID=80

http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx

http://blogs.conchango.com/jamiethomson/archive/2007/03/13/SSIS_3A00_-Property-Paths-syntax.aspx

http://search.live.com/results.aspx?q=%s&go=Buscar&form=QBJK&q1=macro%3Ajamiet.ssis

http://toddmcdermid.blogspot.com/2008/09/using-performupgrade.html?showComment=1224715020000

http://msdn.microsoft.com/en-us/library/ms136082.aspx

http://support.microsoft.com/kb/839279/en-us

Sorry for the "spam", but they are very useful to me.

Ironicnet
Oh wow... I am sure that's all your bookmarks!!
Perpetualcoder
They are! In moments like this I wonder why I don't have a delicious account. The basic sites for SSIS are http://blogs.conchango.com/jamiethomson/, http://www.sqlis.com/, and http://www.sqlservercentral.com/ (check the forum). That's a nicer selection. =)
Ironicnet
A: 

This thread is old, but I want to draw your attention to ConcernedOfTunbridgeWells' answer. It is incredibly good advice, on all points. I could reiterate a few, but that would diminish the rest, and they all deserve close study.

Peter
Why thank you ;-) I'm not entirely sure it's appropriate to upvote you just for endorsing my answer but I agree that the points made are indeed good advice - if I say so myself ...
ConcernedOfTunbridgeWells
Yes, those tips are really good and pragmatic.
Pedro