I'm tasked with providing a list of metadata requirements our data warehouse developers might need.

This is not the business metadata (nice descriptions, etc.), but rather the data required for change management (also known as impact assessment), data lineage, and the like.

I've seen the article Meta Meta Data Data by Ralph Kimball, but as I'm not the first person to do this, I'm throwing it out to the SO community.

The actual question is this: What metadata do data warehouse developers require to design, develop and manage change in ETL routines?

PS: I'm trying to keep the answer platform-agnostic, but for context this is an Oracle database with PL/SQL and DataStage.

+1  A: 

At the most rudimentary level you could maintain a listing of the tablename.fieldname pairs that are extracted from the source system. If you are extracting from files or other non-database sources, you may or may not be able to list the dependencies at the field level. Such metadata only tells you that if a source field/file format/table changes, then some change may be needed in the ETL process. I say may because some changes are minor enough that the ETL process is not broken, but you should still perform testing and data profiling to ensure the change doesn't invalidate any assumptions that were in place during the initial design of the ETL process.
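As a rough illustration, such a listing could be as simple as the table below. All names here are my own invention, not something from the question.

    -- Hypothetical, minimal listing of the source fields the ETL extracts.
    CREATE TABLE etl_source_dependency (
        source_system  VARCHAR2(50)   NOT NULL,  -- e.g. 'CRM' or 'nightly flat file'
        source_object  VARCHAR2(128)  NOT NULL,  -- table name or file format
        source_field   VARCHAR2(128)  NOT NULL,  -- use a placeholder such as '*' when only
                                                 -- file-level granularity is known
        notes          VARCHAR2(4000),
        CONSTRAINT pk_etl_source_dependency
            PRIMARY KEY (source_system, source_object, source_field)
    );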

While the ETL developer may be intimately familiar with the source system and ETL process, for large ETL projects it would probably be wise to tie each dependency in the source system to specific processes/components of the ETL system. For example, if your ETL process is composed of many stored procedures, then you would want the metadata to relate each source system field/file format/table to each stored procedure. This is a many-to-many relationship, since many stored procedures might depend on a particular field, and a single stored procedure may depend on many source fields. This would be something manually updated by the ETL developers: as they create or update the ETL system, it would be their responsibility to recognize which fields they are loading/cleaning/conforming and record that information in the metadata that tracks these dependencies.
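A sketch of that many-to-many relationship, again with invented names:

    -- Hypothetical junction table: which ETL component touches which source field.
    CREATE TABLE etl_component_dependency (
        etl_component  VARCHAR2(128) NOT NULL,   -- stored procedure or job name
        source_object  VARCHAR2(128) NOT NULL,
        source_field   VARCHAR2(128) NOT NULL,
        recorded_by    VARCHAR2(30)  DEFAULT USER,
        recorded_on    DATE          DEFAULT SYSDATE,
        CONSTRAINT pk_etl_component_dependency
            PRIMARY KEY (etl_component, source_object, source_field)
    );

    -- The developer records each dependency as the component is written or changed.
    INSERT INTO etl_component_dependency (etl_component, source_object, source_field)
    VALUES ('sp_Person_Load', 'Person', 'FirstName');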

If your ETL system is powered by something besides stored procedures, or some combination thereof, then you will need to come up with a naming scheme to reference these components. But the concept is the same: you relate source fields/file formats/etc. to the components of the ETL system.

This will give you enough information to say that when "Person.FirstName" changes in the source system in some way, you can compile a report showing all the stored procedures that will need to be verified, potentially updated, and tested to cope with the change.
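Assuming a dependency table like the sketch above, that report is a simple query:

    -- Which ETL components need review if Person.FirstName changes?
    SELECT etl_component
    FROM   etl_component_dependency
    WHERE  source_object = 'Person'
    AND    source_field  = 'FirstName'
    ORDER BY etl_component;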

This implies that knowing Person.FirstName changed in some way (size, datatype, or being removed entirely) requires a manual step: a database designer notifies you of the change, and you take action in response. If you want a really sophisticated system, you'd need triggers/auditing on DDL changes to your source system so that you can automatically log the change and notify your ETL architects.
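On Oracle, one way to get that automatic logging is a schema-level DDL trigger on the source schema. This is only a sketch, and it assumes you are allowed to create such a trigger there:

    -- Sketch only: log ALTER/DROP events against the source schema.
    CREATE TABLE source_ddl_log (
        event_time   DATE DEFAULT SYSDATE,
        event_type   VARCHAR2(30),   -- ALTER, DROP, ...
        object_type  VARCHAR2(30),
        object_name  VARCHAR2(128),
        changed_by   VARCHAR2(30)
    );

    CREATE OR REPLACE TRIGGER trg_source_ddl_audit
    AFTER ALTER OR DROP ON SCHEMA
    BEGIN
        INSERT INTO source_ddl_log (event_type, object_type, object_name, changed_by)
        VALUES (ora_sysevent, ora_dict_obj_type, ora_dict_obj_name, ora_login_user);
        -- A notification to the ETL architects (mail, queue, etc.) could be hooked in here.
    END;
    /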

Should such a change occur, you'd know that the sp_Person_Load, sp_Person_Clean, and sp_Person_Transform stored procedures all have some dealings with the Person.FirstName field, because the author of those stored procedures noted that in the metadata documenting the dependencies.

You could make it more complicated, where sp_Person_Clean doesn't depend on Person.FirstName directly, but instead depends on sp_Person_Load, so that you build a chain of dependencies. This makes reporting on changes more complicated, because you have to chain dependencies together to determine the impact of a source system change. You are also building a complex web of dependencies, with potential circular references, that might make maintaining the metadata as difficult as maintaining the ETL process itself. If the ETL system is simple enough that the ETL architect can define dependencies directly in terms of the fields/files/tables of the source system, then do that to keep things simple.
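If you do go down that road, a hierarchical query can walk the chain for you, and NOCYCLE guards against the circular references mentioned above. A sketch against a hypothetical component-to-component table:

    -- Hypothetical component-to-component dependency table.
    CREATE TABLE etl_component_chain (
        component             VARCHAR2(128) NOT NULL,
        depends_on_component  VARCHAR2(128) NOT NULL,
        CONSTRAINT pk_etl_component_chain
            PRIMARY KEY (component, depends_on_component)
    );

    -- Everything downstream of sp_Person_Load, directly or indirectly.
    SELECT component, LEVEL AS depth
    FROM   etl_component_chain
    START WITH depends_on_component = 'sp_Person_Load'
    CONNECT BY NOCYCLE PRIOR component = depends_on_component;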

Depending on who has authority over your data warehouse (the destination system), you may need to track these dependencies too. Often the ETL developer is also the data warehouse developer. However, if someone else designs the data warehouse and has the authority to make changes to it, then your ETL developers will need a similar system, whether automated or manual, to track changes, be notified of them, and assess the impact they will have on the ETL process.

Really, when I think about how changes should be tracked, I think about boundaries of authority. If an ETL developer changes his sp_Person_Clean procedure, he doesn't need metadata to tell him that sp_Person_Transform will need to be updated/tested; he already knows this intuitively. On the other hand, if a third-party/vendor system changes, or if a business department within the same organization changes a spreadsheet or file format, those changes are not enacted by the ETL developer. He won't have the same level of intimacy with the source systems as he does with his own ETL system and data warehouse, so he will benefit the most from metadata that shows how components of the source system relate to components of the ETL system.

Deciding how granular you want to define "components" will really depend on how the systems are designed and how much metadata you want the developer to document. Too granular and you drown in the metadata. Too coarse and it is like saying "My house is in Florida", which doesn't really help the ETL developer in his work. If the entire ETL process is coded in a single SQL script, then you only have one component, so the system needs to be designed in advance knowing that you need to be able to refer to specific components and/or steps of the ETL process.

The metadata will only be any good if the developer is diligent in updating it. There are systems/toolkits that can maintain this kind of metadata automatically, but they have to box you into their toolkit so that the tool can analyze the dependencies, and I have little confidence that such systems are very reliable. You often have to do really hackish things to get data from a source system into the destination in the desired format, and I can imagine a dependency analyzer failing to understand those dependencies. For example, if you build dynamic SQL from strings, the toolkit can't really figure out what the dependencies are.
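A contrived PL/SQL sketch of what I mean (it reads the hypothetical etl_component_dependency table from earlier): the object actually queried exists only as a string at run time, so a static analyzer has nothing to latch onto.

    DECLARE
        v_table  VARCHAR2(128);
        v_count  NUMBER;
    BEGIN
        -- The table name is looked up, not written in the code.
        SELECT source_object
        INTO   v_table
        FROM   etl_component_dependency
        WHERE  etl_component = 'sp_Person_Load'
        AND    ROWNUM = 1;

        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || v_table INTO v_count;
        DBMS_OUTPUT.PUT_LINE(v_table || ': ' || v_count || ' rows');
    END;
    /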

However, I will say I have never dug into the tools in enough depth to really know how good they are. I always get to the point where things that would normally be easy in SQL are very cumbersome in the tool, and decide it is more trouble than it is worth. I keep telling myself I will pick up something like Talend and really become an expert at it before I pass my final verdict, but there are always other priorities pulling me in other directions.

AaronLS
Thanks, this has given me a good starting point. I will post my conclusions when I've finished.
Paul James
+2  A: 

At my workplace, we have a home-brew ETL system. I can see you raising a brow :). The minimal metadata we keep describes the following: subscription details, audit, data mapping, and run order.

The subscription details fall into two categories: the vendor from whom the data was purchased, and the teams/applications using it. The FTP/HTTP details and access credentials are also stored. Fortunately, we were asked to have absolutely zero stored procedures, the major exception being "identity generators".

Audit details include the date of the data, the last modified time, the user who ran the load, and failure/success counts.

The data-mapping table describes the tables and column names which hold the data. We used to have an additional composite-key descriptor table, but we decided to do away with it; the performance loss was compensated for by asking the data table owners to create a proper partitioning strategy.

The run-order table determines whether a user can run a subscription (Y/N) and the order in which runs can occur.
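A rough sketch of what the data-mapping and run-order tables could look like (the column names are my guess for illustration, not the actual schema):

    -- Illustrative only.
    CREATE TABLE data_mapping (
        subscription_id  NUMBER        NOT NULL,
        source_field     VARCHAR2(128) NOT NULL,
        target_table     VARCHAR2(128) NOT NULL,
        target_column    VARCHAR2(128) NOT NULL,
        CONSTRAINT pk_data_mapping
            PRIMARY KEY (subscription_id, source_field)
    );

    CREATE TABLE run_order (
        subscription_id  NUMBER  NOT NULL,
        can_run          CHAR(1) DEFAULT 'Y' CHECK (can_run IN ('Y', 'N')),
        run_sequence     NUMBER  NOT NULL,
        CONSTRAINT pk_run_order PRIMARY KEY (subscription_id)
    );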

The metadata is also stored with a history (keyed by date), so if anyone decides to run an archived/historic subscription, the run can go ahead.

Uses of the above: we can prioritize data loads based on the importance of a subscription; we can monitor failures at a generic level (a bird's-eye view); and we can write generic code that builds dynamic SQL queries (no hard-coding). Our load and extract processes are forced to use the data-mapping table, so no user can get away with stale information.
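The "generic code that builds dynamic SQL" point might look something like this in PL/SQL, driven by the hypothetical data_mapping table sketched above (target_person and staging_person are invented names):

    -- Build the column list from the mapping metadata instead of hard-coding it.
    DECLARE
        v_cols  VARCHAR2(4000);
        v_sql   VARCHAR2(4000);
    BEGIN
        FOR r IN (SELECT target_column
                  FROM   data_mapping
                  WHERE  subscription_id = 42        -- hypothetical subscription
                  ORDER  BY target_column)
        LOOP
            v_cols := v_cols
                      || CASE WHEN v_cols IS NULL THEN '' ELSE ', ' END
                      || r.target_column;
        END LOOP;

        v_sql := 'INSERT INTO target_person (' || v_cols || ') '
              || 'SELECT ' || v_cols || ' FROM staging_person';
        EXECUTE IMMEDIATE v_sql;
    END;
    /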

This has worked well so far in our experience.

questzen
Thanks, good answer.
Paul James