I’ve been tasked with implementing a system for continuously receiving large amounts of data from a partner. The data schema on the partner’s side is different from ours, so some transformation has to take place when the data is received and imported into our database.

What’s the best approach to solve this task?

I have some control over both sides of the equation, i.e. if we choose to use WCF, our partner will implement that service. But it’s important to limit the amount of programming that has to take place on the partner’s side as much as possible. Ideally they would use some feature of SQL Server to export their data and then let us deal with the rest.

Other relevant factors:

  • SQL Server is used on both sides (2008 at our side, unknown for the partner).
  • .NET 4 and/or anything that comes out of the box from Microsoft, no 3rd party products.
  • Transfer is one way, i.e. from the partner to us only.
  • Amount of data is around tens of thousands of updated records/objects transferred daily.
  • The data model is fairly complicated, with several tables and relations.
  • The solution should be tolerant of changes on the partner’s side (we don’t know when and how they’ll change their model).
  • Maintainability and reliability are more important than performance, i.e. we don't need the latest data, but the service should be easy to modify and it can't fail/stop.

I’m asking this question because I’m overwhelmed by the number of ways this can be accomplished. As a .NET developer, I’m leaning towards implementing WCF or maybe even some custom XML-based exchange. But I know BizTalk would probably fit the bill here, and SQL Server probably has some nifty features built in that I don’t know about.

Any pointers, ideas and suggestions are most welcome!

+2  A: 

If you'd like a complete copy of their database, then log shipping is a good option; otherwise, check out replication, which gives you finer-grained control over what's copied.

You can copy across the data as-is and deal with the transformation logic on your side of the wire.
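For reference, a minimal sketch of what setting up transactional replication on the partner's (publisher) side might look like. All database, publication, table, and server names below are placeholders, and the snippet assumes a distributor has already been configured:

```sql
-- Run on the partner's (publisher) server; assumes the distributor is already set up.
USE PartnerDb;  -- placeholder database name
GO

-- Mark the database for publishing and create a transactional publication.
EXEC sp_replicationdboption @dbname = N'PartnerDb', @optname = N'publish', @value = N'true';
EXEC sp_addpublication
    @publication = N'PartnerDataPub',      -- placeholder publication name
    @repl_freq   = N'continuous',
    @status      = N'active';

-- Publish only the tables you need (one sp_addarticle call per table).
EXEC sp_addarticle
    @publication   = N'PartnerDataPub',
    @article       = N'Orders',            -- placeholder table name
    @source_object = N'Orders';

-- Create a push subscription towards your server.
EXEC sp_addsubscription
    @publication       = N'PartnerDataPub',
    @subscriber        = N'OurServer',     -- placeholder subscriber server
    @destination_db    = N'StagingDb',     -- placeholder staging database
    @subscription_type = N'push';
```

Publishing only selected articles, rather than log-shipping the whole database, keeps the partner's private tables out of your copy.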

Will A
I think a more service-oriented architecture is what we're looking for: if we maintain a complete copy of their database, it also means we'd have to make changes whenever they change their model. There has to be some kind of middle contract that defines the data and its structure.
Jakob Gade
Both log shipping and replication will apply schema changes - so that might not be a breaking issue.
Will A
Yeah, but then we'd have a database that "magically" changes overnight, possibly breaking our import/transformation code. :)
Jakob Gade
True - in which case you'll definitely need a contract between yourselves and the remote client. In this case, SSIS (as suggested by Vidar) is probably going to be a great choice for you in consuming the data, transforming it, and populating your local table(s).
Will A
+2  A: 

In addition to Will's suggestions, you can take a look at SSIS (SQL Server Integration Services). With this you can export to an FTP site and then import it on the other side.
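To illustrate the import half of that idea: once the exported file has been pulled down from the FTP site, loading it into a staging table can be a single BULK INSERT. The file path, table name, and CSV layout below are assumptions for the sketch:

```sql
-- Load a flat file the partner exported; path, table, and format are hypothetical.
BULK INSERT StagingDb.dbo.PartnerOrders
FROM 'C:\Imports\partner_orders.csv'
WITH (
    FIELDTERMINATOR = ',',   -- assumed comma-separated export
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2      -- assumed header row to skip
);
```

The transformation into your own schema would then run as a separate SSIS data flow against the staging table.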

Vidar Nordnes
A: 

Google SQL Server ETL.

vulkanino
Do you have a link? Google comes up a bit vague on this topic. But I don't think we'd want to take a dependency on any Google library for this project.
Jakob Gade
@Jakob: vulkanino was suggesting you Google "SQL Server ETL", not use a Google library that performs ETL.
Will A
@Will: D'oh, didn't catch that at all. What a difference a few quotes can make. I'm an idiot. Thanks. :)
Jakob Gade
Sorry for being so unclear! I use "google" as a verb, which of course it is not. :)
vulkanino
A: 

Here is how I'd probably do it:

  1. Maintain a local copy of your partner's database. Let's call this staging, because you'll transform the data here. Use Data Compare (a VS 2010 feature which you can of course automate) or Replication to achieve this. I'm not sure if Replication triggers CDC, but Data Compare sure does.
  2. Enable Change Data Capture on the staging database. This gives you only the changed records, so you can selectively transform the data and update your database.
  3. Build and deploy an SSIS job that scans the CDC tables, transforms the data and loads into your database.
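
The CDC part of these steps could look roughly like this; the database, table, and capture-instance names are placeholders:

```sql
-- 1. Enable Change Data Capture on the staging database and on each table of interest.
USE StagingDb;  -- placeholder staging database
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema    = N'dbo',
    @source_name      = N'Orders',        -- placeholder table name
    @role_name        = NULL,             -- no gating role for this sketch
    @capture_instance = N'dbo_Orders';

-- 2. Later, the SSIS job reads only the rows that changed in a given LSN window.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_Orders');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

In practice the job would persist the last LSN it processed and use that as `@from_lsn` on the next run, so each execution picks up only new changes.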

As far as I know, Data Compare works well when the schema is changed (it ignores schema changes). CDC also works well; all you need to do when there is a schema change is create a second capture instance and remove the first one.

Josh