views: 141
answers: 1
I'm trying to set up an automated process to regularly transform and export a large MS SQL 2008 database to MongoDB.

There is not a 1-1 correspondence between tables in SQL and collections in MongoDB -- for example, the Address table in SQL is translated into an array embedded in each customer's record in Mongo, and so on.

Right now I have a 3-step process:

  1. Export all the relevant portions of the database to XML using a FOR XML query (roughly sketched below).
  2. Translate XML to mongoimport-friendly JSON using XSLT
  3. Import to mongo using mongoimport
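
Step 1's query looks roughly like this (a simplified sketch -- the table and column names are placeholders, not my real schema). Nested FOR XML subqueries emit each customer's addresses and orders as child elements, so the XML already mirrors the target document shape:

    SELECT  c.CustomerId        AS [@id],
            c.FirstName         AS [Name/First],
            c.LastName          AS [Name/Last],
            (SELECT a.Street, a.City, a.Zip
             FROM   Address a
             WHERE  a.CustomerId = c.CustomerId
             FOR XML PATH('Address'), TYPE) AS [Addresses],
            (SELECT o.OrderId, o.OrderDate
             FROM   [Order] o
             WHERE  o.CustomerId = c.CustomerId
             FOR XML PATH('Order'), TYPE)   AS [Orders]
    FROM    Customer c
    FOR XML PATH('Customer'), ROOT('Customers');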

The bottleneck right now seems to be step 2: XML->JSON conversion for 3 million customer records (each with demographic info and embedded address and order arrays) takes hours with libxslt.
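
For reference, step 2 needs to emit documents in roughly this shape (field names are just illustrative; shown wrapped here for readability, but mongoimport expects each document on a single line):

    { "_id" : 12345, "firstName" : "...", "lastName" : "...",
      "addresses" : [ { "street" : "...", "city" : "...", "zip" : "..." } ],
      "orders" : [ { "orderId" : 987, "orderDate" : "2010-06-01" } ] }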

It seems hard to believe that there's not already some pre-built way to do this, but I can't seem to find one anywhere.

Questions:

A) Are there any pre-existing utilities I could use to do this?
B) If not, is there a way I could speed up my process?
C) Am I approaching the whole problem the wrong way?

A: 

Another approach is to go through each table and add information to Mongo on a record-by-record basis, and let Mongo do the denormalizing! For instance, to load phone numbers, just walk the phone number table and do a '$addToSet' of each number into the owning customer's record.
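
In shell terms each source row becomes one upsert, something like this (collection and field names assumed):

    // one upsert per row of the SQL phone-number table; the upsert flag
    // (third argument) creates the customer document the first time a
    // row for that customer appears, and $addToSet skips duplicates
    db.customers.update(
        { _id : 12345 },                          // the SQL primary key
        { $addToSet : { phones : "555-0100" } },
        true                                      // upsert
    );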

You can also do this in parallel, handling the tables separately. That may speed things up, but may 'fragment' the mongo database more.

You may want to add any required indexes before you start; otherwise, building the indexes at the end may be a large delay.
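
For example (with an assumed secondary key -- _id itself is always indexed):

    // declare any secondary key the upsert queries (or later lookups)
    // will hit before the bulk load starts
    db.customers.ensureIndex({ customerId : 1 });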

Amala
OP: I like that idea! Based on your suggestion, I'm creating an SSIS destination component that creates JSON appropriate for upserting with mongoimport (which I assume uses $addToSet). The idea is that to update one collection (e.g. the customers collection), I'd end up with several JSON files (for phone numbers, addresses, demographic info) and then mongoimport them one by one.
Amala: Mongoimport will not use $addToSet; that is an update operator. You will have to write custom code to do it. The downside of my approach is that it requires custom code, but it should be very easy: I would suggest using your favorite programming language and simply going through each table, doing the update as an upsert. That way you can do an upsert with $addToSet for every single record (a sketch follows below). I honestly don't know if it will save time. If it will be automated, will you do only diffs? You could do some sort of trigger.
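
(For illustration, a minimal sketch of that per-table loop. This uses the current official C# driver's API, not the mongodb-csharp library mentioned further down, and phoneRows is a hypothetical stream of rows read from SQL Server:)

    using MongoDB.Bson;
    using MongoDB.Driver;

    var customers = new MongoClient("mongodb://localhost:27017")
        .GetDatabase("crm")
        .GetCollection<BsonDocument>("customers");

    // one upsert per SQL row; $addToSet keeps the load idempotent,
    // since re-running a row is a no-op if the value is already there
    foreach (var row in phoneRows)   // hypothetical (CustomerId, Number) pairs
    {
        customers.UpdateOne(
            Builders<BsonDocument>.Filter.Eq("_id", row.CustomerId),
            Builders<BsonDocument>.Update.AddToSet("phones", row.Number),
            new UpdateOptions { IsUpsert = true });
    }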
OP: I made a custom importer that, given a JSON file, $addToSet's each collection and $set's each property (using $findToModify). Completely unoptimized, the initial version is pretty slow when tested with 4,000,000 records (the whole process takes around an hour (!)). I think I'm definitely going to have to sync only diffs (probably using triggers and timestamp fields). Before knowing anything for sure I need to optimize all this. Interestingly, the current bottleneck seems to be CPU (100% when running this)... which makes me pretty sure I have some issues in my code, heh.
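
(A minimal T-SQL sketch of that diff-tracking idea, with assumed table and column names. Each sync would export only rows whose LastModified is past the previous run's high-water mark. Note the trigger's own UPDATE won't re-fire it while the database's RECURSIVE_TRIGGERS option is OFF, which is the default:)

    ALTER TABLE Customer
        ADD LastModified DATETIME NOT NULL DEFAULT GETUTCDATE();
    GO
    -- stamp every inserted or updated row
    CREATE TRIGGER trg_Customer_Touch ON Customer
    AFTER INSERT, UPDATE
    AS
        UPDATE c SET LastModified = GETUTCDATE()
        FROM Customer c
        JOIN inserted i ON i.CustomerId = c.CustomerId;
    GO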
Amala: What is $findToModify? That is just a standard update, right? You can do the update with only the ID as the query object. We also have some slow processes: we modify 6 million records once a month and it takes 18 hours, because there is a lot of field-by-field logic that I have just left to Mongo, since the time is not critical. You may not need to optimize much if you are doing diffs. Also, you should accept an answer on this question.
OP: Oh, sorry, new to the site -- accepting the answer now. I think findToModify is the underlying db operation. I'm using a C# lib that didn't have explicit support for modifiers on updates (that I could see). It will allow you to manually send commands, though. The command ends up being:

    findToModify { "collection" : "customer", { "_id" : 2 },
        { "$addToSet" : { "addresses" : { "$each" : [ {address fields...}, ...] } } } }

I'm hoping that just updates the record. Even with some minor adjustments it's already running much faster -- and with diffs the total record count will be small. Thanks again!
Amala: The C# driver does have an update: http://github.com/samus/mongodb-csharp. I think the findToModify you are using is this: http://www.mongodb.org/display/DOCS/findandmodify+Command, which is a query with an atomic update that returns the document. That is probably slower than a simple update. The update command takes two basic parts, the query and the update: the query matches the document, and the update is the $set or $addToSet operator (see the sketch below).
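
(In shell terms, the difference being described -- the addresses field name is assumed:)

    // findandmodify atomically queries, updates, and returns a document;
    // for a bulk load, a plain update of the matched document is enough:
    db.customer.update(
        { _id : 2 },
        { $addToSet : { addresses : { $each : [ /* address docs */ ] } } }
    );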
OP: Yeah, you're right. For some reason, when I started using the mongodb-csharp library it didn't dawn on me that I could pass modifiers as part of the update document (to collection.Update()). Rookie mistake, I guess. Exports now run 4 times faster using simple update operations.