I have a data transformation product which lets one select tables in a source database and transform their row data into a destination database.

This is handled in the current product (a Java-based workbench and engine) by processing 1,000 rows at a time across 10 parallel threads (roughly as in the sketch below, after the list). This approach works fine on smaller data sets, but when I have to transform huge data sets (say about X million records) at one time, it still works, except that:

  • The CPU of the host machine my product runs on comes under heavy load.
  • The source and target databases are hit with so many transactions that they start slowing down. (This could be attributed to the database server probably running on slower hardware.)
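
For context, the current chunking/threading looks roughly like this simplified sketch (table names, connection strings and the transformation itself are placeholders, not the actual product code):

```java
import java.sql.*;
import java.util.concurrent.*;

// Simplified sketch of the current approach: 10 worker threads, each pulling,
// transforming and writing 1,000-row chunks. All names and connection strings
// are placeholders (credentials omitted).
public class ChunkedTransformer {
    private static final int THREADS = 10;
    private static final int CHUNK_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        long totalRows = 5_000_000L;                     // placeholder row count
        for (long offset = 0; offset < totalRows; offset += CHUNK_SIZE) {
            final long start = offset;
            pool.submit(() -> transformChunk(start, CHUNK_SIZE));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void transformChunk(long offset, int size) {
        String select = "SELECT id, payload FROM src_table ORDER BY id "
                      + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        String insert = "INSERT INTO dst_table (id, payload) VALUES (?, ?)";
        try (Connection src = DriverManager.getConnection("jdbc:sqlserver://source;databaseName=srcDb");
             Connection dst = DriverManager.getConnection("jdbc:sqlserver://target;databaseName=dstDb");
             PreparedStatement read = src.prepareStatement(select);
             PreparedStatement write = dst.prepareStatement(insert)) {
            read.setLong(1, offset);
            read.setInt(2, size);
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setString(2, transform(rs.getString("payload")));
                    write.addBatch();
                }
            }
            write.executeBatch();                        // one batched write per chunk
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    static String transform(String value) {
        return value == null ? null : value.trim().toUpperCase();  // stand-in transformation
    }
}
```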

I started looking for solutions, and my first move was to request a hardware "beef up" of the source/destination database server machines - say, a new multi-core CPU and some extra RAM. It turns out upgrading hardware wasn't the only concern: additional database software licenses also had to be procured, thanks to the multi-core processors (the database is licensed per core).

So the ball is in my court now: I will have to come up with ways to address this issue by making changes to my product. And here is where I need your help. At the moment, I can think of one possible approach to handling huge loads:

Approach 1

  1. Read data from the source database and persist it to a temporary medium (a file).
  2. Transform the data in the persisted file by running the job in a distributed environment (cheaper single-core machines), accepting the trade-off of switching to file persistence. (Using something like Apache Hadoop to handle the distributed computing part - see the sketch after this list.)
  3. Write the data to the destination database.
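
To make step 2 of this approach concrete, here is a minimal map-only Hadoop sketch of what the distributed transform could look like (the line-per-row input layout, the paths and the transformation itself are assumptions, not actual product code):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: each mapper reads lines of the persisted dump, applies the row
// transformation, and writes the transformed lines back out. The output files
// would then be bulk-loaded into the destination database in step 3.
public class TransformJob {

    public static class TransformMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String transformed = value.toString().trim().toUpperCase(); // stand-in transform
            context.write(new Text(transformed), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "row-transform");
        job.setJarByClass(TransformJob.class);
        job.setMapperClass(TransformMapper.class);
        job.setNumReduceTasks(0);                       // map-only: no shuffle needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // dumped source rows
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // transformed rows
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```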

This is all I could come up with for now from an architectural perspective. Have you handled this situation before? If so, how did you handle it? I appreciate your suggestions and help.

A: 

The first thing to think about here is whether you really need transactions for this amount of data. If the answer is no, your database product likely has a bulk insert option that is made for exactly this kind of large insertion.
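
For illustration, something along these lines would hand the load to SQL Server's bulk path from Java (a sketch only - the table name, the file path, and the assumption that the data file is readable by the database server itself are all placeholders):

```java
import java.sql.*;

// Sketch: let the server do a bulk load from a staged file instead of
// row-by-row transactional inserts. Names and paths are placeholders.
public class BulkLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection dst = DriverManager.getConnection(
                "jdbc:sqlserver://target;databaseName=dstDb");
             Statement stmt = dst.createStatement()) {
            stmt.execute(
                "BULK INSERT dbo.dst_table " +
                "FROM 'D:\\staging\\transformed_rows.csv' " +
                "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK, BATCHSIZE = 100000)");
        }
    }
}
```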

Edit (further to the comments): I think the most bang for your buck (in SQL Server anyway) would be to set the target database to simple recovery mode for the duration of the operation. In fact, if you did that, it is likely that you would not have to make any other code changes.

However, that is only appropriate if the target database is not being used for other things at the same time. I would say that is a basic requirement. It is a fundamental database mistake to try to insert 25 million records into a database while it is live with OLAP transactions. If that is strictly necessary, then I think the solution is to make the process very slow (with significant pauses) so as to free up resources so that the databases can keep running.
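
As a sketch of the recovery-model suggestion (SQL Server only; the database name and connection details below are placeholders, and this assumes nothing else needs point-in-time recovery of that database during the load):

```java
import java.sql.*;

// Sketch: put the target DB into SIMPLE recovery for the duration of the load,
// then switch it back. "dstDb" and the connection string are placeholders.
public class RecoveryModeSwitch {
    public static void main(String[] args) throws SQLException {
        try (Connection dst = DriverManager.getConnection(
                "jdbc:sqlserver://target;databaseName=master");
             Statement stmt = dst.createStatement()) {
            stmt.execute("ALTER DATABASE dstDb SET RECOVERY SIMPLE");
            try {
                runBigLoad();    // the existing transformation/load step goes here
            } finally {
                stmt.execute("ALTER DATABASE dstDb SET RECOVERY FULL");
            }
        }
    }

    static void runBigLoad() {
        // placeholder for the actual load
    }
}
```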

Yishai
@Yishai are you suggesting that if I were to skip transactions and find a way to use the bulk insert option the database usually makes available, it would not affect the load on the database? So if you were to insert 25 million records using an import utility, the database's performance for all the requests it is currently serving would be unaffected? If so, do you have any references in this context?
Jay
I wouldn't say that they are unaffected, but the operation is optimized for what it is doing. I see you are trying to use both Oracle and SQL Server. You can try the JDBC support for bulk operations, but I'm not sure exactly what the driver does. Here is some documentation on optimizing bulk inserts in SQL Server: http://msdn.microsoft.com/en-us/library/ms190421%28v=SQL.105%29.aspx
Yishai
Thank you for the link, I will check it out.
Jay
A: 

Have you benchmarked it using smaller-sized transactions? Otherwise I wouldn't use transactions for this. From your licensing issue it sounds like you are using either Oracle or SQL Server. Both of them have a bulk insert facility that would be a better match for this than transactions.
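
For example, SQL Server's bcp command-line utility (or Oracle's SQL*Loader) can be driven straight from Java - a rough sketch, with server, table, file and authentication values as placeholders:

```java
import java.io.IOException;

// Sketch: invoke SQL Server's bcp bulk copy utility from Java. All values
// (database, table, file, server) are placeholders; Oracle's sqlldr could be
// launched the same way with its own arguments.
public class BcpImport {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process bcp = new ProcessBuilder(
                "bcp", "dstDb.dbo.dst_table", "in", "D:\\staging\\transformed_rows.csv",
                "-S", "targetServer", "-T",       // -T = trusted (Windows) authentication
                "-c", "-t,")                      // -c = character data, -t, = comma field terminator
                .inheritIO()
                .start();
        int exitCode = bcp.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("bcp failed with exit code " + exitCode);
        }
    }
}
```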

scphantm
@scphantm no, I haven't benchmarked it - or rather, it has been benchmarked, it's just that I don't have the stats right now. Yes, I have been testing my product against both Oracle and SQL Server. I am aware that they have import/export utilities, but the question is, how effective are these?
Jay
They can be very effective. I used them with SQL Server years ago to move and process millions of records into a reporting database every night. The process took about 3 hours, but there was a huge amount of preprocessing that had to be done on each record.
scphantm
+2  A: 

There are a couple of things that you could do without increasing your database license cost:

  • Your tool is putting the CPU under heavy load. Assuming that your tool is running on a machine that is not also running a database, increase the CPU power on that machine, or, if your tool allows it, run it on several machines.
  • One of the reasons that the number of active transactions goes up is that each individual transaction takes time to complete. You can speed this up by optimising your disks or putting in faster disks.

Also, if you are using plain inserts instead of bulk insert, there is massive improvement potential. The problem with a normal insert is that it writes information to the transaction log so that it is possible to roll back the transaction.

In this case I was able to help someone reduce the load time from 10 hours to 6 minutes :)

Shiraz Bhaiji
+1 Definitely - bulk insert, if it's not already being used, can be a huge gain.
eglasius
@Shiraz thank you for the link to the case study. My intention in asking this question was to collect all possible solutions to my problem. I don't want you to worry too much about the database transactions being the bottleneck. If you were to consider distributed computing as one of the possible solutions, would you vote for such a solution? If not, what are the possible non-database approaches to this problem?
Jay
Instead of using a local relational database, you could use a cloud name/value pair data store, but then your bottleneck would be your internet connection. You could also start many machine instances when you run the import and shut them down when the import is finished.
Shiraz Bhaiji
I guess using a cloud name/value pair data store will adversely affect performance. And yes, I would like to start many machine instances - but how would one manage all of these instances? Do you think Apache Hadoop fits in here?
Jay
+1  A: 

Divide and conquer!

If the source DB can't handle both jobs at once (the ETL and "regular" transactions) then don't make it suffer:

  • Copy the source data to a "mirror".
  • Perform the ETL on the "mirror".

NB - when I say "mirror" I simply mean a copy that allows fast and efficient copying of the data (a bit like a "staging" DB) - not another big/slow/nasty ETL process. The idea here is to optimize the process to benefit the source DB.

Then you can optimize the ETL into the target DB to benefit the target DB; because you've teased the source and target apart, it will be easier to optimize the read/insert portions of the overall process.

You could probably do a similar thing at the target end as well (using another "mirror" / staging DB).
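
As a rough sketch of the source-side step (schema, table and connection names are made up), the "mirror" can be as simple as a set-based copy into a staging schema that the ETL then reads from:

```java
import java.sql.*;

// Sketch of the "mirror"/staging split: snapshot the source table into a staging
// schema with one set-based copy on the source server, then point the heavier ETL
// reads at the staging copy so the live table is left alone. Names are placeholders.
public class StagingCopy {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                "jdbc:sqlserver://source;databaseName=srcDb");
             Statement stmt = src.createStatement()) {

            // 1. Refresh the staging copy in a single set-based statement.
            stmt.executeUpdate("TRUNCATE TABLE staging.orders");
            stmt.executeUpdate("INSERT INTO staging.orders SELECT * FROM dbo.orders");

            // 2. The transformation engine now reads from staging.orders instead of
            //    dbo.orders, so regular traffic on the live table is not competing
            //    with the long-running ETL reads.
        }
    }
}
```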

This approach isn't that different from what you've suggested, but I'm assuming that a straight copy of data between two identical DBs of the same type will be both the easiest to manage and the most efficient.

After that you can start applying some of the other suggestions that others can put forward.

One last thing - you could experiment with the ETL tool - if you're running

Adrian K