Hi,

I have a C# application that downloads a list of .xml files from an on-line data warehouse. This application dumps the files into a local directory and it takes roughly 1 hour before all 10k files have downloaded. This is a daily process.

I need to take each of these files and extract, transform and load (ETL) the contained data into a database. I would like to do this in parallel with the download, as I don't want to wait until all files are downloaded before I start the ETL process. Unfortunately, the XML files contain large quantities of data, so I can only ETL about 10 files at a time. What's a good strategy for achieving my parallel loading requirements?

A: 

You can optimize your situation with some thread pools.

First add all of the files to be downloaded to a queue which is protected by synchronization.

Use one thread pool for downloading the files. When a worker is about to download a file, it removes that file from the queue of files to be downloaded. After a successful download, it adds the file to a second queue of work waiting to be processed. If there is an error of some kind, it can put the file back on the download queue. Each download thread ends itself when there are no more files in the download queue.

While that is running, a second thread pool processes the actual XML files; its workers take items from the queue of already downloaded files. Each processing thread ends itself when there are no more downloaded XML files to process AND the download pool has already finished.

Make sure you handle synchronization on the queues (for example, protect insertion and removal with a mutex).

By using thread pools you can change how many threads are used without affecting the program logic. Choose the counts based on how many resources you want to consume; past a certain point, adding more threads brings no benefit and just makes the CPU spend its time on task switching.
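The two-queue design described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: it assumes .NET 4 or later and uses `BlockingCollection<T>`, which handles the queue synchronization internally instead of a hand-written mutex. The names `EtlPipeline`, `Run`, `download` and `process` are hypothetical placeholders for your own download and ETL code, the worker counts and the queue capacity of 100 are arbitrary, and retrying failed downloads is left out for brevity.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class EtlPipeline
{
    // 'download' and 'process' are hypothetical delegates standing in for the
    // real download and ETL code; 'download' returns the local file path.
    public static int Run(
        IEnumerable<string> urls,
        Func<string, string> download,
        Action<string> process,
        int downloadWorkers = 4,
        int etlWorkers = 10)
    {
        // Queue 1: files still to be downloaded.
        var pending = new BlockingCollection<string>();
        foreach (var url in urls) pending.Add(url);
        pending.CompleteAdding();

        // Queue 2: downloaded files waiting for ETL. The bound (an arbitrary
        // value) makes downloaders wait if processing falls far behind.
        var downloaded = new BlockingCollection<string>(100);
        int processed = 0;

        // Download workers: each one ends when the pending queue is drained.
        Task[] downloaders = Enumerable.Range(0, downloadWorkers)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                foreach (var url in pending.GetConsumingEnumerable())
                    downloaded.Add(download(url));
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        // ETL workers: each one ends when the downloaded queue is empty AND
        // has been marked complete (i.e. all downloaders have finished).
        Task[] processors = Enumerable.Range(0, etlWorkers)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                foreach (var path in downloaded.GetConsumingEnumerable())
                {
                    process(path);
                    Interlocked.Increment(ref processed);
                }
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        try { Task.WaitAll(downloaders); }
        finally { downloaded.CompleteAdding(); } // lets the ETL workers exit

        Task.WaitAll(processors);
        return processed;
    }
}
```

With `etlWorkers` set to 10, at most ten files are in the ETL stage at any moment, which matches the limit mentioned in the question. Changing either worker count does not touch the rest of the logic.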

Brian R. Bondy
A: 

If that is too complex for your needs, you might want to look into Parallel.ForEach / Parallel.For. Also consider the new Task class (Task.Factory.StartNew(...)) together with continuations (e.g. the download finishes and the task then continues into a processing function).
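As a rough illustration of both suggestions, assuming .NET 4 or later: `SimpleEtl`, `download` and `process` are hypothetical names standing in for your own code, and the limit of 10 is taken from the question. Treat this as a sketch rather than a finished design.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SimpleEtl
{
    // Variant 1: Parallel.ForEach. Each iteration downloads one file and then
    // processes it; MaxDegreeOfParallelism caps how many run concurrently.
    public static void RunWithParallelForEach(
        IEnumerable<string> urls,
        Func<string, string> download,
        Action<string> process,
        int maxParallel = 10)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        Parallel.ForEach(urls, options, url => process(download(url)));
    }

    // Variant 2: one task per download, with a continuation that processes
    // the file once its download has finished.
    public static void RunWithContinuations(
        IEnumerable<string> urls,
        Func<string, string> download,
        Action<string> process)
    {
        Task[] tasks = urls
            .Select(url => Task.Factory
                .StartNew(() => download(url))
                .ContinueWith(t => process(t.Result)))
            .ToArray();

        // Exceptions from any download or processing step surface here,
        // wrapped in an AggregateException.
        Task.WaitAll(tasks);
    }
}
```

Note that the continuation variant, as written, relies on the default scheduler and does not itself cap the number of files being processed at once, so the Parallel.ForEach variant is the closer fit for the "about 10 at a time" constraint.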

Chad