I'm looking for ideas on how you'd architect a system like so:

Records arrive in bulk (say 100,000 at a time) from a variety of sources, but primarily from a flat text file.

This data needs to be shoved as-is into a SQL Server database table. However, various metrics need to be computed. For example, one field is a certain 4-digit code. Only certain 4-digit codes are valid, and we need to track how many records arrived with bad 4-digit codes. There are other fields that need to be "validated", and the list of fields could change in the future.

What is a good design for such a system? Is it best to have events BadFourDigitCodeEncountered and event processors OnBadFourDigitCodeEncountered or is there a cleaner design that is easily maintainable going forward?

(I don't think it should matter, but I am using NHibernate as my ORM but maybe that is useful to know since NHibernate has various points to hook into?)

I should mention: using C# .NET 4.0.

Thanks in advance, Arlen

+1  A: 

For most high-capacity file-to-database processes, I would architect the system as an ETVL (extract-transform-validate-load) workflow:

Extract: Open the file, get the rows of data and put them in a queue to be handled by the transform layer.
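To make that concrete, here is a rough sketch of an extract stage, assuming one record per line in the text file; File.ReadLines and BlockingCollection<T> are both available in .NET 4.0, and the names here are just placeholders:

    using System.Collections.Concurrent;
    using System.IO;

    public static class Extractor
    {
        // Reads the file lazily and hands each raw line to the next stage via a queue,
        // so a 100,000-line file never has to sit in memory all at once.
        public static void Extract(string path, BlockingCollection<string> rawLines)
        {
            foreach (string line in File.ReadLines(path))
            {
                rawLines.Add(line);    // blocks when a bounded queue is full (back-pressure)
            }
            rawLines.CompleteAdding(); // tells the downstream stage the file is done
        }
    }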

Transform: Grab the raw record data, chop it up into fields you care about and create a new domain object with the field data. Then this object goes in a queue to be handled by the validate layer.
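The transform stage might look something like the sketch below. The Record class and the pipe-delimited layout are made up for illustration, since I don't know your actual file format:

    // Hypothetical domain object -- the field names are assumptions.
    public class Record
    {
        public string FourDigitCode { get; set; }
        public string Payload { get; set; }
    }

    public static class Transformer
    {
        // Chops a raw line into the fields we care about.
        // A pipe-delimited layout is assumed here; adjust to your real format.
        public static Record Transform(string rawLine)
        {
            string[] parts = rawLine.Split('|');
            return new Record
            {
                FourDigitCode = parts.Length > 0 ? parts[0].Trim() : string.Empty,
                Payload       = parts.Length > 1 ? parts[1] : string.Empty
            };
        }
    }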

Validate: Run your domain object through a series of business rules designed to ensure the record is in a valid, consistent state. Mark valid objects as such (either by placing them in a "good" queue, or by wrapping each one in a simple class holding the object and a pass/fail flag) and hand them to the last queue for the loader. You can calculate your metrics here per batch, or you can get the metrics in real time by placing "failed" records in another table with an error code describing what's wrong, then querying the numbers and causes at your leisure for one batch or many.
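One way to keep the list of validated fields open to change is a small rule interface plus a validator that tallies failures per error code. Everything below (IValidationRule, ValidationResult, the valid-code set) is illustrative rather than prescriptive, and it reuses the Record sketch from the transform stage:

    using System.Collections.Generic;

    public class ValidationResult
    {
        public bool IsValid { get; set; }
        public string ErrorCode { get; set; }   // e.g. "BAD_4_DIGIT_CODE"
    }

    public interface IValidationRule
    {
        ValidationResult Check(Record record);
    }

    // The 4-digit-code rule from the question.
    public class FourDigitCodeRule : IValidationRule
    {
        private readonly HashSet<string> _validCodes;

        public FourDigitCodeRule(IEnumerable<string> validCodes)
        {
            _validCodes = new HashSet<string>(validCodes);
        }

        public ValidationResult Check(Record record)
        {
            bool ok = _validCodes.Contains(record.FourDigitCode);
            return new ValidationResult
            {
                IsValid = ok,
                ErrorCode = ok ? null : "BAD_4_DIGIT_CODE"
            };
        }
    }

    // Runs every rule against a record and keeps per-error-code failure counts,
    // which become your batch metrics.
    public class Validator
    {
        private readonly List<IValidationRule> _rules;
        public readonly Dictionary<string, int> FailureCounts = new Dictionary<string, int>();

        public Validator(IEnumerable<IValidationRule> rules)
        {
            _rules = new List<IValidationRule>(rules);
        }

        public bool Validate(Record record)
        {
            bool allValid = true;
            foreach (IValidationRule rule in _rules)
            {
                ValidationResult result = rule.Check(record);
                if (!result.IsValid)
                {
                    allValid = false;
                    int count;
                    FailureCounts.TryGetValue(result.ErrorCode, out count);
                    FailureCounts[result.ErrorCode] = count + 1;
                }
            }
            return allValid;
        }
    }

Adding a new field check later is then just a matter of writing another IValidationRule and registering it.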

Load: Persist the domain objects to your system's database.
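Since you mentioned NHibernate, the load stage could be a thin wrapper around an ISession. This is only a rough sketch; the batch size of 1,000 is an arbitrary number (pair it with adonet.batch_size in your NHibernate configuration if you go this route):

    using System.Collections.Generic;
    using NHibernate;

    public class Loader
    {
        private readonly ISessionFactory _sessionFactory;

        public Loader(ISessionFactory sessionFactory)
        {
            _sessionFactory = sessionFactory;
        }

        // Persists the validated records, flushing and clearing periodically so the
        // first-level cache doesn't hold all 100,000 objects at once.
        public void Load(IEnumerable<Record> records)
        {
            using (ISession session = _sessionFactory.OpenSession())
            using (ITransaction tx = session.BeginTransaction())
            {
                int i = 0;
                foreach (Record record in records)
                {
                    session.Save(record);
                    if (++i % 1000 == 0)
                    {
                        session.Flush();
                        session.Clear();
                    }
                }
                tx.Commit();
            }
        }
    }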

Each of these stages should be a separate method or even class, managed by a "supervisor" process. The beauty of this design is its scalability: if you end up with a lot of validation or transformation logic that slows the process down, you can very easily modify the supervisor to multithread those stages, adding extra processing power where you need it. It's also modular: if the file format changes, you only have to change the Transform stage (and maybe the Extract stage, if the change is radical enough). If the persistence mechanism changes, you just pop in a new Load layer. Depending on the complexity of your object graph, and thus of the Transform and Validate stages, I think you'll find this well able to handle a hundred thousand records at a time.
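Here's roughly how the supervisor could wire those stages together with bounded queues and the Task Parallel Library (both in .NET 4.0). It uses the classes sketched above, and the queue sizes are arbitrary:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public class Supervisor
    {
        public void Run(string path, Validator validator, Loader loader)
        {
            // Bounded queues so no single stage can run away with memory.
            var rawLines = new BlockingCollection<string>(10000);
            var validRecords = new BlockingCollection<Record>(10000);

            // Extract
            Task extract = Task.Factory.StartNew(() => Extractor.Extract(path, rawLines));

            // Transform + validate. If this becomes the bottleneck, this is the stage
            // you would fan out across several tasks.
            Task process = Task.Factory.StartNew(() =>
            {
                foreach (string line in rawLines.GetConsumingEnumerable())
                {
                    Record record = Transformer.Transform(line);
                    if (validator.Validate(record))
                    {
                        validRecords.Add(record);
                    }
                }
                validRecords.CompleteAdding();
            });

            // Load
            Task load = Task.Factory.StartNew(
                () => loader.Load(validRecords.GetConsumingEnumerable()));

            Task.WaitAll(extract, process, load);
        }
    }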

KeithS
Wow. This is very different from what I was thinking, but it seems like a very good idea. I would give you a vote if the system would let me (I need points?). One point I'm confused on: how, in this system, do you keep track of, say, the total number of records that had some field invalid, and how do you get that information out of the system, say logged to a database table or emailed to the necessary people?
Arlen
Well, your validation stage would either report "valid/not valid" to its supervisor (or to some metrics generation class), or it would tell the loader to save valid ones in one table, and invalid ones to a second one with information on why it failed. Then, you can query the "failure" table to get the number of records that failed because of a certain problem.
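If you go the failure-table route, pulling the numbers back out is a plain aggregate query; the table and column names below (FailedRecord, BatchId, ErrorCode) are invented for illustration:

    using System;
    using System.Data.SqlClient;

    public static class FailureReport
    {
        // Counts failed records per error code for one batch.
        public static void PrintFailureCounts(string connectionString, int batchId)
        {
            const string sql =
                @"SELECT ErrorCode, COUNT(*) AS Failures
                  FROM FailedRecord
                  WHERE BatchId = @batchId
                  GROUP BY ErrorCode";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@batchId", batchId);
                connection.Open();
                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0}: {1}", reader.GetString(0), reader.GetInt32(1));
                    }
                }
            }
        }
    }

From there, logging the totals to another table or emailing them to the right people is just another consumer of that query.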
KeithS