Imagine collecting all of the world's high-school students' grades each month into a single table. In each student's record, you're required to include the final averages for the subject across the student's class, city, and country. This could be done in a post-processing step, but your boss says it has to be done during data collection.

Constraint: the rows are written to a flat file and then bulk-inserted into the new table.

What would be a good strategy or design pattern for holding on to the several hundred thousand averages until the table is complete, without adding excessive memory/processing overhead to the JVM or RDBMS? Any ideas would be helpful.

Note: Because the table is read-only, we add a clustered index to it on completion.

+2  A: 

I'd tell my boss to stop micromanaging.

But seriously, sort the data by country, then city, then class; since classes are nested in cities and cities in countries, all three groupings end up contiguous. Then compute the running average for each level by keeping a running total and count for class, city, and country. When you encounter a different class, write the class name and its average to a file. Do the same for cities and countries, using a separate file for each level. Then you can open the sorted data file and the average files and insert rows into the database one by one.
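
A minimal sketch of that first pass, assuming the grade file has already been sorted by country, city, class and that each line looks like "country,city,class,studentId,grade" (the file names and field layout here are assumptions, not taken from the question):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class AverageFlusher {

        // Running total and count for one grouping level (class, city, or country).
        static class Accumulator {
            String currentKey;
            double sum;
            long count;
            final PrintWriter out;

            Accumulator(PrintWriter out) { this.out = out; }

            // Add one grade; when the group key changes, flush the finished average.
            void add(String key, double grade) {
                if (!key.equals(currentKey)) {
                    flush();
                    currentKey = key;
                    sum = 0;
                    count = 0;
                }
                sum += grade;
                count++;
            }

            void flush() {
                if (count > 0) {
                    out.println(currentKey + "," + (sum / count));
                }
            }
        }

        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("grades_sorted.csv"));
                 PrintWriter classOut = new PrintWriter("class_averages.csv");
                 PrintWriter cityOut = new PrintWriter("city_averages.csv");
                 PrintWriter countryOut = new PrintWriter("country_averages.csv")) {

                Accumulator byCountry = new Accumulator(countryOut);
                Accumulator byCity = new Accumulator(cityOut);
                Accumulator byClass = new Accumulator(classOut);

                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    double grade = Double.parseDouble(f[4]);
                    byCountry.add(f[0], grade);
                    byCity.add(f[0] + "," + f[1], grade);
                    byClass.add(f[0] + "," + f[1] + "," + f[2], grade);
                }
                // Flush the final group at each level.
                byClass.flush();
                byCity.flush();
                byCountry.flush();
            }
        }
    }

The second pass is then a straight merge of the sorted grade file with the three (also sorted) average files, appending the matching class/city/country averages to each row before the bulk insert, so nothing larger than one group's totals is ever held in memory.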

If you want to use a framework that will handle all the writing to disk, I would look into using Hadoop for the processing.
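
Purely as a rough illustration of that direction (the class names and the assumed input layout of "studentId,classId,cityId,countryId,grade" are assumptions, not from the answer), the per-class average could be expressed as a Hadoop MapReduce mapper/reducer pair like this; the job driver and the city/country variants are omitted:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ClassAverage {

        // Emits (classId, grade) for each input line.
        public static class GradeMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split(",");
                context.write(new Text(f[1]), new DoubleWritable(Double.parseDouble(f[4])));
            }
        }

        // Averages all grades seen for one classId.
        public static class AverageReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double sum = 0;
                long count = 0;
                for (DoubleWritable v : values) {
                    sum += v.get();
                    count++;
                }
                context.write(key, new DoubleWritable(sum / count));
            }
        }
    }

Hadoop handles the sorting, spilling to disk, and grouping by key, so the same job structure scales well beyond a few hundred thousand rows.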

Jon Snyder
Good answer, but I forgot an important constraint - the flat-file/bulk-loading issue. I've updated my question.
belwood
@belwood I've modified my answer to include this constraint.
Jon Snyder
Hadoop looks very interesting - thanks for the tip.
belwood