Hi, I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which, up to this point, has been stored in text files by the DOS programs that run the machines). I have decided to modernise and create a new database to store all the machine results in.

I have created separate tables to store results by type, e.g. all results from the balance machine get stored in the balance results table, etc.

I have a common results table format for each machine which is as follows:

ClientRequestID (PK), SampleNumber (PK), MeasureDtTm, Operator, AnalyteName, UnitOfMeasure, Value
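
Roughly, in Jet DDL terms, each of these tables looks something like the following (the table name and data types here are only indicative, not my real definitions):

CREATE TABLE BalanceResults (
    ClientRequestID LONG NOT NULL,
    SampleNumber    LONG NOT NULL,
    MeasureDtTm     DATETIME,
    Operator        LONG,
    AnalyteName     LONG,
    UnitOfMeasure   LONG,
    [Value]         DOUBLE,
    CONSTRAINT pk_BalanceResults PRIMARY KEY (ClientRequestID, SampleNumber)
);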

A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.

This is fine for all machines except one!

It measures 20-30 analytes per sample (and just spits them out in one long row), whereas with all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber. If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements. My other tables will only grow at a rate of 3000-5000 rows per year.

So after all that, my question is this:

Am I better off sticking to the common format for this table, and having bucket loads of rows, or is it better to just add extra columns to represent each analyte, such that it would generate only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
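
For what it's worth, the wide alternative I have in mind would look roughly like this (only the first few analyte columns shown; the table and column names are just placeholders):

CREATE TABLE MultiAnalyteResultsWide (
    ClientRequestID LONG NOT NULL,
    SampleNumber    LONG NOT NULL,
    MeasureDtTm     DATETIME,
    Operator        LONG,
    Analyte1        DOUBLE,
    Analyte2        DOUBLE,
    Analyte3        DOUBLE,
    CONSTRAINT pk_MultiAnalyteResultsWide PRIMARY KEY (ClientRequestID, SampleNumber)
);

The real table would carry Analyte1 through Analyte30 in the same pattern.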

All I am worried about is reporting performance and online editing. In both cases, the PK: RequestID and SampleNumber remain the same, so I guess it's just a matter of what would load quicker. I know the multiple column approach is considered woeful from a design perspective, but would it yield better performance in this instance?

BTW the database is MS Jet / Access 2010

Any help would be greatly appreciated!

A: 

You can decouple the AnalyteName column from the 'common results' table:

-- Table Common Results

ClientRequestID (PK), SampleNumber (PK), MeasureDtTm, Operator, UnitOfMeasure, Value

-- Table Results Analyte

ClientRequestID (PK), SampleNumber (PK), AnalyteName

You join on the PK (Request + Sample). That way you don't needlessly duplicate all the rest of the data, you can avoid the join in queries where you don't need the AnalyteName, you can support extra analytes, and it's overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
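
As a rough illustration of the join (the table names here are just made up for the example):

SELECT cr.ClientRequestID,
       cr.SampleNumber,
       ra.AnalyteName,
       cr.UnitOfMeasure,
       cr.[Value]
FROM CommonResults AS cr
INNER JOIN ResultsAnalyte AS ra
    ON  cr.ClientRequestID = ra.ClientRequestID
    AND cr.SampleNumber    = ra.SampleNumber;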

Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.

Vinko Vrsalovic
A: 

Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.

However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.

I'm an experienced Access developer and a big fan of Jet/ACE, but from what I know about your project, if I were starting it out I'd definitely choose a server database from the get-go. Not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road, when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind but were never replaced).

David-W-Fenton
A: 

Thank you both for your insights! On the database back-end point, I will look into other options. I am pretty sure I can migrate to SQL Server without too many problems. As far as the decoupling goes, I probably didn't explain my structure well enough in my original post. The structure (at present) is as below:

ClientRequestID (PK), SampleNumber (PK), MeasureDtTm, Operator, AnalyteName, UnitOfMeasure, Value

Every sample has a range of different analytes being measured, so AnalyteName, UnitOfMeasure and Value always go together. In a small 5-sample ClientRequest, sample 1 might have 5 different analytes (e.g. K, Mg, Ca, Cu, Fe, which may or may not share the same unit of measure), sample 2 might have 20 different analytes to measure, sample 3 might have 15, and so on. None of these fields actually stores textual data; all values are integers, which map back to the relevant codes/measures.

If I process a large client request - say 100 samples, with an average of 20 analytes being measured per sample, this equates to 2000 rows for this 1 clientrequest alone.

The max number of measurable analytes is fixed and cannot exceed 30. So my choices are to keep the design as is, which may be fine since only a few integer values are being stored (albeit in a large number of rows), or to add a bunch of columns to represent the max number of analytes, which would result in 100 rows rather than 2000 rows for the above ClientRequest. Data are always reported on/queried by ClientRequestID.
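
If I do keep the tall design, my understanding is that I can still get one row per sample for reporting by running a crosstab against it, something along these lines (the table name and request ID are just placeholders):

TRANSFORM First(r.[Value])
SELECT r.ClientRequestID, r.SampleNumber
FROM MultiAnalyteResults AS r
WHERE r.ClientRequestID = 12345
GROUP BY r.ClientRequestID, r.SampleNumber
PIVOT r.AnalyteName;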

If performance is not an issue with 500k to 1M rows per year, then I will just keep it as is.

Once again, thanks for taking the time to respond to my question!

james
It kind of depends on what you're going to do with the data. If you want to do your analysis in SQL, you'll want a normalized structure. If, on the other hand, you're just using your database as a bucket to hold data that you'll then output into a format for, say, a statistics package to evaluate, storing it in a non-normalized structure shouldn't be an issue.
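
For instance, with the normalized (tall) structure, that kind of SQL analysis stays straightforward (names below are illustrative only):

SELECT r.AnalyteName, Avg(r.[Value]) AS AvgValue
FROM MultiAnalyteResults AS r
WHERE r.ClientRequestID = 12345
GROUP BY r.AnalyteName;
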
David-W-Fenton
A: 

Thanks David, that makes sense. All the best.

james