views:

257

answers:

8

A niave beginners question about database design.
I have an app managing some logger data eg. 1000s of sequential measurements of time, voltage, current, temperature. In addition each sequence run has meta data (date, location, etc).
So I need a table for each set of measurements and a master table listing these tables and the meta data for each.

A couple of questions:
This doesn't really use the fact that all the data tables are the same format - there is no 'array of tables' concept, does this mean anything?

Would I just give each data table a unique name, put that in the main table as a column, then simply substitute it into the SQL select statement - or is there a better way?

edit: The reason for many tables, one per run, is that there might be 10-100 runs each with many 1000s of measurements. If I want to display/query/delete just one run it seemed more natural to have each in it's own table.
It looks like this is a programmer mindset (everything should be collections of separate objects) the database approach seems to prefer - store everything together and use index/cross-references.

A: 

If a single measurement has all of the values (time, voltage, current, temp) then you'd use a field for each value in one table.

If they are not related, you're better off simple having a value and a value-type indicator (1=voltage,2=current etc).

If they are related, and you want to handle the metadata, you may want to have a generic metadata field, and a metadata type field, depending on the nature of your metadata.

For argument's sake let's assume your metadata is simply the location. You'd set up a table of locations, then put the appropriate locationID in the metadata field. (make sure the type in INT if you're putting in the locationId so you can join the tables later)

You would typically use a look-up table for the type indicators and use a view to make your ports by joining those fields.

Diodeus
+7  A: 

You'd have one table (Run) that stores date/time/location and an integer primary key (auto increment).

You'd have another table (Measurements) that stores time/voltage/current/temp, an integer primary key (auto increment), and a FK to Run (just a not-null integer column called RunID).

So 2 tables total.

colithium
A: 

If I understand your question, I would probably use two tables.

'TestInfo'. Each record Contains date, etc that the test was run. This is your metadata. Additionally I would add in a generated 'RunId' which is a unique number.

'RunInfo' Each record contains the runId, and the current, temperature, time, etc information.

The RunId ties the data back to the proper test run for you, and you don't have to create a bunch of tables. Additionally you can do queries into your runinfo accross any test.

Now - depending on your use case, this may not be ideal, but it is a good place to start.

John Chenault
A: 

What I would do is have one unique id (that auto increments) in the meta data table, and use that ID to group together the measurements and meta data.

+-------------------+     +----------------------+
|    Meta Data      |     | Measurements         |
+-------------------+     +----------------------+
|  metaID (pri Key) |---->| metaID <-(index this)| 
|  location         |     | measureID   (pri Key)|
|  metaDate         |     | voltage              |
+-------------------+     | amperes              |
                          | current              |
                          +----------------------+

You'd have a bunch of measurement records with the same metaID (but they'd have their own id) and they would each match up to the meta data.

Grant
Would that be less efficent than each set of measurements in it's own table? I am assuming all queries would only apply to a particular run.
Martin Beckett
Having each set of measurements in a different table sounds like a pain. If there is an index on the measurement's metaID column, then the results are going to be instant either way. Unless you have a ton of people reviewing the data at once.
Grant
+2  A: 

You are asking that question with the assumption that you will need more than the two tables all answers so far are advising you to create.

You state however that this is a question about database design, and the proper design is indeed using those two tables. Partitioning of the data is something you would do later, if it is absolutely necessary, and most probably once you are way past those n x 100 runs x m x 1000 measurements.

The table with the runs will have not many rows. The table with the measured data will have many rows, but with the key referencing the runs table you can create indices that will speed up your database operations. Your application design will be much easier OTOH.

mghie
Yes - thats why I asked here, as a programmer I naturally think in 'collections of separate objects' which isn't necessarily the correct approach to a dB
Martin Beckett
+2  A: 

I'd do the following (thanks Grant for the table design):

+-------------------+     +----------------------+
|    Sequence       |     | Measurements         |     +------------------+
+-------------------+     +----------------------+     | Measurement Type |
|  metaID (pri Key) |---->| metaID               |     +------------------+
|  location         |     | typeID               |---->| typeID (pri Key) |
|  other metadata   |     | measureID   (pri Key)|     | typeName         |
+-------------------+     | value                |     | typeUnits        |
                          +----------------------+     +------------------+

So the types table would have:

typeID      typeName     typeUnits
0           Volts        V
1           Current      A

The Measurements table would have one row for each measurement.

So one sequence row would correlate to several measurement rows, each of which would specifiy what type of measurement it was.

This would enable you to easily have one test run with volts, and current, and another test run with volts1, volts2, current, resistance without changing the table and all the queries that go with it.

The reason for many tables, one per run, is that there might be 10-100 runs each with many 1000s of measurements. If I want to display/query/delete just one run it seemed easier to have each in it's own table

Well, it's not really any easier because at some point you have to decide which table you want to view and put that table name into the SQL statement. It's just as easy to have them all in one table, and instead of putting the table name into the SQL request, put the sequence name as a WHERE sequence="sequencenumber". This is just as easy and shouldn't be any slower, even with thousands of entries. In some SQL servers you can make views as well, which are sort of like a "fake" table. It looks like a table, but it's really an embedded SQL statement - you can create a new view for each sequence if you like, and then you can manipulate them as if they were their own table.

However -
It really doesn't matter which way you go. If you're happy with a table for each sequence, go ahead and do it. It's just a database, and if/when you get to the point that you understand and need more complex layouts it's not hard to convert an existing database into a new, more flexible layout.

Make it easy for now, and as you learn you can adapt it as needed.

Adam Davis
Let's not keep ascending the ivory tower of normalization just yet... Let's get this guy to at least see the benefit of gathering all his data in one table first!
Toybuilder
Well... I _did_ leave out a table or two that I would have otherwise added already, but I suppose a simpler database design is called for... Maybe... ;-D
Adam Davis
That sounds like a lot of expensive joins to extract the data. It's possible in future that more measurement parameters could be added but I can either expand the single table with new columns (and use NULLS) or have a different tables for each measurement and query the schema for the cols
Martin Beckett
Databases are optimized for joins, so they aren't expensive. But the key here, which you haven't described, is how you're going to be using the data, and what you want out of it. The database design nearly completely depends on how you're going to access it, so this may not be the best option.
Adam Davis
Mostly an entire run will be extracted, plotted and some processing done. I was interested in having the ability to do queries on the data (points with this condition, sub-samples etc) for free using SQL rather than having to write extra code.
Martin Beckett
Ah, so you're using it as essentially a pumped up version of a spreadsheet. In that case, toss it all in one database.
Adam Davis
A: 

Speaking very generally, the two-table approach appears to be the most appropriate for your application.

To assure good performance, make sure you create indexes on your tables. Taking Grant's schema, for example, you will want a index on the Measurements table, keyed on the metaID (primary key), and measurementID (secondary key).

A few thousand measurements per experiment set is not a lot of data - so with a proper index and correctly written query, your database will effectively only read the disk for the data rows for the desired experiment.

Because you're likely to load the table with experiment data from one experiment at a time, there's already a high degree of locality of related data -- so you won't benefit from separating the data into separate tables.

Finally, you might want to learn about creating table views -- that will give you the ability to create virtual tables, with each view representing only the data from one experiment. You can create the view on demand, and then drop the view after you are done.

Toybuilder
Thank you - It seemed better from Normalisation that everything the same shape was in the same table, but from an OO point of view it seemed they are independant. I hadn't appreciated that the DBM could of course efficiently access blocks of contiguous rows.
Martin Beckett
+2  A: 

BTW, I want to address the question of whether it's a pain to manage all the data in one table, versus having multiple tables.

Starting assumption: you're going to want to at least have a meta-data table to store the experiment meta-data, and one-or-more table(s) to store the measurements.

If you take the "I want a table for each measurement set" approach, you'll need to know the name of the table. So your metadata table will look something like:

Metadata
----------
experimentName
experimentLocation
experimentDateTime
...
measurementTableName

And a table for each experiment, each looking the same:

Measurement0001    Measurement0002    Measurement0003    Measurementt0004 
--------------     --------------     --------------     -------------- 
measurementNum     measurementNum     measurementNum     measurementNum 
voltage            voltage            voltage            voltage        
amps               amps               amps               amps

Then in your program, you first need to fetch the metadata:

select * from Metadata where experimentName='Johnson Controls 1'

which will, amongst other things, return measurementTableName for the experiment of interest (say Measurement0002). Then, your next query is:

select * from Measurement0002 order by measurementNum

Now, let's take the "store all the data in the measurement table" approach. Your metadata table will look like:

Metadata
----------
experimentName
experimentLocation
experimentDateTime
...
measurementSetID

And a table for all experiments.

Measurement
--------------
measurementSetID
measurementNum
voltage            
amps

Then, just as before, you first need to fetch the metadata:

select * from Metadata where experimentName='Johnson Controls 1'

which will, amongst other things, give you the measurementSetID which you will use to get only the the measurements from the desired set (let's say id=2). Then, your next query is:

select * from Measurement where measurementSetID=2 order by measurementNum

So let's compare the two approaches... The metadata queries are essentially identical. The only difference is that in the first case, you are retrieving the measurementTableName for the desired experiment table; while in the second case, you are retrieving the measurementSetId for the desired entries in the measurements table.

The measurement-fetching queries are essentially the same, too:

select * from Measurement0002                      order by measurementNum
select * from Measurement where measurementSetID=2 order by measurementNum

The only variable is either the table name, or the measurementSetID:

select * from <tablename>                                order by measurementNum
select * from Measurement where measurementSetID=<setID> order by measurementNum

The query result is almost the same, the second approach will just give you an extra column in your data (the measurementSetID) which you can ignore, because all returned rows will have the same set ID.

Toybuilder
Thanks,so in practice it probably doesn't make a lot of difference. It is interesting the difference between a C++/Java programmer approach of keep everything in layers of objects and the DB approach of throw everything together and index/x-reference it.
Martin Beckett