Help with database schema for 50+GB DB

Hi all, I have a task to store a large amount of GPS data and some extra info in a database, and to access it for reporting and some other infrequent tasks.

When I receive a message from a GPS device, it can have a variable number of fields. For example:

Message 1: DeviceId Lat Lon Speed Course DIO1 ADC1
Message 2: DeviceId Lat Course DIO2 IsAlarmOn
Message 3: DeviceId Lat Lon Height Course DIO2 IsAlarmOn
etc., up to 20-30 fields

There is no way to unify the number of fields: different device vendors, different protocols, etc. Another headache is the size of the database and the need to support as many DB vendors as possible (NHibernate is used).

So I came up with the idea of storing messages this way:
Table1 - Tracks
PK - TrackId
TrackStartTime
TrackEndTime
FirstMessageIndex(stores MessageId)
LastMessageIndex(stores MessageId)
DeviceId(not an FK)

Table2 - Messages
PK - MessageId
TimeStamp
FirstDataIndex(stores DataId)
LastDataIndex(stores DataId)

Table3 - MessageData
PK - DataId
double Data
short DataType

All of these indexes (IDs) are assigned with hilo. I tuned my queries so NHibernate can handle inserting 3000+k messages veeeeeery quickly (batching is also used). I'm happy with the performance atm, but I don't know how it will work at 50+ GB or 100+ GB.

I will be very grateful for any tips and hints about my issue and the storage design overall =)
Thanks, Alexey
PS. Sorry for my English =)

+3 A:

In a nutshell, your application, specifically the heterogeneous structure of the messages received from the GPS devices, pushes your design towards an EAV (Entity-Attribute-Value) datastore structure, whereby the Entity is the Message, the Attribute is the "MessageData.DataType" and the Value is systematically a double.

The three-table design you outline in the question, however, seems to depart a bit from a traditional EAV implementation, in the sense that there is an implicit sequence in the way MessageData is stored: all the data points for a given message are sequentially numbered (DataId), and the link from a message to its datapoints is driven by a range of DataId values.
That is a bad idea! There are many problems with it. A notable one is that it introduces an unnecessary bottleneck for the insertion of messages: you can't start inserting a second message until all datapoints for the previous message have been inserted. Another issue is that it makes the relation between message and datapoint difficult to index (the underlying DBMS will not be efficient at it).
==> Suggestion: Make the MessageId a foreign key in the MessageData table (and possibly drop the DataId PK in the MessageData table altogether, just to save space, at the expense of having to use a composite key to refer to a particular record in this table, for example for maintenance purposes).
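
To make the suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than NHibernate. The DataType codes (LAT, LON, SPEED) and the helper names are made up for illustration, and the composite key assumes at most one value per DataType per message.

```python
import sqlite3

# Hypothetical DataType codes; the real mapping would come from the device protocols.
LAT, LON, SPEED = 1, 2, 3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE Messages (
            MessageId INTEGER PRIMARY KEY,
            TimeStamp TEXT NOT NULL
        );
        CREATE TABLE MessageData (
            MessageId INTEGER NOT NULL REFERENCES Messages(MessageId),
            DataType  INTEGER NOT NULL,
            Data      REAL    NOT NULL,
            PRIMARY KEY (MessageId, DataType)  -- composite key, no DataId column
        );
    """)


def insert_message(conn, message_id, timestamp, values):
    """Insert one message and its datapoints; values maps DataType -> double."""
    conn.execute("INSERT INTO Messages VALUES (?, ?)", (message_id, timestamp))
    conn.executemany(
        "INSERT INTO MessageData VALUES (?, ?, ?)",
        [(message_id, data_type, value) for data_type, value in values.items()],
    )


def load_message(conn, message_id):
    """Return the datapoints of one message as a DataType -> value dict."""
    rows = conn.execute(
        "SELECT DataType, Data FROM MessageData WHERE MessageId = ?",
        (message_id,),
    )
    return dict(rows.fetchall())
```

Because each datapoint row carries its own MessageId, rows for different messages can be inserted in any order, and the lookup by MessageId is served by the primary key index.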

Another suggestion is to store the most common attributes (datapoints) at the level of the Message table: for example Lat and Lon, but maybe also Course, some alarms, etc. The reason for having this info right with the message is to optimize queries on the data, by limiting the number of self joins necessary with the MessageData table.
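
A rough sketch of what that could look like, again with sqlite3 for illustration. Which attributes get promoted to columns (here Lat, Lon and Course), the MessageDetail table name and the `messages_in_box` helper are assumptions, not something taken from the question.

```python
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE Messages (
            MessageId INTEGER PRIMARY KEY,
            TimeStamp TEXT NOT NULL,
            Lat       REAL,   -- NULL when the device did not report it
            Lon       REAL,
            Course    REAL
        );
        CREATE TABLE MessageDetail (
            MessageId INTEGER NOT NULL REFERENCES Messages(MessageId),
            DataType  INTEGER NOT NULL,
            Data      REAL    NOT NULL,
            PRIMARY KEY (MessageId, DataType)
        );
    """)


def messages_in_box(conn, lat_min, lat_max, lon_min, lon_max):
    """Return ids of messages inside a bounding box, without touching MessageDetail."""
    rows = conn.execute(
        "SELECT MessageId FROM Messages "
        "WHERE Lat BETWEEN ? AND ? AND Lon BETWEEN ? AND ? "
        "ORDER BY MessageId",
        (lat_min, lat_max, lon_min, lon_max),
    )
    return [message_id for (message_id,) in rows]
```

With a pure EAV layout the same question would need the detail table joined to itself (once for Lat, once for Lon); here it is a single-table filter, and messages without a position simply drop out because comparisons with NULL never match.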

Since both the Messages and the MessageData tables may now contain part of the message, you may also want to rename the latter to MessageDetail, or some such name.

Finally, it may be a good idea to allow for data values other than those of the double type. I anticipate that some of the alerts are merely boolean, etc. Aside from allowing you to accept different kinds of datapoints (say, short error message strings...), this may also give you the opportunity to split the datapoints over multiple "detail" tables: one for doubles, one for booleans, one for strings, etc. This complicates the schema, in the sense that you then need to build some of these details into the way the queries are produced, but it can provide some potential for performance / scaling gains.
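
As an illustration of the split into typed detail tables, the sketch below routes each datapoint to a table based on its Python type. The table names and the routing rule are hypothetical; in a real schema MessageId would be a foreign key to the message table, which is omitted here to keep the example short.

```python
import sqlite3

# Hypothetical table names, one per value type, with the SQL type of the Data column.
DETAIL_TABLES = {
    "MessageDetailDouble": "REAL",
    "MessageDetailBool": "INTEGER",
    "MessageDetailText": "TEXT",
}


def create_detail_tables(conn):
    for name, sql_type in DETAIL_TABLES.items():
        conn.execute(
            f"CREATE TABLE {name} ("
            "MessageId INTEGER NOT NULL, DataType INTEGER NOT NULL, "
            f"Data {sql_type} NOT NULL, PRIMARY KEY (MessageId, DataType))"
        )


def detail_table_for(value):
    """Pick the detail table for a value based on its type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "MessageDetailBool"
    if isinstance(value, (int, float)):
        return "MessageDetailDouble"
    if isinstance(value, str):
        return "MessageDetailText"
    raise TypeError(f"unsupported datapoint type: {type(value).__name__}")


def insert_details(conn, message_id, values):
    """values maps DataType -> value; each value lands in its typed table."""
    for data_type, value in values.items():
        table = detail_table_for(value)  # always one of the fixed names above
        conn.execute(
            f"INSERT INTO {table} VALUES (?, ?, ?)",
            (message_id, data_type, value),
        )
```

The cost mentioned above shows up right away: any code that reads a full message back has to know about all three tables.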

mjv 2009-10-23 16:19:13

Thanks a lot =) I'll try to create a schema using your suggestions and post the results later, after some tests.

Alexey Anufriyev 2009-10-23 19:47:17

I'll try to describe how it works now in more detail in an answer, because comments have a fixed length =) Here is the receive sequence:
1. The service receives messages from MSMQ (the number of messages can differ; atm it uses bulk packets of 500 messages).
2. Then it extracts the distinct device IDs.
3. For each device ID it uses the MS EntLib isolated storage cache with the structure:
DeviceId --> List where DeviceId is lookup key.
4. If we have more than 1k messages in the cache, it writes them into the DB in one sequence and afterwards writes an "index" to the lookup table:
Index: id
serial_id
index_start_datetime
index_end_datetime
index_first_dataid
index_last_dataid
5. Cleans cache for this DeviceId
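
The sequence above can be sketched roughly as follows. This is plain in-memory Python, not the EntLib cache or MSMQ; the `DeviceCache` class, the `write_batch` callback and the use of the device ID as `serial_id` are assumptions made for the example.

```python
from collections import defaultdict


class DeviceCache:
    """Buffers messages per device and flushes a device once it exceeds a threshold."""

    def __init__(self, flush_threshold, write_batch):
        self.flush_threshold = flush_threshold
        # write_batch(device_id, batch) stores the batch and returns
        # (first_dataid, last_dataid); it stands in for the real DB write.
        self.write_batch = write_batch
        self.pending = defaultdict(list)  # DeviceId -> list of (timestamp, payload)
        self.index = []                   # stand-in for the lookup table

    def receive(self, packet):
        """packet: iterable of (device_id, timestamp, payload) tuples."""
        touched = set()
        for device_id, timestamp, payload in packet:
            self.pending[device_id].append((timestamp, payload))
            touched.add(device_id)
        for device_id in touched:
            if len(self.pending[device_id]) > self.flush_threshold:
                self._flush(device_id)

    def _flush(self, device_id):
        batch = self.pending.pop(device_id)  # also clears the cache for this device
        first_id, last_id = self.write_batch(device_id, batch)
        timestamps = [timestamp for timestamp, _ in batch]
        self.index.append({
            "serial_id": device_id,
            "index_start_datetime": min(timestamps),
            "index_end_datetime": max(timestamps),
            "index_first_dataid": first_id,
            "index_last_dataid": last_id,
        })
```

With the numbers from the description, this would be created with `flush_threshold=1000` and fed packets of 500 messages.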

Also, I store data in pairs: id data1 data2 type
for example lat lon, speed course, adc1 adc2, dio1 dio2, and if a value has no partner: value 0

I chose double because I can store every type of data the devices send in it. They don't send strings; most of them send CSV-style data like 1,0,23,50.0000N30.00000,1,2,12,0,1,2 etc. Even alarms and the like have the same type of data. When I need to get some data, I just find the indexes for the given datetime window and DeviceId, and get the actual data knowing where it starts and ends. There are no complex queries, just 2 simple ones. The rest of the code interprets the data using some protocol "mappings". Thanks for the EAV tip, I think it fits well. The first table, Tracks, is for aggregating messages and getting them quickly in the retrieval algorithm I described a couple of lines above.
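
A sketch of those two queries, written with sqlite3 for illustration. The table names TrackIndex and Data, the ISO-8601 text timestamps and the `load_window` helper are assumptions; the column names follow the index and pair layout described above.

```python
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE TrackIndex (
            id                   INTEGER PRIMARY KEY,
            serial_id            INTEGER NOT NULL,
            index_start_datetime TEXT    NOT NULL,  -- ISO-8601, so text order is time order
            index_end_datetime   TEXT    NOT NULL,
            index_first_dataid   INTEGER NOT NULL,
            index_last_dataid    INTEGER NOT NULL
        );
        CREATE INDEX ix_trackindex_lookup
            ON TrackIndex (serial_id, index_start_datetime);
        CREATE TABLE Data (
            id    INTEGER PRIMARY KEY,
            data1 REAL    NOT NULL,
            data2 REAL    NOT NULL,
            type  INTEGER NOT NULL
        );
    """)


def load_window(conn, serial_id, start, end):
    """Return the Data rows of every index entry overlapping [start, end]."""
    # Query 1: find the index entries for this device that overlap the window.
    ranges = conn.execute(
        "SELECT index_first_dataid, index_last_dataid FROM TrackIndex "
        "WHERE serial_id = ? AND index_start_datetime <= ? "
        "AND index_end_datetime >= ? ORDER BY index_first_dataid",
        (serial_id, end, start),
    ).fetchall()
    rows = []
    # Query 2 (once per index entry): read the id range it points at.
    for first_id, last_id in ranges:
        rows.extend(conn.execute(
            "SELECT id, data1, data2, type FROM Data "
            "WHERE id BETWEEN ? AND ? ORDER BY id",
            (first_id, last_id),
        ).fetchall())
    return rows
```

Note that this returns whole index entries, so rows just outside the requested window can come back too and would be trimmed by the code that interprets the data.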

Alexey Anufriyev 2009-10-23 20:39:20

Alexey Anufriyev 2009-10-27 11:07:09

Hi, I'm writing a similar application. I suggest identifying all possible values from the vendors and creating a proper schema with all the necessary fields. Thanks to this, you can write the simplest and best-performing reporting queries.

Besides, you can create fields that hold data of a specified length, which means you can save space and improve performance.

I have one vendor with known values, so I created one table for it. This table can easily be partitioned using the native MS SQL Server mechanism.

So my simpler situation allows me to write one stored procedure for saving data. There is no NHibernate there, just pure ICommand.

The rest of the application uses NHibernate.

dario-g 2009-12-27 22:51:23
