views: 127
answers: 4

I am developing an application that will be integrated with thousands of sensors sending information at 15-minute intervals. Let's assume that the format of the data is the same for all sensors. What is the best strategy for storing this data so that everything is archived (and remains accessible) and the growing volume of data does not have a negative impact?

The question relates to general database design, I suppose, but I would like to mention that I am using Hibernate (with Spring Roo), so perhaps there is something already out there addressing it.

Edit: the sensors are dumb, off-the-shelf devices. It is not possible to extend them. In the case of a network outage all information is lost. Since the sensors work over GPRS this scenario is somewhat unlikely (the GPRS provider is a rather good one here in Sweden), but yes, it can go down and one can do nothing about it.

A queuing mechanism was foremost in my considerations, and Spring Roo provides easy-to-work-with prototype code based on ActiveMQ.

+1  A: 

Let's assume you have 10,000 sensors sending information every 15 minutes. To get better performance on the database side you may have to partition your database, possibly by date/time, sensor type or category, or some other factor. This also depends on how you will query your data.

http://en.wikipedia.org/wiki/Partition_(database)

The other bottleneck would be your Java/JEE application itself. This depends on your business: for example, are all of your sensors going to send information at the same time? And what architecture will your Java application follow? You will have to read articles on high scalability and performance.

Here is my recommendation for a Java/JEE solution.

Instead of a single application, have a cluster of applications receiving the data.

Have a controller application that controls which sensor sends data to which instance of the application in the cluster. An application instance may pull data from a sensor, or a sensor can push data to an application instance, but the controller is the one that decides which application instance is linked to which set of sensors. This controller must be dynamic, so that sensors can be added, removed or updated, and application instances can join or leave the cluster at any time. Make sure you build some failover capability into your controller.

So if you have 10,000 sensors and 10 instances of the application in the cluster, you have 1,000 sensors linked to each application instance at any given time. If you want even better performance, you can have, say, 20 application instances in the cluster, and then 500 sensors are linked to each application instance.
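
To make the controller idea more concrete, here is a minimal, hypothetical sketch of the sensor-to-instance assignment (class, method and id names are my own, not from any framework); with 10 registered instances, roughly 1,000 of 10,000 sensors land on each one:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical controller sketch: maps each sensor to one of the
    // application instances currently registered in the cluster.
    public class SensorAssignmentController {

        private final List<String> instances = new CopyOnWriteArrayList<String>();

        public void instanceJoined(String instanceId) { instances.add(instanceId); }

        public void instanceLeft(String instanceId) { instances.remove(instanceId); }

        /** Returns the application instance responsible for the given sensor. */
        public String instanceFor(String sensorId) {
            if (instances.isEmpty()) {
                throw new IllegalStateException("No application instances available");
            }
            int bucket = (sensorId.hashCode() & 0x7fffffff) % instances.size();
            return instances.get(bucket);
        }
    }

A plain hash keeps the mapping stable while the instance list is unchanged; a real controller would use consistent hashing or explicit reassignment so that an instance joining or leaving does not reshuffle every sensor.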

Application instances can be hosted on the same machine or on multiple machines, so that both vertical and horizontal scalability are achieved. Each application instance will be multi-threaded and have local persistence. This avoids a bottleneck on the main database server and decreases your transaction response time. The local persistence can be SAN file(s), a local RDBMS (like Java DB), or even an MQ. If you persist locally in a database, you can use Hibernate for that as well.

Asynchronously move data from local persistence to the main database. How you do this depends on how you have persisted the data locally. If you use file-based persistence, you need a separate thread that reads data from the file and inserts it into the main database repository. If you use a local database, that thread can use Hibernate to read the data locally and insert it into the main database repository. If you use an MQ, you can have a thread or a separate application move the data from the queue to the main database repository.
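
As a rough illustration of the "persist locally, forward asynchronously" step, here is a sketch in which an in-memory queue stands in for the local persistence and a background thread batch-inserts into the main database, one transaction per batch. The table name sensordata, its columns, and the class names are illustrative assumptions, not taken from the question:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch only: the receiving threads call store(), which never blocks on
    // the main database; a single background thread drains the local store.
    public class AsyncForwarder implements Runnable {

        static class Reading {
            final String sensorId; final double value; final Timestamp recordedAt;
            Reading(String sensorId, double value, Timestamp recordedAt) {
                this.sensorId = sensorId; this.value = value; this.recordedAt = recordedAt;
            }
        }

        private final BlockingQueue<Reading> localStore = new LinkedBlockingQueue<Reading>();
        private final String mainDbUrl;

        public AsyncForwarder(String mainDbUrl) { this.mainDbUrl = mainDbUrl; }

        /** Called by the threads that receive sensor data. */
        public void store(Reading r) { localStore.add(r); }

        /** Background thread: batch-insert into the main database. */
        public void run() {
            try {
                Connection con = DriverManager.getConnection(mainDbUrl);
                con.setAutoCommit(false);
                PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO sensordata (sensor_id, reading_value, recorded_at) VALUES (?, ?, ?)");
                while (true) {
                    List<Reading> batch = new ArrayList<Reading>();
                    batch.add(localStore.take());      // wait for at least one reading
                    localStore.drainTo(batch, 999);    // then grab up to 1000 in total
                    for (Reading r : batch) {
                        ps.setString(1, r.sensorId);
                        ps.setDouble(2, r.value);
                        ps.setTimestamp(3, r.recordedAt);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                    con.commit();                      // one transaction per batch
                }
            } catch (Exception e) {
                e.printStackTrace();                   // real code: reconnect, retry, alert
            }
        }
    }

With file-based or MQ-based local persistence the run() loop would read from the file or queue instead, but the batching and per-batch commit stay the same.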

The drawback of this solution is that there will be some lag between a sensor reporting data and that data appearing in the main database.

The advantage of this solution is that it gives you high performance, scalability and failover.

Gladwin Burboz
I am sorry, I think 10,000 sensors each sending **one** reading per 15 minutes will only result in 10,000 rows of data per 15 minutes. Or... are you talking about 10,000 sensors or 150,000 sensors?
Bandi-T
Sorry, my bad, I don't know what I was smoking there. Updated the solution.
Gladwin Burboz
Extended with a recommendation for a Java/JEE solution.
Gladwin Burboz
Bahh, this is ENTIRELY TOO BIG for my taste! :) I would go with a small and streamlined daemon running on unix, accepting all the connections from the sensors and batching inserts into the database. See http://www.kegel.com/c10k.html for details. If you need fault-tolerance, go with some kind of HA/clustering on top of that.
Bandi-T
@Bandi: Yes, it is big, but so is the requirement. The streamlined daemon you mention can't give quick transaction response even with HA/clustering, as the database will be the bottleneck. Furthermore, how do you define a process for adding/removing sensors and daemons (scalability) without affecting the business? How would sensors know which daemon to send data to? What happens if one of the daemons goes down?
Gladwin Burboz
@Bandi: The link you gave is nice, but it only covers two aspects: different threading strategies and the use of non-blocking IO instead of blocking IO. These aspects only cover how an application instance in the above solution will handle each client (sensor).
Gladwin Burboz
@GB: I was able to INSERT about 8000 rows a second into a Postgres database in 1999 on my 90 MHz Pentium desktop computer. And that was something like Postgres 6.x, that is, without all the transaction logs and checkpoints 8.x has and the resulting speedups. With 1000 sensors we are talking about 1 row/second; scale that with the number of thousands of sensors: if you have 15,000 sensors that is 15 rows/second, if you have 150,000 you have 150 rows/second. I would find it hard to believe that in 2010 any reasonably designed system would not be able to **easily** cope with that.
Bandi-T
@GB: I have a better question for you: what will your sensors do when there is a network outage? They can't access the database server, they can't access your cluster, they can't access your message queue. What are your sensors going to do with the measurements? Buffer them, or throw them away? Are you really prepared to lose measurement data just because of a network failure?
Bandi-T
@Bandi (network outage): Having a reliable network is a totally different issue. And in that end-of-the-world scenario, it depends on what the business dictates and what the capabilities of the sensors are.
Gladwin Burboz
@Bandi (8000 rows a second) - 1: Did you consider network lag? Also, adding transaction logs and checkpoints will decrease performance, not increase it.
Gladwin Burboz
@Bandi (8000 rows a second) - 2: I am not clear about your calculations. 1000 sensors = 1 row/sec, 15,000 sensors = 15 rows/sec, 150,000 sensors = 150 rows/sec. How does increasing the number of sensors increase database performance?
Gladwin Burboz
@GB: I am sorry, I must bow out as we are way beyond the OP's original question.
Bandi-T
+2  A: 

This means you are going to get about 1 record per second for every thousand sensors you have, or about 2.5 million rows per month for every thousand sensors.

Postgres has inheritance and partitioning. That would make it practical to have tables like:

  • sensordata_current
  • sensordata_2010_01
  • sensordata_2009_12
  • sensordata_2009_11
  • sensordata_2009_10
  • ...

each table containing measurements for one month. A parent table sensordata can then be created that "consists" of these child tables, meaning queries against sensordata automatically go through the child tables, but only the ones the planner deduces can contain data for that query. So if you partition your data by month (which is a date range), express that with a date constraint on each child table, and query by date range, then the planner - based on the child-table constraints - will be able to exclude from execution of the query those child tables that cannot contain rows satisfying the date range.

When a month is complete (say January 2010 has just turned into February 2010), you would rename sensordata_current to the just-completed month (2010_01), create a new sensordata_current, move into the newly created sensordata_current any rows from 2010_01 that have a timestamp in February, and finally add a constraint to 2010_01 expressing that it only holds data from January 2010. Also drop unneeded indices on 2010_01. In Postgres this can all be made atomic by enclosing it in a transaction; see the sketch after the next paragraph.

Alternatively, you might leave _current alone, create a new 2010_01, and move all the January rows into it from _current (then optionally vacuum _current to immediately reclaim the space, though if your rows are constant size there is not much point in doing that with recent Postgres versions). Your move (SELECT INTO / DELETE) will take longer in this case, but you won't have to write code to recreate indices, and this also preserves other details (referential integrity, etc.).
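
As a sketch of the rename-based rotation, done from Java over JDBC in a single transaction (the table name sensordata, the timestamp column recorded_at, and the connection URL are my own illustrative assumptions; Postgres DDL is transactional, so the whole block commits or rolls back together):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Month-end rotation sketch: close January 2010 and open a fresh _current.
    public class MonthlyRotation {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/sensordb");
            con.setAutoCommit(false);                  // everything below is atomic
            Statement st = con.createStatement();

            // 1. The just-completed month keeps its rows under a new name.
            st.execute("ALTER TABLE sensordata_current RENAME TO sensordata_2010_01");

            // 2. A fresh current partition inheriting from the parent table
            //    (recreate any indices _current needs here as well).
            st.execute("CREATE TABLE sensordata_current () INHERITS (sensordata)");

            // 3. Move back into _current any rows that already have a February timestamp.
            st.execute("INSERT INTO sensordata_current " +
                       "SELECT * FROM sensordata_2010_01 WHERE recorded_at >= DATE '2010-02-01'");
            st.execute("DELETE FROM sensordata_2010_01 WHERE recorded_at >= DATE '2010-02-01'");

            // 4. Tell the planner what the closed partition contains, so it can
            //    be excluded from queries outside January 2010.
            st.execute("ALTER TABLE sensordata_2010_01 ADD CONSTRAINT sensordata_2010_01_check " +
                       "CHECK (recorded_at >= DATE '2010-01-01' AND recorded_at < DATE '2010-02-01')");

            con.commit();
            con.close();
        }
    }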

With this setup, removing old data is as quick and efficient as dropping a child table. And migrating old data away is efficient too, since the child tables are also accessible directly.

For more details see Postgres data partitioning.

Bandi-T
The advantage of the above solution is that if you are retrieving data for 2009_12, the DBMS only scans one table, sensordata_2009_12. But this solution will still be a problem if thousands of sensors issue insert statements at once, causing a database bottleneck and making some transactions time out.
Gladwin Burboz
I would consider that to be out of scope of the OP's question. That said, I hope no one ever considered having the thousands of sensors connect directly to the database to issue INSERT statements?! The moment you have a centralized server that the sensors connect to, you can easily batch the received measurements into transactions of several (a hundred, a thousand) INSERTs (or even use bulk load if your database has it and it performs better), significantly reducing the transaction load on the database.
Bandi-T
+3  A: 

I'd have a couple of concerns about this design:

  1. Hibernate is an ORM tool. It demands an object model on one side and a relational one on the other. Do you have an object representation? If not, I'd say that Hibernate isn't the way to go. If it's a simple table mapping mechanism you'll be fine.
  2. Your situation sounds like war: long periods of boredom surrounded by instants of sheer terror. I don't know if your design uses asynchronous mechanisms between the receipt of the sensor data and the back end, but I'd want to have some kind of persistent queuing mechanism to guarantee delivery of all the data and an orderly line while they were waiting to be persisted. As long as you don't need to access the data in real time, a queue will guarantee delivery and make sure you don't have thousands of requests showing up at a bottleneck at the same time.
  3. How are you timestamping the sensor readings as they come in? You might want a column with precision down to nanoseconds to get these right. (A minimal mapping sketch follows this list.)
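
If the readings are simple, the Hibernate side can stay very small. A minimal, hypothetical mapping (entity, table and column names are mine, not from the question) covering point 1, with a timestamp column for point 3 - java.sql.Timestamp carries nanosecond precision on the Java side, though how much of it survives depends on the database column type:

    import java.sql.Timestamp;
    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Hypothetical mapping of a single sensor reading.
    @Entity
    @Table(name = "sensordata")
    public class SensorReading {

        @Id
        @GeneratedValue
        private Long id;

        @Column(name = "sensor_id", nullable = false)
        private String sensorId;

        @Column(name = "reading_value", nullable = false)
        private double value;

        // When the reading was taken (or received, if the sensors do not report a time).
        @Column(name = "recorded_at", nullable = false)
        private Timestamp recordedAt;

        protected SensorReading() { }                  // required by Hibernate

        public SensorReading(String sensorId, double value, Timestamp recordedAt) {
            this.sensorId = sensorId;
            this.value = value;
            this.recordedAt = recordedAt;
        }
    }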

Are the sensors event-driven or timed?

Sounds like a great problem. Good luck.

duffymo
@OP: Yes, it is a good idea to have the sensors report their data not all at the same time, but spread as evenly over the period as possible. And I agree with not necessarily recommending Hibernate for this - you have millions of rows of data, which is probably big enough to warrant having as efficient and as direct a representation as possible.
Bandi-T
A persistent queuing mechanism is definitely a good idea. Add application clustering to that and it would be highly scalable.
Gladwin Burboz
@GB: I would go another way, and make the sensors smart enough to batch their measurements and keep retrying (in a smart way) if the central database times out on them or doesn't acknowledge having accepted their measurements.
Bandi-T
@Bandi: Imposing such requirements on these sensors would be hard, as I assume the sensors have limited resources. Also see my Java/JEE solution.
Gladwin Burboz
@GB: Well, maybe, but maybe not - although I would think the same, the fact that the sensors will report over TCP/IP (as everyone seems to be assuming) does already assume quite some processing power in the sensors.
Bandi-T
@duffymo: Well, there is an object model for the sensor updates. This same application can be used for different kinds of sensors, and their message structures will vary (and in some cases a message structure will evolve, and one will need to archive and make sense of the old data). So there is a whole other exercise of tying the MessageFormat system to the DB (i.e. each stored message would know which message format it belonged to, which will help in making reports). I am still investigating whether ORM is helpful in this scenario or the opposite.
geoaxis
@Bandi-T: The sensors are off the shelf, so there is not much possibility to tinker with them. @GB: the sensors are indeed very power-sensitive in most cases.
geoaxis
A: 

Is it a requirement that these sensors connect directly to an application to upload their data, and that this application is responsible for writing the data to the database?

I would consider having the sensors write data to a message queue instead, and having your "write to DB" application be responsible for picking up new data from the queue and writing it to the database. This seems like a pretty classic example of "message producers" and "message consumers", i.e. the sensors and the application, respectively.

This way, the sensors are not affected if your "write to DB" application has any downtime, or if it has any performance issues or slowdowns from the database, etc. This would also allow you to scale up the number of message consumers in the future without affecting the sensors at all.

It might seem like this type of solution simply moves the possible point of failure from your consumer application to the message queue, but there are several options for making the queue fault-tolerant - clustering, persistent message storage, etc.

Apache ActiveMQ is a popular message queue system in the Java world.
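
For reference, a minimal sketch of the producer/consumer split using ActiveMQ's JMS API. The broker URL, the queue name and sending each reading as a simple text message are assumptions of mine; and since the sensors themselves are dumb GPRS devices, in practice a small gateway process would do the publishing on their behalf:

    import javax.jms.Connection;
    import javax.jms.DeliveryMode;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    // Sketch: one side enqueues a reading, the other drains the queue and
    // hands the payload to the "write to DB" code.
    public class QueueExample {

        private static final String BROKER_URL = "tcp://localhost:61616";
        private static final String QUEUE_NAME = "sensor.readings";

        // Producer side: whatever process receives data from a sensor.
        public static void publish(String reading) throws Exception {
            Connection con = new ActiveMQConnectionFactory(BROKER_URL).createConnection();
            con.start();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue(QUEUE_NAME));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);   // survives broker restarts
            producer.send(session.createTextMessage(reading));
            con.close();
        }

        // Consumer side: the "write to DB" application.
        public static void consumeOne() throws Exception {
            Connection con = new ActiveMQConnectionFactory(BROKER_URL).createConnection();
            con.start();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createQueue(QUEUE_NAME));
            Message msg = consumer.receive(5000);                // wait up to 5 seconds
            if (msg instanceof TextMessage) {
                System.out.println("Would insert into DB: " + ((TextMessage) msg).getText());
            }
            con.close();
        }
    }

A real producer would of course reuse the connection rather than open a new one per message, and the consumer would typically loop (or use a MessageListener) and batch its inserts.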

matt b
Yes, this is what I had considered initially (but had not put into the right words).
geoaxis
Any reason why you discarded the idea?
matt b
@matt: I meant that this was what I had in mind when describing the problem. I should have mentioned that I would be using Hibernate with ActiveMQ, which I still am. :)
geoaxis