I have recently been researching NoSQL options. My scenario is as follows:

We collect and store data from custom hardware at remote locations around the world. We record data from every site every 15 minutes, and would eventually like to move to every 1 minute. Each record has between 20 and 200 measurements. Once set up, the hardware records and reports the same measurements every time.

The biggest issue we face is that we get a different set of measurements from every project. We measure roughly 50-100 different measurement types, and any project can have any number of each type. No preset set of columns can accommodate the data, so we create each project's data table with the exact columns it needs when we set up and configure the project on the system.

We provide tools to help analyze the data. This typically includes more calculations and data aggregation, some of which we also store.

We are currently using a MySQL database with a table for each client. There are no relations between tables.

NoSQL seems promising because we could store a project_id and a timestamp while leaving the rest of the fields unconstrained. That would mean one table, more relationships in the data, and still room for the variety of measurements.

Is a 'NoSQL' solution right for this job? If so, which one?

I have been investigating MongoDB and it seems promising...

Example for Clarification:

Project 1 has 5 data points recorded, so its MySQL table's columns look like: timestamp, temp, wind speed, precipitation, irradiance, wind direction

Project 2 has 3 data points recorded, so its MySQL table's columns look like: timestamp, temp, irradiance, temp2
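
For illustration, here is what the two per-project tables above might look like as MySQL DDL (table and column names are hypothetical; the real tables follow whatever each project is configured with):

    -- Each project gets its own table with exactly the columns it needs.
    CREATE TABLE project_1_data (
      ts             DATETIME NOT NULL PRIMARY KEY,
      temp           FLOAT,
      wind_speed     FLOAT,
      precipitation  FLOAT,
      irradiance     FLOAT,
      wind_direction FLOAT
    );

    CREATE TABLE project_2_data (
      ts         DATETIME NOT NULL PRIMARY KEY,
      temp       FLOAT,
      irradiance FLOAT,
      temp2      FLOAT
    );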

A: 

I am assuming that if you have a lot of clients you will end up with a lot of tables. I would first remove this restriction and move to a single-table model, or to a clients table and a data table with appropriate relations. That way you could keep MySQL. Don't assume MySQL is bad at everything.

In terms of NoSQL, this depends on your data model and usage patterns, but if you have lots of clients and you prefer this model, CouchDB views could solve the problem: CouchDB can support thousands of views, so you could store all the data in one database but have a view for each client. I have no idea how MongoDB would solve this problem.

abdollar
I should be more specific. We use a MySQL database for relational data and do have multi-tenancy. Because the data points differ for each client, each client has their own data table that stores strictly measurement data, with each measurement in its own column.
Alan Peabody
I don't think the author is necessarily saying that MySQL is bad, but rather wondering if there's a better option. A key/value store seems like a great way to accommodate different columns for each project.
Beerlington
You mentioned that "We are currently using a MySQL database with a table for each client" - that seems silly if you have a lot of clients. Maybe I just don't understand the example yet, as it still seems really vague.
abdollar
The alternative would be to put it all in one table with possibly 500+ columns, most of which would be null for every project. That seems like a worse alternative to me. Unfortunately, every site records a different number of measurements; there is no consistency, nor is it possible to enforce any.
Alan Peabody
If you can represent each client as a number from 0 to 65k, then a single SMALLINT column is enough to tag each record with its client (sketched below). You can use it in SQL to get the set of clients or to filter down to a single client's records.
abdollar
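
For illustration, a minimal sketch of that suggestion, assuming a hypothetical single measurements table rather than one per client:

    -- One SMALLINT column identifies the client on every row.
    ALTER TABLE measurements
      ADD COLUMN client_id SMALLINT UNSIGNED NOT NULL;

    -- The set of clients present in the data:
    SELECT DISTINCT client_id FROM measurements;

    -- All records for a single client:
    SELECT * FROM measurements WHERE client_id = 42;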
If you are really looking at 500 columns of custom data, then CouchDB would be a better solution: you can keep one database but have a lot of views of that data.
abdollar
Your last comment has been the most helpful. I am looking at NoSQL options because the columns always differ per client, and it would let us get away from the one-table-per-client setup. We need to store the two fields every project shares (project/client id and timestamp, both indexed) as well as all the different measurements, which again vary completely by project.
Alan Peabody
+2  A: 

Ok, I might get flamed for not answering your question directly, but I'm going to say it anyway because I think it's something you should consider. I don't have experience with NoSQL databases, so I can't recommend one, but as far as relational databases go, there might be a better design for your situation.

First of all - drop the one-table-per-customer approach. Instead, I would architect a many-to-many schema with the following tables:

  • Customers
  • MeasurementTypes
  • Measurements

The Customers table will contain customer information, and a unique CustomerID field:

   CustomerID      | CustomerName  |   ..and other fields
 ---------------------------------------------------------------------

The MeasurementTypes table would describe each type of measurement that you support, and assign a unique name (the MeasurementType field) to refer to it:

   MeasurementType | Description   |  ..and other pertinent fields
 ---------------------------------------------------------------------

The Measurements table is where all the data is aggregated. You would have one record for each data point collected, stamped with the customer id, the measurement type, a timestamp, and a unique "batch" identifier (to group the data points from one collection cycle together) - and of course the measurement value. If you need different value types for your measurements you may need to get a little creative with the design, but most likely all measurement values can be represented by a single data type.

   Customer  | MeasurementBatch | MeasurementType | Timestamp |  Value
  ----------------------------------------------------------------------
       1     |      {GUID}      |  'WIND_SPEED'   |    ...    |   ...
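
A hedged sketch of the schema above in SQL (all names and types are illustrative, not prescriptive; I'm using generic MySQL-flavored DDL here):

    CREATE TABLE Customers (
      CustomerID   INT AUTO_INCREMENT PRIMARY KEY,
      CustomerName VARCHAR(100) NOT NULL
      -- ..and other fields
    );

    CREATE TABLE MeasurementTypes (
      MeasurementType VARCHAR(32) PRIMARY KEY,   -- e.g. 'WIND_SPEED'
      Description     VARCHAR(255)
      -- ..and other pertinent fields
    );

    CREATE TABLE Measurements (
      CustomerID       INT NOT NULL,
      MeasurementBatch BINARY(16) NOT NULL,      -- GUID grouping one collection cycle
      MeasurementType  VARCHAR(32) NOT NULL,
      Ts               DATETIME NOT NULL,
      Value            DOUBLE,
      PRIMARY KEY (CustomerID, Ts, MeasurementType, MeasurementBatch),
      FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID),
      FOREIGN KEY (MeasurementType) REFERENCES MeasurementTypes (MeasurementType)
    );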

This way, you have a very flexible design that allows you to add as many data points for each customer as needed, independently of other customers. And you get the benefits of a relational database.

If your SQL engine supports it, you could even partition the Measurements table by the customer column.
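
In MySQL, for example, partitioning the sketch above by customer could look like this (hash partitioning is one option; note that MySQL does not allow foreign keys on partitioned tables, so the constraints above would have to be dropped or enforced by the application):

    -- Spread Measurements across 16 partitions by customer; MySQL requires
    -- the partitioning column to appear in every unique key, which the
    -- primary key above already satisfies.
    ALTER TABLE Measurements
      PARTITION BY HASH (CustomerID)
      PARTITIONS 16;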

Hope this helps.

EDIT

I must mention that I'm not in any way affiliated with Microsoft, nor am I trying to give them free advertising; it just so happens I'm most familiar with their SQL Server.

Based on Alan's comment regarding whether a SQL database can support a data volume of several billion records per year, there is a nice summary of limitations/specs for MS SQL Server available here:

http://msdn.microsoft.com/en-us/library/ms143432.aspx

It seems that the only limit on how many records you can have per table is the available disk space (and probably RAM, if you want to run certain reports on that data).

Miky Dinescu
We have looked into something like this. The issue we foresee is the size of the Measurements table. At our current 15-minute collection rate we would hit about 5.25 million records per project per year (4 samples/hour x 24 x 365 = ~35,000 collection cycles, times ~150 measurements each). When we move to a one-minute interval we are talking 78.8 million records per year per project. With 100 projects we would hit roughly 8 billion records per year. Is that something MySQL can handle?
Alan Peabody
I'm not sure about MySQL's limits on the number of records per table, but with careful database design (i.e. partitioning) a product like MS SQL Server or Oracle should have no problem handling a few hundred million records in a table, especially since each record would be rather small. If you use an integer identifier for the MeasurementType and a unix timestamp, a record comes out to between 32 and 44 bytes depending on the data type you use for Value.
Miky Dinescu
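
For the curious, a rough sketch of the compact row that comment describes (column sizes are approximate, and the exact on-disk cost is engine-dependent):

    -- A leaner variant of the Measurements table with integer ids.
    CREATE TABLE MeasurementsCompact (
      CustomerID        INT,         -- 4 bytes
      MeasurementBatch  BINARY(16),  -- 16 bytes (a GUID)
      MeasurementTypeID INT,         -- 4 bytes (id instead of a name)
      Ts                INT,         -- 4 bytes (unix timestamp)
      Value             FLOAT        -- 4 bytes (8 for DOUBLE)
    );
    -- 4 + 16 + 4 + 4 + 4 = 32 bytes of column data with FLOAT, 36 with
    -- DOUBLE; row headers and indexes push this toward the 44-byte figure.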
+3  A: 

The simple answer is that there is no simple answer to this sort of problem; the only way to find out what works for your scenario is to invest R&D time in it.

The question is hard to answer because the OP hasn't spelled out the performance requirements. It appears to be roughly 75M records per year per project over a number of customers, with a write rate of one record per customer per minute (which is low), but I don't have figures for the required read/query performance.

Effectively you already have a sharded database using horizontal partitioning, because you're storing each customer in a separate table. This is good and will increase performance. However, you haven't yet established that you have a performance problem, so that needs to be measured and the problem size assessed before you can fix it.

A NoSQL database is indeed a good way of fixing performance problems with a traditional RDBMS, but it will not provide automatic scalability and is not a general solution. You need to identify your performance problem and then design the (NoSQL) data model that provides the solution.

Depending on what you're trying to achieve, I'd look at MongoDB, Apache Cassandra, Apache HBase, or Hibari.

Remember that NoSQL is a vague term, typically encompassing:

  • Applications that are performance-intensive for either reads or writes, often sacrificing one in favor of the other.
  • Distribution and scalability.
  • Different methods of persistence (RAM/disk).
  • A more structured/defined access pattern, making ad-hoc queries harder.

So, in the first instance, I'd see whether a traditional RDBMS can achieve the required performance using all available techniques; get a copy of High Performance MySQL and read the MySQL Performance Blog.

Rev1:

In light of your comments, I think it is fair to say that you could achieve what you want with one of the above NoSQL engines.

My primary recommendation would be to get your data model designed and implemented; what you're using at the moment isn't really right.

So look at the Entity-Attribute-Value (EAV) model, as I think it is exactly right for what you need.

You need to get your data model right before you can consider which technology to use; to be honest, modifying schemas dynamically isn't a data model.

I'd use a traditional SQL database to validate and test the new data model, as the management tools are better and it's generally easier to work with schemas as you refine the design.
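
For example, one classic EAV pain point worth testing early is reading the rows back into the wide, per-project shape the analysis tools expect. A hypothetical pivot for the Project 1 columns from the question, against a Measurements table like the one sketched in the other answer, might look like:

    -- Pivot EAV rows into one wide row per timestamp for one customer.
    -- Measurement type names are assumed, not taken from a real schema.
    SELECT Ts,
           MAX(CASE WHEN MeasurementType = 'TEMP'           THEN Value END) AS temp,
           MAX(CASE WHEN MeasurementType = 'WIND_SPEED'     THEN Value END) AS wind_speed,
           MAX(CASE WHEN MeasurementType = 'PRECIPITATION'  THEN Value END) AS precipitation,
           MAX(CASE WHEN MeasurementType = 'IRRADIANCE'     THEN Value END) AS irradiance,
           MAX(CASE WHEN MeasurementType = 'WIND_DIRECTION' THEN Value END) AS wind_direction
    FROM   Measurements
    WHERE  CustomerID = 1
      AND  Ts >= '2010-01-01' AND Ts < '2010-02-01'
    GROUP  BY Ts;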

Richard Harrison
One of the biggest reasons we are looking into a NoSQL option is greater flexibility in our database columns. One project might have 5 columns, another might have 150. There are also cases where we need to change the columns once a project is live. We call this process "rebuilding the table": our application redefines the table structure by adding or removing fields.
Beerlington
EAV sounds right; to be honest, "rebuilding the table" is a horrible way of storing data, and a clue that the schema is wrong. What do you think of EAV?
Richard Harrison
Rebuilding the table is a horrible approach, and is why we are looking to change it. :) - EAV is the same approach Miky D recommended above. With two recommendations we are going to take a closer look at it. I think we are going to branch in a few weeks and try both an EAV and a MongoDB approach. I am not going to mark either answer as accepted right now, but will update this question as we go along with our findings of what works best in our exact situation. Thank you for the suggestions and links.
Alan Peabody