I need to store a large number of small data objects (millions of rows per month). Once they're saved, they won't change. I need to:

  • store them securely
  • use them for analysis (mostly time-oriented)
  • retrieve some raw data occasionally
  • It would be nice if it could be used with JasperReports or BIRT

My first attempt was Infobright Community, a column-oriented, read-only storage engine for MySQL.

On the other hand, people say that a NoSQL approach could be better. Hadoop+Hive looks promising, but the documentation looks poor and the version number is below 1.0.

I have also heard about Hypertable, Pentaho, MongoDB, ...

Do you have any recommendations?

(Yes, I found some related topics here, but they are a year or two old.)

Edit: Other candidates: MonetDB, InfiniDB, LucidDB. What do you think?

A: 

If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. Infobright may work for you, and there are several other solutions as well, but you may also want to look at plain old MySQL with the ARCHIVE storage engine. Each record is compressed as it is stored and, IIRC, the engine is designed for your type of workload (write once, never update), although I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.

As for the key-value stores (e.g. NoSQL), yes, they can work as well, and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any of these stores, so I don't know how well they work.

My only concern with your data set is that, since you mentioned time, you will want to make sure that whatever solution you choose lets you archive data older than a certain age. It's a common data warehouse practice to keep only N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, is very useful.
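
To make the "keep N months online" idea concrete, here is a minimal Python sketch of the bookkeeping only: given one partition per month, work out which ones fall outside the retention window. The function names (`months_back`, `partitions_to_archive`) and the three-month window are made up for illustration. In a real RDBMS you would then drop, detach, or export those partitions using whatever mechanism the database provides.

```python
from datetime import date

def months_back(today: date, n: int) -> date:
    """First day of the month that lies n months before today's month."""
    idx = today.year * 12 + (today.month - 1) - n
    return date(idx // 12, idx % 12 + 1, 1)

def partitions_to_archive(partitions, today: date, keep_months: int):
    """Return the monthly partitions (given as first-of-month dates) that are
    older than the keep_months most recent months, current month included."""
    cutoff = months_back(today, keep_months - 1)
    return [p for p in sorted(partitions) if p < cutoff]

# Hypothetical example: five monthly partitions, keep the latest three online.
online = [date(2009, 11, 1), date(2009, 12, 1),
          date(2010, 1, 1), date(2010, 2, 1), date(2010, 3, 1)]
to_archive = partitions_to_archive(online, today=date(2010, 3, 15), keep_months=3)
```

With one partition per month, archiving becomes an operation on whole partitions instead of a large `DELETE` over millions of rows.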

blockcipher
+1  A: 

You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.

GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
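
A rough illustration of the subtable idea, written in Python with the standard library's `sqlite3` as a stand-in. This is not how GridSQL or PostgreSQL implement partitioning; it only shows the principle of routing each row to a per-month table so that a query for one month touches just that table. The `events_YYYY_MM` naming and the helper functions are hypothetical.

```python
import sqlite3
from datetime import datetime

def subtable_for(year: int, month: int) -> str:
    # Name is built from integers only, so it is safe to interpolate into SQL.
    return f"events_{year:04d}_{month:02d}"

def insert_event(conn: sqlite3.Connection, ts: datetime, payload: str) -> None:
    """Route the row to the subtable for its month, creating it on first use."""
    table = subtable_for(ts.year, ts.month)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (ts TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute(
        f"INSERT INTO {table} (ts, payload) VALUES (?, ?)",
        (ts.isoformat(), payload),
    )

def count_in_month(conn: sqlite3.Connection, year: int, month: int) -> int:
    """Count rows for one month by reading only that month's subtable."""
    table = subtable_for(year, month)
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        return 0
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect(":memory:")
insert_event(conn, datetime(2010, 1, 5, 12, 0), "a")
insert_event(conn, datetime(2010, 1, 20, 8, 30), "b")
insert_event(conn, datetime(2010, 2, 1, 0, 0), "c")
```

In PostgreSQL the routing and the pruning of irrelevant subtables are handled by the database itself, so the application does not need this kind of logic.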

Mason
Just to add: yes, I work for EnterpriseDB, which sponsors GridSQL.
Mason