Schemaless Data Cache: NoSQL or Other Alternatives?

I am evaluating a number of NoSQL implementations (RavenDB and MongoDB at the moment) as a means of solving a specific set of requirements that involve storage/retrieval of data that is schema-less. I want to get some feedback on whether NoSQL is the direction I should be looking in, or if there are other (potentially simpler) options.

Essentially we have a software product that (among other things) defines a basic domain model that consists of a few related entities, each of which has a number of attributes (key/value). As we release to the customer, we work with them to set up the attributes and values, which is essentially the configuration of the system. This is fairly straightforward, and because the design is known up front, we don't need anything dynamic to achieve this and make it perform (we will use an RDBMS). The attributes themselves are not known up front, but again this is not a problem, as this part of the system pretty much revolves around an attribute model.

The problem is that for different customers, and AFTER we release and are in production, we find that we need to query for specific sets of attribute data that we knew nothing about when we compiled and released the code (and before we configured the attributes for the customer). We basically need to produce data from the attribute maps that we can store (we won't know the structure up front) and then query that stored data later in ways we can't anticipate. The thinking right now is that we can create hooks that get hit during processing and allow us to plug in libraries (likely via MEF) that create the data so it gets stored, and then query it later when needed (not for reporting--usually to create additional data/attributes).

(Note that creating the hooks and plug-in libraries is a separate problem, and is not intended to be part of this question.)

A common scenario might be: "I want to know how many times xxx occurred in the last 10 days". So I would create a plug-in that would recognize that xxx has occurred, and write it to a data store with a date/time. Then I would create another plug-in (probably in the same DLL) that would perform the query, and add an attribute to the model called "CountOfxxxInLast10Days". Another scenario might be to create configurable lookups. So I might have a plug-in that runs at startup to create/update a table of lookup data that could convert one attribute value to another, or (more likely) a range of values that would convert to a lookup value. So the conversion plug-in might add a table with columns: bottom_value, top_value, multiplier, and the query plug-in would query the table using an attribute value, like "SELECT multiplier FROM table WHERE [attribute_value] BETWEEN bottom_value AND top_value". The query plug-in might then add the result to an attribute called "Multiplier".
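To make the second scenario concrete, here is a minimal sketch of the range lookup, using Python's built-in sqlite3 module purely as a stand-in for whatever store ends up being used. The table name multiplier_lookup, the sample ranges, and the "OrderTotal" attribute are made up for illustration; only the three column names and the BETWEEN query come from the description above.

```python
import sqlite3

# In-memory database standing in for whatever store the plug-ins would share.
conn = sqlite3.connect(":memory:")

# Startup plug-in: create the lookup table (table name and rows are hypothetical).
conn.execute(
    "CREATE TABLE multiplier_lookup ("
    " bottom_value REAL NOT NULL,"
    " top_value REAL NOT NULL,"
    " multiplier REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO multiplier_lookup VALUES (?, ?, ?)",
    [(0, 99, 1.0), (100, 499, 1.5), (500, 999, 2.0)],
)


def lookup_multiplier(attribute_value):
    """Query plug-in: return the multiplier whose range contains the value, or None."""
    row = conn.execute(
        "SELECT multiplier FROM multiplier_lookup"
        " WHERE ? BETWEEN bottom_value AND top_value",
        (attribute_value,),
    ).fetchone()
    return row[0] if row else None


# The result is written back to the attribute map as "Multiplier".
# "OrderTotal" is an invented attribute name used only for this example.
attributes = {"OrderTotal": 250}
attributes["Multiplier"] = lookup_multiplier(attributes["OrderTotal"])
```

A value that falls outside every configured range yields None here, so the plug-in would have to decide what the "Multiplier" attribute should be in that case.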

In certain cases, old data could be purged after a specified period of time. In the first scenario described above, it might be desirable to remove data from the store/cache that was older than ten days.
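As a rough sketch of the first scenario together with this purge, again using Python's sqlite3 module only as a stand-in store: one function records an occurrence, one counts occurrences in the last ten days, and one deletes anything older. The table name occurrences and all function names are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences (event_name TEXT NOT NULL, occurred_at TEXT NOT NULL)"
)


def stamp(moment):
    """Format timestamps uniformly so that string comparison matches time order."""
    return moment.isoformat(timespec="seconds")


def record(event_name, when):
    """Recording plug-in: store one occurrence with its date/time."""
    conn.execute("INSERT INTO occurrences VALUES (?, ?)", (event_name, stamp(when)))


def count_since(event_name, now, days=10):
    """Query plug-in: count occurrences within the last `days` days."""
    cutoff = stamp(now - timedelta(days=days))
    row = conn.execute(
        "SELECT COUNT(*) FROM occurrences WHERE event_name = ? AND occurred_at >= ?",
        (event_name, cutoff),
    ).fetchone()
    return row[0]


def purge_older_than(now, days=10):
    """Maintenance step: delete rows past the retention window; returns rows removed."""
    cutoff = stamp(now - timedelta(days=days))
    cur = conn.execute("DELETE FROM occurrences WHERE occurred_at < ?", (cutoff,))
    return cur.rowcount


now = datetime(2010, 8, 13, 12, 0, 0)
record("xxx", now - timedelta(days=1))
record("xxx", now - timedelta(days=9))
record("xxx", now - timedelta(days=15))

count = count_since("xxx", now)   # value for "CountOfxxxInLast10Days"
removed = purge_older_than(now)   # drops the 15-day-old row
```

Because the count and the purge use the same cutoff, purging never changes the result of the ten-day query.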

In other cases data would need to be persisted permanently, like in the second scenario above. It's possible this data could simply be re-created at startup, as opposed to held in a permanent store.

Additional requirements:

The datastore/cache can be backed up and restored while online
Can be replaced/recovered from the last backup in the case of a crash
Data survives events like machine reboot
Proven/production-tested technology

We are pretty committed to the .Net platform at this point, so any option would have to have a solid .Net client/API.

There are three possible options, each with pros and cons.

Reuse the RDBMS

You're already storing the entities in a relational database. You can store the undefined attributes in an extra table, that has a Key and Value column, and an EntityId column that references the entity to which the attributes belong. Basically, you'll be using part of your database as a key-value store.
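A minimal sketch of what such an extra table could look like, with Python's sqlite3 module standing in for your actual RDBMS. The table and column names (entity, entity_attribute, attr_key, attr_value) and the sample data are assumptions; only the Key/Value/EntityId layout comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The known, fixed part of the model.
conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# The extra table: one row per (entity, attribute key) pair.
conn.execute(
    "CREATE TABLE entity_attribute ("
    " entity_id INTEGER NOT NULL REFERENCES entity(id),"
    " attr_key TEXT NOT NULL,"
    " attr_value TEXT,"
    " PRIMARY KEY (entity_id, attr_key))"
)

conn.execute("INSERT INTO entity VALUES (1, 'Customer A')")
conn.executemany(
    "INSERT INTO entity_attribute VALUES (?, ?, ?)",
    [(1, "Region", "EMEA"), (1, "Tier", "Gold")],
)


def load_entity(entity_id):
    """Fetch an entity and all of its attributes with a single joined query."""
    rows = conn.execute(
        "SELECT e.name, a.attr_key, a.attr_value FROM entity e"
        " LEFT JOIN entity_attribute a ON a.entity_id = e.id"
        " WHERE e.id = ? ORDER BY a.attr_key",
        (entity_id,),
    ).fetchall()
    if not rows:
        return None
    # An entity without attributes produces one row with NULL key/value.
    attributes = {key: value for _, key, value in rows if key is not None}
    return {"name": rows[0][0], "attributes": attributes}


customer = load_entity(1)
```

Note that every value ends up in one text column in this layout, so numeric or date attributes would need converting on the way in and out.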

Advantages:

All your data is stored in a single database, meaning:
- you can retrieve an entity and all of its attributes in a single query,
- your application is less complicated, as it only has to interact with a single database.
You get all the ACID advantages of a relational database.

Disadvantages:

Relational databases aren't built to be key-value stores, so you may have performance issues. However, I expect the performance hit to be minimal, unless you plan to store a very, very large number of attributes.

Use a key-value store

Key-value stores, such as Redis and Riak, or the more advanced Apache Cassandra, are optimized for storing key-value pairs (no surprise there...). You can use a key-value store next to your RDBMS, dedicated to storing the attributes, while keeping the entities in your RDBMS.

Advantages:

Better performance than you'll get from an RDBMS, especially with large amounts of data.
Easier to scale out, as they are not constrained by ACID properties.

Disadvantages:

No guaranteed ACID properties. Instead you get so-called eventual consistency, meaning that the stored data may not always be consistent across servers. However, you'll only have to deal with this if you're scaling out. Also, most key-value stores allow you to tune how strict they are about consistency, to help overcome this problem.
Your application will run on two separate databases, increasing the complexity of your application.

Use a document database

You could use a document database to store just the attributes. But you can also take the plunge and store everything in a document database, including your entities.

Advantages:

All your data is stored in a single database, meaning:
- you can retrieve an entity and all of its attributes in a single operation, as you would store an entire entity, including its attributes, in a single document.
- your application is less complicated, as it only has to interact with a single database.
Easier to scale out, as they are not constrained by ACID properties.
Document databases aren't restricted to just key-values, so if you ever need to store a more complex attribute, you're already good to go.

Disadvantages:

No ACID guarantees, just like key-value stores. Most document databases can be tuned to overcome consistency problems though.
No understanding of relations between entities as in an RDBMS. A relational model is normalized, whereas documents are denormalized, to overcome having many relations. This may or may not be a big disadvantage, depending on your exact domain model.
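To illustrate the "entire entity in a single document" idea, here is a small sketch in which a plain Python dict stands in for the document database. This is not the API of MongoDB, RavenDB or CouchDB; the save/load functions, the "customers/1" id and the sample attributes are all invented for the example.

```python
import json

# A plain dict keyed by document id stands in for a document database here;
# a real one would persist and index the JSON documents.
store = {}


def save(doc_id, document):
    """Store the whole entity, attributes included, as one JSON document."""
    store[doc_id] = json.dumps(document)


def load(doc_id):
    """Retrieve the entity and all of its attributes in a single operation."""
    raw = store.get(doc_id)
    return json.loads(raw) if raw is not None else None


save("customers/1", {
    "name": "Customer A",
    "attributes": {
        "Region": "EMEA",
        # A nested value: more than a flat key/value pair can hold directly.
        "MultiplierRanges": [
            {"bottom_value": 0, "top_value": 99, "multiplier": 1.0},
            {"bottom_value": 100, "top_value": 499, "multiplier": 1.5},
        ],
    },
})

customer = load("customers/1")
```

The nested "MultiplierRanges" attribute shows the point about complex attributes: the structure travels with the entity, at the cost of the database knowing nothing about how it relates to other documents.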

Mature document database technologies

Apache CouchDB has quite a list of applications using it and receives positive feedback from the Stack Overflow community. It has a few drivers for .NET, but I can't tell you how mature these drivers are.

MongoDB has quite an impressive list of production deployments. There are three major drivers for .NET available, which all seem to be of good quality.

RavenDB has excellent support for .NET as it was designed for the .NET platform. However, I haven't been able to find examples of large production environments running on RavenDB. Still, I think it's definitely worth exploring.

I don't have much hands-on experience with any of them in production environments, so I don't know exactly how easy they are to back up and restore. But given that these NoSQL systems aren't as rigid as RDBMS systems, I would guess they are easier to back up and restore without downtime than an RDBMS.

Thanks for your extremely detailed answer. Note that the schemaless part of the system is not just a key/value store (look at the two mentioned scenarios).

Phil Sandler 2010-08-13 19:12:09

You're welcome :) Yes, I now realize you have more schema-less data than just key-value pairs. In that case you could use a document database as a separate database, as I described under 'Use a key-value store'.

Niels van der Rest 2010-08-13 19:19:51
