views:

101

answers:

3

I'm considering using MongoDB or CouchDB on a project that needs to maintain historical records. But I'm not sure how difficult it will be to store historical data in these databases.

For example, in his book "Developing Time-Oriented Database Applications in SQL," Richard Snodgrass points out tools for retrieving the state of data as of a particular instant, and he points out how to create schemas that allow for robust data manipulation (i.e. data manipulation that makes invalid data entry difficult).

Are there tools or libraries out there that make it easier to query, manipulate, or define temporal/historical structures for key-value stores?

edit:

Note that from what I hear, the 'version' data that CouchDB stores is erased during normal use, and since I would need to maintain historical data, I don't think that's a viable solution.

P.S. Here's a similar question that was never answered: key-value-store-for-time-series-data

+1  A: 

Yes, in CouchDB the revisions of a document are there for replication and are usually lost during compaction. I think UbuntuOne did something to keep them around longer but I'm not sure exactly what they did.

I have a document that I need the historical data on and this is what I do.

In CouchDB I have an _update function. The document has a "history" attribute which is an array. Each time I call the _update function to update the document I append to the history array the current document (minus the history attribute) then I update the document with the changes in the request body. This way I have the entire revision history of the document.

This is a little heavy for large documents, there are some javascript diff tools I was investigating and thinking about only storing the diff between the documents but haven't done it yet.

http://wiki.apache.org/couchdb/How%5Fto%5Fintercept%5Fdocument%5Fupdates%5Fand%5Fperform%5Fadditional%5Fserver-side%5Fprocessing

Hope that helps.

mikeal
Thanks Mike. That looks like a reasonable solution, and certainly sounds easy enough. I wonder if there are any papers or articles out there that explore this topic in relation to key-value stores in more details .
btelles
The issue is a little more complicated because some key-value stores do keep track of changes (like CouchDB) but their internal system for tracking changes might not be good for storing changes permanently (like CouchDB) :)MongoDB's indexing system would be painful to use if you needed to index the history, while CouchDB's map/reduce is sufficiently flexible.I don't think this can be generalized to a question about key-value stores in general.
mikeal
Gotchya. Thanks.
btelles
+1  A: 

I can't speak for mongodb but for couchdb it all really hinges on how you write your views.

I don't know the specifics of what you need but if you have a unique id for a document throughout its lifetime and store a timestamp in that document then you have everything you need for robust querying of that document.

For instance:

document structure:

{ "docid" : "doc1", "ts" : <unix epoch> ...<set of key value pairs> }

map function:

function (doc) {
  if (doc.docid && doc.ts)
    emit([doc.docid, doc.ts], doc);
  }
}

The view will now output each doc and its revisions in historical order like so:

["doc1", 1234567], ["doc1", 1234568], ["doc2", 1234567], ["doc2", 1234568]

You can use view collation and start_key or end_key to restrict the returned documents.

start_key=["doc1", 1] end_key=["doc1", 9999999999999]

will return all historical copies of doc1

start_key=["doc2", 1234567] end_key=["doc2", 123456715]

will return all historical copies of doc2 between 1234567 and 123456715 unix epoch times.

see ViewCollation for more details

Jeremy Wall
+1  A: 

There are a couple options if you wanted to store the data in MongoDB. You could just store each version as a separate document, as then you can query to get the object at a certain time, the object at all times, objects over ranges of time, etc. Each document would look something like:

{
    object : whatever,
    date : new Date()
}

You could store all the versions of a document in the document itself, as mikeal suggested, using updates to push the object itself into a history array. In Mongo, this would look like:

db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})

// make changes to obj
...

db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})

A cooler (I think) and more space-efficient way, although less time-efficient, might be to store a history in the object itself about what changed in the object at each time. Then you could replay the history to build the object at a certain time. For instance, you could have:

{
    object : startingObj,
    history : [
        { date : d1, addField : { x : 3 } },
        { date : d2, changeField : { z : 7 } },
        { date : d3, removeField : "x" },
        ...
    ]
}

Then, if you wanted to see what the object looked like between time d2 and d3, you could take the startingObj, add the field x with the value 3, set the field z to the value of 7, and that would be the object at that time.

Whenever the object changed, you could atomically push actions to the history array:

db.foo.update({object : startingObj}, {$push : {history : {date : new Date(), removeField : "x"}}})
kristina
Thanks Kristina. That's a really nifty way of doing the changes. We're also looking at being able to query data in the Oracle-like form of: SELECT * FROM X AS OF yyyy-mm-dd. Or something that would be similarly simple. The problem with the method you proposed, then, would be that we'd have to recreate each object every time the query is run. We're running into a similar problem with our existing SQL database that was also designed with space in mind. But since space isn't really an issue, keeping a full copy probably meets our needs a little more. Thanks though!
btelles