We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:
Approximately 2000 connections, polled every 5 minutes, for approximately 600,000 new rows per day. In Postgres, we store this data, partitioned by day:
t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage. etc.
We write data with an optimistic stored proc that presumes the partition exists and creates it if necessary. We can insert very quickly.
For reading of the data, our use cases, in order of importance and current performance are:
* Single Service, Single Day Usage : Good Performance
* Multiple Services, Month Usage : Poor Performance
* Single Service, Month Usage : Poor Performance
* Multiple Services, Multiple Months : Very Poor Performance
* Multiple Services, Single Day : Good Performance
This makes sense because the partitions are optimised for days, which is by far our most important use case. However, we are looking at methods of improving the secondary requirements.
We often need to parameterise the query by hours as well, for example, only giving results between 8am and 6pm, so summary tables are of limited use. (These parameters change with enough frequency that creating multiple summary tables of data is prohibitive).
With that background, the first question is: Is CouchDB appropriate for this data? If it is, given the above use cases, how would you best model the data in CouchDB documents? Some options I've put together so far, which we are in the process of benchmarking are (_id, _rev excluded):
One Document Per Connection Per Day
{
service_id:555
day:20100101
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 60,000 new documents a month. Most new data would be updates to existing documents, rather than new documents.
(Here, the objects in usage are keyed on the timestamp of the poll, and the values the bytes in and byes out).
One Document Per Connection Per Month
{
service_id:555
month:201001
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 2,000 new documents a month. Moderate updates to existing documents required.
One Document Per Row of Data Collected
{
service_id:555
timestamp:1265248762
in:584
out:11342
}
{
service_id:555
timestamp:1265249062
in:94
out:1242
}
Approximately 15,000,000 new documents a month. All data would be an insert to a new document. Faster inserts, but I have questions about how efficient it's going to be after a year or 2 years with hundreds of millions of documents. The file IO would seem prohibitive (though I'm the first to admit I don't fully understand the mechanics of it).
I'm trying to approach this in a document-oriented way, though breaking the RDMS habit is difficult :) The fact you can only minimally parameterise views as well has me a bit concerned. That said, which of the above would be the most appropriate? Are there other formats that I haven't considered which will perform better?
Thanks in advance,
Jamie.