Here is an idea I have been thinking about but have not benchmarked. I expect it to shine in situations where:
- Insert throughput must be high
- Fetching individual documents by key is not required
- All data is fetched through views (possibly a different machine from the one receiving the inserts)
The Plan
Insert batches-of-batches of documents, and use views to serialize them back out nicely.
Example
Consider a log file with a simple timestamp and message string.
0.001 Start
0.123 This could be any message
0.500 Half a second later!
1.000 One second has gone by
2.000 Two seconds have gone by
[...]
1000.000 One thousand seconds have gone by
You might insert logs one message per document, e.g.:
{ "_id": "f30d09ef6a9e405994f13a38a44ee4a1",
"_rev": "1-764efa883dda1e11db47671c4a3bbd9e",
"timestamp": 0.123,
"message": "This could be any message"
}
The standard bulk docs optimization
The first optimization is to insert using _bulk_docs, as described in the CouchDB bulk-docs documentation.
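For reference, a _bulk_docs insert is just a single POST of a {"docs": [...]} wrapper around your documents. A minimal sketch, with the database name "logs" as a placeholder:

POST /logs/_bulk_docs
Content-Type: application/json

{ "docs": [
    {"timestamp": 0.001, "message": "Start"},
    {"timestamp": 0.123, "message": "This could be any message"}
  ]
}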
A secondary bulk insert optimization
However, a second optimization is to pre-batch the logs into larger Couch documents. For example, in batches of four (in the real world the batch size would be much higher):
{ "_id": "f30d09ef6a9e405994f13a38a44ee4a1",
"_rev": "1-764efa883dda1e11db47671c4a3bbd9e",
"logs": [
{"timestamp": 0.001, "message": "Start"},
{"timestamp": 0.123, "message": "This could be any message"},
{"timestamp": 0.500, "message": "Half a second later!"},
{"timestamp": 1.000, "message": "One second has gone by"}
]
}
{ "_id": "74f615379d98d3c3d4b3f3d4ddce82f8",
"_rev": "1-ea4f43014d555add711ea006efe782da",
"logs": [
{"timestamp": 2.000, "message": "Two seconds has gone by"},
{"timestamp": 3.000, "message": "Three seconds has gone by"},
{"timestamp": 4.000, "message": "Four seconds has gone by"},
{"timestamp": 5.000, "message": "Five seconds has gone by"},
]
}
Of course, you would insert these via _bulk_docs as well, effectively inserting batches of batches of data.
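As an untested sketch of the client side, the pre-batching might look like this in JavaScript (the database URL, the batch size, and the use of the built-in fetch API are my assumptions, not part of the technique):

var DB = "http://localhost:5984/logs"; // assumed database URL
var BATCH_SIZE = 4;                    // much higher in the real world

// Group individual log entries into batch documents.
function toBatchDocs(logs) {
  var docs = [];
  for (var i = 0; i < logs.length; i += BATCH_SIZE) {
    docs.push({ logs: logs.slice(i, i + BATCH_SIZE) });
  }
  return docs;
}

// Insert every batch document in one _bulk_docs round-trip:
// one HTTP request, and far fewer documents than log entries.
function insertLogs(logs) {
  return fetch(DB + "/_bulk_docs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ docs: toBatchDocs(logs) })
  });
}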
Views are still very easy
It is still very easy to serialize the logs back out into a view:
// map
function(doc) {
  if(doc.logs) {
    // Just unroll the log batches!
    for (var i = 0; i < doc.logs.length; i++) {
      var log = doc.logs[i];
      emit(log.timestamp, log.message);
    }
  }
}
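The map function lives in an ordinary design document. A minimal sketch, where the design document and view names (logs, by_timestamp) are placeholders of my own:

{ "_id": "_design/logs",
  "views": {
    "by_timestamp": {
      "map": "function(doc) { if(doc.logs) { for (var i = 0; i < doc.logs.length; i++) { emit(doc.logs[i].timestamp, doc.logs[i].message); } } }"
    }
  }
}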
It will then be quite easy to fetch logs with timestamps between a startkey and an endkey, or to meet whatever other query needs you have.
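For example, assuming the placeholder names above, fetching every message logged between one and five seconds is a single GET:

GET /logs/_design/logs/_view/by_timestamp?startkey=1.0&endkey=5.0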
Conclusion
This is still not benchmarked, but my hope is that, for some kinds of data, batching into clumps will reduce the internal B-tree writes. Combined with _bulk_docs, I hope to see insert throughput approach the raw write speed of the disk.