views: 1560
answers: 4

Hi!

I ran some speed tests comparing MongoDB and CouchDB. Only inserts were measured during testing. MongoDB came out about 15x faster than CouchDB. I know this is partly a matter of sockets vs. HTTP, but I am very interested in how I can optimize inserts in CouchDB.

Test platform: Windows XP SP3, 32-bit. I used the latest versions of MongoDB and the MongoDB C# driver, and the latest installation package of CouchDB for Windows.

Thanks!

+3  A: 

I don't think sockets vs. HTTP is the only difference. The gap is also related to disk syncs (fsync), which affect durability. MongoDB stores everything in RAM first and only syncs to disk at certain intervals, unless you explicitly tell it to fsync.

Read about durability and MongoDB: http://blog.mongodb.org/post/381927266/what-about-durability and fsync: http://www.mongodb.org/display/DOCS/fsync+Command
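
For illustration, here is a rough sketch using a current PyMongo client (the collection name and connection details are placeholders); it shows the difference between a fire-and-forget write, a write that waits for the on-disk journal, and the explicit fsync command:

# Sketch only: assumes a local mongod and the PyMongo driver; names are placeholders.
from pymongo import MongoClient, WriteConcern

client = MongoClient("localhost", 27017)
db = client.test

# Fire-and-forget: the driver does not wait for any acknowledgement (w=0).
fast = db.get_collection("logs", write_concern=WriteConcern(w=0))
fast.insert_one({"msg": "fast, but only in memory so far"})

# Acknowledged and journaled: the server confirms the write reached the journal (j=True).
durable = db.get_collection("logs", write_concern=WriteConcern(w=1, j=True))
durable.insert_one({"msg": "slower, but durable"})

# Or flush everything to disk explicitly with the fsync command.
client.admin.command("fsync")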

TTT
@TTT, I forgot about data flushes;) Thank you!
Edward83
+5  A: 

For inserting lots of data into the DB in bulk, CouchDB supports bulk inserts, which are described in the wiki under HTTP Bulk Document API.

Additionally, check out the delayed_commits configuration option and the batch=ok option described in the link above. Those options enable similar memory-caching behavior with periodic syncing to the disk.
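
For reference, a minimal sketch of both options in Python with the requests library, assuming a local CouchDB and a database named speedtest (placeholder names):

# Sketch only: local CouchDB assumed; the database name "speedtest" is made up.
import requests

COUCH = "http://localhost:5984"
DB = "speedtest"

# Bulk insert: many documents in a single HTTP round trip via _bulk_docs.
docs = [{"value": i} for i in range(1000)]
resp = requests.post(f"{COUCH}/{DB}/_bulk_docs", json={"docs": docs})
resp.raise_for_status()

# batch=ok: CouchDB buffers the single write in memory and commits it later,
# trading durability for throughput (the server answers 202 Accepted, not 201 Created).
resp = requests.post(f"{COUCH}/{DB}", params={"batch": "ok"},
                     json={"value": "buffered write"})
print(resp.status_code)  # 202

# delayed_commits lives in the [couchdb] section of the server config and makes
# a similar durability-for-throughput trade server-wide.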

jhs
You're right! It helps very much ;) I tried to store 1M data structs in chunks of 1000, and the speed was extremely high! Thank you guys. Now I have a solution ;)
Edward83
You're welcome! I will add another more controversial answer next.
jhs
Of course, MongoDB also supports bulk inserts... which again make it 15x faster than Couch.
kristina
+1  A: 

Here is an idea I have been thinking about but have not benchmarked. I expect it to work well in certain situations:

  • Insert throughput must be high
  • Fetching individual documents by key is not required
  • All data is fetched through views (possibly a different machine from the one receiving the inserts)

The Plan

Insert batches-of-batches of documents, and use views to serialize them back out nicely.

Example

Consider a log file with a simple timestamp and message string.

0.001 Start
0.123 This could be any message
0.500 Half a second later!
1.000 One second has gone by
2.000 Two seconds has gone by
[...]
1000.000 One thousand seconds has gone by

You might insert logs one message per document, e.g.:

{ "_id": "f30d09ef6a9e405994f13a38a44ee4a1",
  "_rev": "1-764efa883dda1e11db47671c4a3bbd9e",
  "timestamp": 0.123,
  "message": "This could be any message"
}

The standard bulk docs optimization

The first optimization is to insert using _bulk_docs, as described in the CouchDB bulk-docs documentation.

A secondary bulk insert optimization

However, a second optimization is to pre-batch the logs into one larger Couch document. For example, in batches of 4 (in the real world this would be much higher):

{ "_id": "f30d09ef6a9e405994f13a38a44ee4a1",
  "_rev": "1-764efa883dda1e11db47671c4a3bbd9e",
  "logs": [
    {"timestamp": 0.001, "message": "Start"},
    {"timestamp": 0.123, "message": "This could be any message"},
    {"timestamp": 0.500, "message": "Half a second later!"},
    {"timestamp": 1.000, "message": "One second has gone by"}
  ]
}

{ "_id": "74f615379d98d3c3d4b3f3d4ddce82f8",
  "_rev": "1-ea4f43014d555add711ea006efe782da",
  "logs": [
    {"timestamp": 2.000, "message": "Two seconds has gone by"},
    {"timestamp": 3.000, "message": "Three seconds has gone by"},
    {"timestamp": 4.000, "message": "Four seconds has gone by"},
    {"timestamp": 5.000, "message": "Five seconds has gone by"},
  ]
}

Of course, you would insert these via _bulk_docs as well, effectively inserting batches of batches of data.
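
A sketch of that in Python with requests (the database name, clump size, and generated log lines are all invented for illustration):

# Sketch only: database name, clump size, and log lines are made up.
import requests

COUCH = "http://localhost:5984"
DB = "logs"
CLUMP_SIZE = 1000  # 4 in the example above; much higher in practice

log_lines = [(i / 1000.0, "message %d" % i) for i in range(100000)]

# Pre-batch the log entries into clump documents...
clumps = []
for start in range(0, len(log_lines), CLUMP_SIZE):
    chunk = log_lines[start:start + CLUMP_SIZE]
    clumps.append({"logs": [{"timestamp": t, "message": m} for t, m in chunk]})

# ...and send all the clumps in a single _bulk_docs request.
resp = requests.post(f"{COUCH}/{DB}/_bulk_docs", json={"docs": clumps})
resp.raise_for_status()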

Views are still very easy

It is still very easy to serialize the logs back out into a view:

// map
function(doc) {
  if(doc.logs) {
    // Just unroll the log batches!
    for (var i = 0; i < doc.logs.length; i++) {
      var log = doc.logs[i];
      emit(log.timestamp, log.message);
    }
  }
}

It is then quite easy to fetch logs with timestamps between a startkey and endkey, or to satisfy whatever other needs you have.
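
For example, assuming the map function above is stored in a design document _design/logs under a view named by_timestamp (names chosen only for illustration), a time-slice query might look like this:

# Sketch only: design document and view names are assumptions.
import json
import requests

COUCH = "http://localhost:5984"
DB = "logs"

resp = requests.get(
    f"{COUCH}/{DB}/_design/logs/_view/by_timestamp",
    params={"startkey": json.dumps(0.5), "endkey": json.dumps(2.0)},
)
for row in resp.json()["rows"]:
    print(row["key"], row["value"])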

Conclusion

This is still not benchmarked, but my hope is that, for some kinds of data, batching into clumps will reduce the internal B-tree writes. Combined with _bulk_docs, I hope to see insert throughput hit the hardware speeds of the disk.

jhs
Interesting, I was wondering could you write that view definition in Erlang? I've read that Erlang is faster.
TTT
I actually did something like this for test results at Mozilla about a year ago. You definitely bottleneck on view generation. At the time the older spidermonkey was a HUGE bottleneck for large documents so I wrote a new Python view server. http://www.mikealrogers.com/archives/673 I bet with a newer spidermonkey the JSON serialize problem goes away, this would be worth benchmarking again. But remember, the view results are still an on-disc btree and the view isn't available until fsync finishes so it won't be comparable with Redis or MongoDB.
mikeal
Mikeal, yeah I think it would only shine in very specialized situations. For example, logs are dumped into a DB which replicates to, I don't know, maybe a Lounge, which can chew through all the view rows you need. On the one hand, it would be cool to insert at disk speeds; but that would be the ultimate "benchmark" (used pejoratively)
jhs
+7  A: 

Just to expand on the sockets vs. HTTP and fsync vs. in-memory conversation:

By default, MongoDB doesn't return a response on a write call. You just write your data to the socket and assume it's in the DB and available. Under concurrent load this can get backed up, and there isn't a good way to know how fast Mongo really is unless you use the optional call that returns a response for the write once the data is available.

I'm not saying Mongo's insert performance isn't faster than Couch's; inserting into memory is a lot faster than fsyncing to disk. The bigger difference here is in the goals MongoDB and CouchDB have around consistency and durability. But all the "performance" tools I've seen for testing Mongo use the default write API, so you aren't really testing insert performance; you're testing how fast you can flush data to a socket.
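
To make that concrete, here is a rough timing sketch with a current PyMongo client (collection name and document count are arbitrary); with w=0 the timer mostly measures how fast the driver can flush bytes to the socket, while w=1 waits for the server to acknowledge each write:

# Sketch only: local mongod assumed; "benchmarks"/"inserts" are made-up names.
import time
from pymongo import MongoClient, WriteConcern

client = MongoClient("localhost", 27017)
db = client.benchmarks

def timed_inserts(write_concern, n=10000):
    coll = db.get_collection("inserts", write_concern=write_concern)
    start = time.time()
    for i in range(n):
        coll.insert_one({"i": i})
    return time.time() - start

print("unacknowledged (w=0):", timed_inserts(WriteConcern(w=0)))
print("acknowledged   (w=1):", timed_inserts(WriteConcern(w=1)))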

I've seen a lot of benchmarks that show Mongo as faster than Redis and memcached because they fail to realize that Redis and Memcached return a response when the data is in memory and Mongo does not. Mongo definitely is not faster than Redis :)

mikeal