I've been playing around with the samus MongoDB driver, particularly its benchmark tests. From the output, it appears that document size can have a drastic effect on how long operations on those collections take.

[screenshot of the benchmark output]

Is there any documentation that recommends what balance to strive for, or some more "real" numbers on what document size does to query times? Is this poor performance more a result of the driver and its serialization overhead? Has anyone else noticed this?

+1  A: 

I cannot find a link right now, but the on-disk format of the database is such that it should not matter whether a document is large or small. For access via an index there is certainly no difference; for a table scan, uninteresting documents (or uninteresting parts of documents) can be skipped quickly, because BSON prefixes every document and embedded element with its byte length. If anything, the overhead of the BSON format affects tiny documents more than large ones.

So I would assume that the performance drop you see is largely due to the serialization cost of loading those documents (of course it takes more time to write a large document to disk than a small one, but it should take about the same time as multiple small documents of the same aggregate size).

In your benchmark, can you normalize the numbers to be based on the same amount of data (in bytes, not in document count)?
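
For example, a rough mongo shell sketch of that normalization (the collection names and the helper are hypothetical, not part of the benchmark):

```
// Hypothetical helper: report full-scan throughput in MB/s, so collections
// with different document sizes can be compared on equal terms.
function scanThroughputMB(collName) {
    var coll = db.getCollection(collName);
    var totalBytes = coll.stats().size;       // total data size in bytes
    var start = new Date();
    coll.find().forEach(function (doc) {});   // force a full collection scan
    var millis = new Date() - start;
    return {
        collection: collName,
        mbPerSecond: (totalBytes / (1024 * 1024)) / (millis / 1000)
    };
}

// If the two numbers are close, per-document size is not the problem.
scanThroughputMB("smallDocs");
scanThroughputMB("largeDocs");
```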

Thilo
It's just a bad benchmark; no index is created.
TTT
+1  A: 

You can turn on profiling with `db.setProfilingLevel(2)` and query `db.system.profile` for details on the executed queries.

Although this may distort the test results a little, it will give you insight into the query times on the server, eliminating any influence the driver or network may have on the results. If these query times show the same pattern as your test, then the document size does influence query times. If query times are roughly the same regardless of document size, then it's serialization overhead you're looking at.
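
For example, a minimal shell session (collection name and query are placeholders):

```
// Record every operation on the current database.
db.setProfilingLevel(2);

// Run the query you want to measure.
db.mycollection.find({ name: "test" }).toArray();

// Inspect the recorded operations; 'millis' is the server-side execution time.
db.system.profile.find().sort({ ts: -1 }).limit(5);
```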

Niels van der Rest
It's just a bad benchmark; no index is created.
TTT
@TTT: Theoretically, if there *were* indexes, the index would be queried. The documents themselves wouldn't be scanned, eliminating any influence the document size could have. For testing **ad hoc queries**, where the document size could have more impact on performance, the lack of an index is a good thing :)
Niels van der Rest
I believe that even for non-indexed queries, individual document size should make no difference (while total data size of course does). In fact, if anything, scanning 1000 documents that add up to 1 MB should be slower than scanning 1 document of 1 MB.
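A quick way to test this in the shell (hypothetical collection names; each collection holds roughly 1 MB in total):

```
// ~1 KB payload string
var kb = new Array(1025).join("x");

// 1000 small documents vs. 1 large document of the same aggregate size
for (var i = 0; i < 1000; i++) {
    db.manySmall.insert({ payload: kb });
}
db.oneLarge.insert({ payload: new Array(1001).join(kb) });

// Time a full scan of each collection (no indexes involved).
var t1 = new Date();
db.manySmall.find().forEach(function (d) {});
print("manySmall: " + (new Date() - t1) + " ms");

var t2 = new Date();
db.oneLarge.find().forEach(function (d) {});
print("oneLarge: " + (new Date() - t2) + " ms");
```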
Thilo
+1  A: 

But is it a good benchmark? I don't think so. Read http://stackoverflow.com/questions/2460063/2465039#2465039.

I think the exception that occurs when the index should have been created is still swallowed: FindOne() on the medium collection returns 363 with and without the "creation" of the index.
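
One way to check from the mongo shell whether the index actually exists and is used, regardless of what the driver reports (collection name and query are placeholders):

```
// List the indexes the server actually has on the collection.
db.mycollection.getIndexes();

// explain() shows whether a query uses an index: in the old shell output,
// "cursor" reads "BtreeCursor ..." for an indexed query and "BasicCursor"
// for a full scan.
db.mycollection.find({ name: "test" }).explain();
```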

TTT
Well, it should be equally bad for small and big documents (given the same total data size).
Thilo
In fact, since having no index moves more (albeit unnecessary) work onto the server, it would reduce the impact of the driver-side serialization overhead.
Thilo
Thanks for linking to the other post. Looks like that benchmark is bad. I'll write my own eventually.
Ty
-1 The question is not about query times and indexes, but about query times and **document size**. Try reading the question as *Will document size influence query times when querying non-indexed fields?*, instead of fixating on the error in the benchmark test. I have run the benchmark *with* proper indexes and it still shows a performance hit for large documents. This is probably serialization overhead. My answer tells you how to know for sure.
Niels van der Rest
You run a benchmark because you want to know whether a certain system is fast enough for your needs, and without indexes (when you think they are there) you get skewed results. With indexes in place, the server has more time for other work.
TTT