I've tested NoSQL databases such as CouchDB, MongoDB and Cassandra and observed a tendency to consume a very large amount of disk space relative to the inserted key-value pairs. When comparing CouchDB with a schemaless MySQL setup, CouchDB consumes much more disk space than MySQL. I know that key-value DBs keep versions by default, use long UUIDs and need key optimization. The comparison was between about 15 million rows in MySQL and 1-5 million documents in the listed NoSQL DBs.

My question is: is there any NoSQL database with good compaction / compression of data, so that I can have a NoSQL database with a size closer to 5 GB than 50 GB?

+1  A: 

Are you checking the "file length" or the actual allocation size?

Many databases sparsely allocate file structures and their "length" is much larger than their on-disk size.
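One way to check the difference is to compare the apparent length of a file with the space actually allocated for it. The following is a minimal sketch, assuming a POSIX system where `os.stat` reports `st_blocks` in 512-byte units (it is not available on Windows); the helper names `file_sizes` and `make_sparse_file` are made up for illustration and are not part of any database tool.

```python
import os
import tempfile


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for the file at path."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on POSIX systems.
    allocated = st.st_blocks * 512
    return st.st_size, allocated


def make_sparse_file(size_bytes):
    """Create a temporary file of the given length without writing any data."""
    fd, path = tempfile.mkstemp()
    try:
        # Extending a file with ftruncate usually leaves a "hole" on
        # filesystems that support sparse files.
        os.ftruncate(fd, size_bytes)
    finally:
        os.close(fd)
    return path


if __name__ == "__main__":
    path = make_sparse_file(10 * 1024 * 1024)
    apparent, allocated = file_sizes(path)
    print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
    os.remove(path)
```

Pointing `file_sizes` at a database data file shows whether its reported length overstates the space it really occupies; on filesystems without sparse-file support the two numbers will be close.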

Joe Koberg
I checked that too. The file buffer isn't that big, so I don't even consider it for a DB with around 15 million documents (even if it amounts to a few GB). I think this "space hunger" is a weakness of schemaless DBs, but I'm not sure.
inquisitor
+3  A: 

Disk space is about the cheapest resource today, so if you can trade it for fewer seeks or less CPU usage, it is a good trade to make. That is what Cassandra does.

jbellis
+1  A: 

MongoDB has a "database repair" function that also performs a compaction. However, such a compaction is not going to happen while the DB is running.

But if DB space is a serious issue, then try setting up a MongoDB master/slave pair. As the data needs compaction, run the repair on the slave, allow it to "catch up" and then switch them over. You can now safely compact the master instead.

But I have to echo jbellis's comment: you will probably need more space and most of these products are making the assumption that disk space is (relatively) cheap. If disk space is really tight, then you'll find that MongoDB is reasonably sized, but it's going to have a difficult time competing with tabular CSV data.

Think of it this way, what's more space efficient?

  • a CSV file with a million lines
  • that same data formatted in JSON

Obviously the JSON is going to be longer because you're repeating the field names every time. The only exception is a CSV file with something like 100 columns, of which only a few are filled for each row (but that's probably not your data).
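The comparison can be tried on synthetic data. This is only a rough sketch: the field names `id`, `name` and `value` and the helper functions are invented for the example, and real documents (and MongoDB's BSON encoding) will give different ratios.

```python
import csv
import io
import json


def make_rows(n):
    """Build n small records with the same three fields."""
    return [{"id": i, "name": f"user{i}", "value": i * 2} for i in range(n)]


def csv_size(rows):
    """Size in bytes of the rows written as CSV with one header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def json_size(rows):
    """Size in bytes of the rows written as one JSON object per line."""
    # +1 accounts for the newline after each document.
    return sum(len(json.dumps(row).encode("utf-8")) + 1 for row in rows)


if __name__ == "__main__":
    rows = make_rows(1_000_000)
    print("CSV bytes: ", csv_size(rows))
    print("JSON bytes:", json_size(rows))
```

With short values like these, the repeated field names account for most of each JSON line, which is why shortening field names is a common way to save space in document stores.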

Gates VP
That's true: if you use long field names, you need more disk space when using MongoDB. And MongoDB preallocates files of 2 gigabytes.
TTT
Yes, CouchDB has a "compact" option too, which in my test reduced the DB size several times over (Cassandra does this "in the background" thanks to better-organized bulk writes).
inquisitor