Is Cassandra suitable enough for storing logs in term of disk space usage?

I guess you mean one row (with four columns) inside your column family? The "overhead" associated with each column is a a long (timestamp, 64 bits) and a byte[] (column name, max 64 kb). So 4x disk usage seems a little bit weird. Are you doing any deletes? Be sure to understand how deletes are done in a distributed, eventually consistent system.

Be sure to read about "compactions" also. ("Once compaction is finished, the old SSTable files may be deleted")

Would also like to remind you of a Thrift limitation regarding how streaming is done.

Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values. (From the 'Cassandra Limitations' page on the wiki)

Schildmeijer, actually I was wrong with Cassandra disk space usage when I submitted my question (you are right, I didn't run compaction). So here are true figures (I also updated original question):- Original log size is 224Mb- Cassandra data file is 557Mb- Cassandra index file is 109MbI'm not doing any deletes. I put every log line into Cassandra separately and the longest line is about 1kb.Still 2x overhead is somewhat big for my purpose storing longs - is there any way to optimize that?Thanks!

sha1dy 2010-06-29 07:55:15

ansaurus

tags:

views:

answers:

Is Cassandra suitable enough for storing logs in term of disk space usage?

related questions