I have a problem storing 50 GB of logs each day in a distributed environment. I looked at Hadoop HDFS, but because it has problems running on Windows infrastructure and lacks a multi-language filesystem API, it doesn't suit me very well. Cassandra, on the other hand, is very easy to deploy on any platform. The only big problem I'm facing is disk space usage. Here are the figures:

  • Original log size is 224 MB
  • Cassandra data file is 557 MB
  • Cassandra index file is 109 MB

So I got almost 2x overhead when storing log lines from a log file.
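
For reference, here is the back-of-the-envelope arithmetic behind that figure, using only the sizes listed above:

    original_mb = 224                              # raw log file
    data_file_mb = 557                             # Cassandra data (SSTable) file
    index_file_mb = 109                            # Cassandra index file

    total_on_disk = data_file_mb + index_file_mb   # 666 MB on disk
    extra = total_on_disk - original_mb            # 442 MB more than the raw logs
    print(extra / float(original_mb))              # ~1.97, i.e. almost 2x overhead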

Is it possible to tune Cassandra in some way so it won't eat so much disk space for very simple scenarios?

+2  A: 

I guess you mean one row (with four columns) inside your column family? The "overhead" associated with each column is a long (the timestamp, 64 bits) and a byte[] (the column name, max 64 KB). So 4x disk usage seems a little bit weird. Are you doing any deletes? Be sure to understand how deletes are done in a distributed, eventually consistent system.
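
To put rough numbers on that, here is a quick sketch. The 15-byte fixed per-column cost (length fields, flags, the 8-byte timestamp) is an assumption for illustration only; the exact on-disk layout depends on the Cassandra version:

    # Rough per-column size estimate; PER_COLUMN_METADATA is an assumed figure,
    # not a measured one.
    PER_COLUMN_METADATA = 15

    def column_disk_size(name, value):
        """Approximate bytes one column occupies in the data file."""
        return len(name) + len(value) + PER_COLUMN_METADATA

    # A 1 KB log line stored under a timestamp-like column name:
    line = "x" * 1024
    print(column_disk_size("1273161832123", line))   # ~1052 bytes, under 3% overhead

So the per-column metadata alone should stay in the low single-digit percent range for values of that size.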

Be sure to read about "compactions" also. ("Once compaction is finished, the old SSTable files may be deleted")

I would also like to remind you of a Thrift limitation regarding how streaming is done.

Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values. (From the 'Cassandra Limitations' page on the wiki)
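
A minimal sketch of that chunking workaround, assuming a generic store_column(row_key, column_name, value) callable in place of a real Thrift client; the chunk size and the "chunk-N" column naming are arbitrary choices here:

    CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB, the size mentioned on the wiki

    def store_file_in_chunks(path, store_column):
        """One file corresponds to one row; each chunk becomes one column value.

        store_column(row_key, column_name, value) stands in for whatever
        client call actually writes a single column.
        """
        row_key = path
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                store_column(row_key, "chunk-%06d" % index, chunk)
                index += 1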

Schildmeijer
Schildmeijer, actually I was wrong about Cassandra's disk space usage when I submitted my question (you are right, I hadn't run compaction). So here are the true figures (I also updated the original question):

  • Original log size is 224 MB
  • Cassandra data file is 557 MB
  • Cassandra index file is 109 MB

I'm not doing any deletes. I put every log line into Cassandra separately, and the longest line is about 1 KB. Still, 2x overhead is somewhat big for my purpose of storing logs. Is there any way to optimize that? Thanks!
sha1dy