I have a problem: I need to store 50 GB of logs each day in a distributed environment. I looked at Hadoop HDFS, but it doesn't suit me well because of its problems running on Windows infrastructure and its lack of a multi-language filesystem API. Cassandra, on the other hand, is very easy to deploy on any platform. The only big problem I'm facing is disk space usage. Here are the figures:
- Original log size: 224 MB
- Cassandra data file: 557 MB
- Cassandra index file: 109 MB
So I get almost 2x overhead (roughly 3x the original size in total) when storing log lines from a log file.
Is it possible to tune Cassandra in some way so it won't eat so much disk space for such a simple scenario?
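For context, here is a minimal sketch of the kind of schema I'm using (table and column names are illustrative, not my exact setup), with SSTable compression enabled as one knob I've been looking at:

```sql
-- Illustrative CQL schema for storing log lines, one row per line.
-- The compression option below is the standard CQL table option;
-- whether it helps enough for this workload is exactly my question.
CREATE TABLE IF NOT EXISTS logs.entries (
    source    text,        -- e.g. hostname or application name
    day       text,        -- partition by day to bound partition size
    ts        timeuuid,    -- ordering within a day
    line      text,        -- the raw log line
    PRIMARY KEY ((source, day), ts)
) WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 64
};
```

My understanding is that Cassandra stores per-cell metadata (timestamps etc.) alongside the values, which may account for part of the overhead, but I'd like to know what else can be tuned.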