views: 5756
answers: 4

And what are the pitfalls to avoid? Are there any deal breakers for you? For example, I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that would hinder syncing production data to a development environment.

BTW, it's very hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.

+18  A: 

For me, the main thing is deciding whether to use the OrderedPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you've got a lot of churn, then unless you have some magic way of knowing exactly which keys you've inserted data for, the random partitioner makes it easy to "lose" stuff, which causes a disk space leak and will eventually consume all storage.

On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" - and it'll tell you. You can then clean them up.

However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.

I don't have any easy answer for this, except that in some cases you can get the "best of both worlds" by putting a short hash (of something you can easily enumerate from other data sources) at the beginning of your keys - for example, a 16-bit hash of the user ID, which gives you 4 hex digits, followed by whatever key you really wanted to use.

Then if you had a list of recently-deleted users, you can just hash their IDs and range scan to clean up anything related to them.
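The prefix-hash scheme above can be sketched like this (Python, with a plain sorted dict standing in for an ordered column family - `make_key` and the toy store are illustrative, not a real Cassandra client API):

```python
import hashlib

def make_key(user_id: str, suffix: str) -> str:
    """Prefix the real key with a 16-bit (4 hex digit) hash of the user ID."""
    prefix = hashlib.md5(user_id.encode()).hexdigest()[:4]
    return f"{prefix}:{user_id}:{suffix}"

# Toy "column family": a dict scanned in sorted order stands in for
# Cassandra's OrderedPartitioner, which keeps keys sorted on disk.
store = {}
store[make_key("alice", "profile")] = {"name": "Alice"}
store[make_key("alice", "settings")] = {"theme": "dark"}
store[make_key("bob", "profile")] = {"name": "Bob"}

def cleanup_user(user_id: str):
    """Recompute the user's hash prefix, range-scan, and delete matches."""
    prefix = hashlib.md5(user_id.encode()).hexdigest()[:4] + f":{user_id}:"
    doomed = [k for k in sorted(store) if k.startswith(prefix)]
    for k in doomed:
        del store[k]
    return doomed

deleted = cleanup_user("alice")
# Both of alice's rows are removed; bob's row stays.
```

The point is that the prefix spreads users across the token range while still letting you reconstruct the exact key range for any one user you can enumerate.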

The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or keep a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to no longer exists, but there's no easy way of querying on this basis, so your app needs to Just Remember.
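A toy in-memory model of that manual "insert under both keys" pattern (the names and dicts here are stand-ins for column families, not real APIs):

```python
# Primary "column family" and a manual secondary index (a pointer table).
users_by_id = {}
ids_by_email = {}

def create_user(user_id, email, data):
    users_by_id[user_id] = {"email": email, **data}
    ids_by_email[email] = user_id  # second write: the app must do this itself

def get_user_by_email(email):
    """Look up X (user) by Y (email) via the pointer."""
    user_id = ids_by_email.get(email)
    return users_by_id.get(user_id) if user_id is not None else None

def delete_user(user_id):
    # The app must also clean up the pointer, or it becomes orphaned.
    row = users_by_id.pop(user_id, None)
    if row:
        ids_by_email.pop(row["email"], None)

create_user("u1", "a@example.com", {"name": "Ann"})
found = get_user_by_email("a@example.com")
```

Forgetting the pointer cleanup in `delete_user` is exactly how the orphaned keys described below accumulate.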

And application bugs may leave orphaned keys that you've forgotten about, with no easy way of detecting them, unless you write a garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
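Such a chunked scan-and-sweep could look roughly like this (again a toy sketch: the dict and the `is_live` liveness check stand in for your real store and your real "is this key still referenced?" logic):

```python
def scan_chunks(keys, chunk_size):
    """Yield successive fixed-size pages of a sorted key list."""
    for i in range(0, len(keys), chunk_size):
        yield keys[i:i + chunk_size]

def collect_garbage(rows, is_live, chunk_size=2):
    """Walk the key space a chunk at a time, dropping dead keys."""
    removed = []
    for chunk in scan_chunks(sorted(rows), chunk_size):
        for key in chunk:
            if not is_live(key):
                del rows[key]
                removed.append(key)
    return removed

live = {"k1", "k3"}
rows = {"k1": 1, "k2": 2, "k3": 3, "k4": 4}
gone = collect_garbage(rows, lambda k: k in live)
# → gone == ["k2", "k4"]; rows keeps k1 and k3
```

Chunking matters because in a real cluster a full key scan is slow; paging lets the collector run incrementally instead of in one enormous pass.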

None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.

EDIT: Cassandra now does have secondary indexes in trunk.

MarkR
Very informative, many thanks.
Jerry
I thought that the 'automatic load balancing' issue raised above is important enough to warrant its own thread... which I started at http://stackoverflow.com/questions/1767789/cassandra-load-balancing thanks
deepblue
0.5 does do semiautomatic load balancing. (The "semi" means an operator has to request it, but then Cassandra takes care of the rest.) 0.5 beta2 was released last week and an RC is coming soon.
jbellis
+5  A: 

Another tutorial is here: http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/.

Alice
+3  A: 

Are there any deal breakers for you? Not necessarily deal breakers, but some things to be aware of:

  1. A client connects to the nearest node, whose address it must know beforehand; all communication with all other Cassandra nodes is proxied through it. a. Read/write traffic is not evenly distributed among nodes - some nodes proxy more data than they host themselves. b. Should that node go down, the client is helpless: it can't read or write anywhere in the cluster.

  2. Although Cassandra claims that "writes never fail", they do fail, at least as of this writing. Should the target data node become sluggish, the request times out and the write fails. There are many reasons for a node to become unresponsive: the garbage collector kicks in, a compaction process runs, whatever... In all such cases the read/write request fails. In a conventional database these requests would have become proportionally slow, but in Cassandra they just fail.

  3. There is multi-get, but there is no multi-delete, and one can't truncate a ColumnFamily either.

  4. Should a new, empty data node enter the cluster, only a portion of the data from one of its neighbor nodes on the key ring will be transferred. This leads to uneven data distribution and uneven load. You can fix it by always doubling the number of nodes. One should also keep track of tokens manually and select them wisely.

+5  A: 

This was too long to add as a comment, so to clear up some misconceptions from the list-of-problems reply:

  1. Any client may connect to any node; if the first node you pick (or the one you connect to via a load balancer) goes down, simply connect to another. Additionally, a "fat client" API is available whereby the client can direct the writes itself; an example is at http://wiki.apache.org/cassandra/ClientExamples
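Point 1 amounts to simple client-side failover over a node list; a minimal sketch (the `connect` function and node addresses here are placeholders, not a real Thrift/client API):

```python
class NodeDown(Exception):
    pass

def connect(node, down_nodes):
    """Stand-in for a real client connect call; fails if the node is down."""
    if node in down_nodes:
        raise NodeDown(node)
    return f"session:{node}"

def connect_any(nodes, down_nodes=frozenset()):
    """Try each known node in turn until one accepts the connection."""
    last_err = None
    for node in nodes:
        try:
            return connect(node, down_nodes)
        except NodeDown as e:
            last_err = e  # this node is down; move on to the next
    raise last_err

session = connect_any(["10.0.0.1", "10.0.0.2", "10.0.0.3"],
                      down_nodes={"10.0.0.1"})
# → "session:10.0.0.2"
```

Real client libraries typically wrap this retry loop for you, but the principle is the same: losing one coordinator node doesn't strand the client.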

  2. Timing out when a server is unresponsive, rather than hanging indefinitely, is a feature that most people who have dealt with overloaded RDBMS systems have wished for. The Cassandra RPC timeout is configurable; if you wish, you are free to set it to several days and deal with hanging indefinitely instead. :)

  3. It is true that there is no multi-delete or truncation support yet, but there are patches for both in review.

  4. There is obviously a tradeoff in keeping load balanced across cluster nodes: the more perfectly balanced you try to keep things, the more data movement you will do, which is not free. By default, new nodes in a Cassandra cluster will move to the optimal position in the token ring to minimize unevenness. In practice, this has been shown to work well, and the larger your cluster is, the less true it is that doubling is optimal. This is covered more at http://wiki.apache.org/cassandra/Operations

jbellis