I'm in the process of spiking a conversion from MySQL to Cassandra for PenWag.com. In Cassandra, I'm storing Users keyed off of a GUID, but users sign in with their email, not the GUID (obviously). Using a GUID as the key for Users makes more sense to me than using email, for two reasons. From a practical perspective, it seems too cumbersome to change the key, since that means deleting and re-adding a row with all of its SuperColumns. From a theoretical standpoint, it's still the same user, so why should their key change?

Nevertheless, here's my question: I'm building an index in a separate ColumnFamily, mapping email->GUID to support login. It's a Standard type CF, where the column name is the email and the value is the GUID. It's Standard, not Super, to avoid loading an entire SC for every mapping. Supporting "change email" is easy: it's just a column delete/add. But it seems that an alternative is to store the index as rows instead of columns, where the row key is the email and a single column holds the GUID. Delete/add on those rows would not be cumbersome, since there's only one column (the GUID) to manage.
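To make the two layouts concrete, here is a minimal sketch that models a ColumnFamily as nested Python dictionaries (row key -> column name -> value). It is a toy in-memory model, not a Cassandra client; the row key `email_index`, the column name `guid`, and all function names are made up for illustration.

```python
# Toy in-memory model of a ColumnFamily: {row_key: {column_name: value}}.
# This is NOT a Cassandra client; every name here is hypothetical.

INDEX_ROW = "email_index"  # assumed row key for the single wide row


# Layout A: one wide row, one column per email (column name = email).
def wide_put(cf, email, guid):
    cf.setdefault(INDEX_ROW, {})[email] = guid


def wide_get(cf, email):
    return cf.get(INDEX_ROW, {}).get(email)


def wide_change_email(cf, old_email, new_email):
    # "Change email" is a column delete followed by a column add.
    guid = cf[INDEX_ROW].pop(old_email)
    cf[INDEX_ROW][new_email] = guid


# Layout B: one row per email (row key = email), single "guid" column.
def row_put(cf, email, guid):
    cf[email] = {"guid": guid}


def row_get(cf, email):
    row = cf.get(email)
    return row["guid"] if row else None


def row_change_email(cf, old_email, new_email):
    # "Change email" is a row delete followed by a row add.
    cf[new_email] = cf.pop(old_email)
```

In both layouts a login lookup is a single read and "change email" is one delete plus one add; the difference is whether all mappings live under one row key or under many.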

It seems that either approach works. What are the pros and cons of each? Is there a best practice?

+1  A: 

Since I have no hands-on experience with Cassandra or similar databases, you'll need to take my answer with a grain of salt :)

If you store each mapping as a column, using the email address as the column name, you end up with a single row containing an enormous number of columns. According to Wikipedia[1]:

Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.

This could result in significant locking overhead if all mappings are stored in a single row.

The Cassandra Wiki states[2]:

The row key is what determines what machine data is stored on.

This makes me believe that it's more efficient to do lookups based on row key than on column name. Based on this information, I would suggest using the email address as the row key and storing the GUID in a column.
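As a rough illustration of why the row key matters for placement, the sketch below hashes row keys with MD5 and maps them onto a made-up three-node cluster. This is a deliberate simplification: real Cassandra assigns rows via a partitioner and a token ring, with replication on top, and the node names, the modulo mapping, and the `email_index` row key are all assumptions for the example.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(row_key: str) -> str:
    """Pick a node from the MD5 hash of the row key (simplified placement)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


emails = [f"user{i}@example.com" for i in range(300)]

# Layout A: every mapping is a column in the same row, so every lookup
# goes to whichever node owns that one row key.
wide_row_placement = Counter(node_for("email_index") for _ in emails)

# Layout B: each email is its own row key, so the mappings are spread
# over the nodes according to their hashes.
per_row_placement = Counter(node_for(email) for email in emails)

print("single wide row:", dict(wide_row_placement))
print("one row per email:", dict(per_row_placement))
```

Under these assumptions, the single wide row puts all 300 mappings on one node, while one row per email distributes them across the cluster.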

Niels van der Rest
I too would use their email address as the key and GUID as column. So you have one column and many rows. That way the data can be spread across the cluster.
Zanson