I've seen two contradictory pieces of advice when it comes to designing row keys in HBase (specifically, though I think it applies to Cassandra as well):

  1. Group keys that you'll often be aggregating together, to take advantage of data locality. (White, Hadoop: The Definitive Guide; I also recall seeing it on the HBase site, but can't find it...)
  2. Spread keys around so that work can be distributed across multiple machines. (Twitter's "Pig and HBase at Twitter", slide 14)

I'm guessing the optimal choice depends on your use case, but does anyone have experience with either strategy?

A:

In HBase, a table is partitioned into regions by dividing up the key space, which is sorted lexicographically. Each region of the table belongs to a single region server, so all reads and writes to that region are handled by that server (which allows for a strong consistency guarantee). This means that if all of your reads or writes are concentrated on a small range of your keyspace, you will only be able to scale to what a single region server can handle. For example, if your data is a time series keyed by timestamp, all writes go to the last region in the table, and you will be constrained to the write rate that a single server can handle.
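As an aside, a common way to relieve that kind of write hotspot is to prefix the key with a small, deterministic "salt" bucket, so that sequential writes fan out across regions while rows for one logical source still sort together. This is only a minimal sketch of the general technique; the bucket count, separator, and key layout below are illustrative choices, not anything HBase prescribes:

    import java.nio.charset.StandardCharsets;

    public class SaltedKey {
        // Roughly match the bucket count to the number of region servers.
        // (Hypothetical value, chosen for illustration.)
        private static final int NUM_BUCKETS = 16;

        // Prefix a timestamp-based key with a deterministic bucket derived
        // from a stable attribute (here, a source id), so sequential writes
        // spread across the keyspace instead of all landing in the last
        // region, while one source's rows still sort contiguously.
        public static byte[] saltedRowKey(String source, long timestampMillis) {
            int bucket = Math.floorMod(source.hashCode(), NUM_BUCKETS);
            String key = String.format("%02d|%s|%013d", bucket, source, timestampMillis);
            return key.getBytes(StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            // Prints something like "10|sensor-42|1288823400000"
            System.out.println(new String(
                    saltedRowKey("sensor-42", System.currentTimeMillis()),
                    StandardCharsets.UTF_8));
        }
    }

The trade-off is that a scan over all sources for a time range now has to issue one scan per bucket, which is exactly the tension the question is asking about.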

On the other hand, if you can choose your keys such that any given query only needs to scan a small range of rows, while the overall set of reads and writes is spread across your keyspace, then the total load will be distributed and scale nicely, and you still enjoy the locality benefits for each individual query.
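To make that concrete, here is a sketch of a composite key along those lines (the userId field, separator, and fixed widths are hypothetical, chosen only for illustration): keying rows as entity plus reversed timestamp keeps each entity's rows contiguous, so a per-entity query scans one small range, while distinct entities are spread across the keyspace.

    import java.nio.charset.StandardCharsets;

    public class CompositeKey {
        // Build a row key of the form "<userId>|<reversed timestamp>".
        // All rows for one user sort together (locality for a single query),
        // while different users spread across the keyspace (distributed load).
        // Subtracting from Long.MAX_VALUE makes the newest rows sort first;
        // zero-padding to 19 digits keeps lexicographic order consistent
        // with numeric order.
        public static byte[] rowKey(String userId, long timestampMillis) {
            long reversed = Long.MAX_VALUE - timestampMillis;
            String key = String.format("%s|%019d", userId, reversed);
            return key.getBytes(StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            // A query for one user's recent rows scans only the contiguous
            // block of keys prefixed with "user123|", a small slice of the
            // table, even though other users' rows live in other regions.
            System.out.println(new String(
                    rowKey("user123", System.currentTimeMillis()),
                    StandardCharsets.UTF_8));
        }
    }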

Dave L.