The HBase documentation makes it clear that you should group similar columns into column families, because the physical storage is done by column family.

But what does it mean to put two column families into the same table, as opposed to having separate tables per column group? Are there specific cases when "partitioning" tables this way makes more sense, and cases when one "wide" table works better?

Separate tables should result in separate "row regions", which could be beneficial when some column families (as a whole) are very sparse. Conversely, when would it be advantageous to have column families bunched together?

+1  A: 

You've got the idea of column families right on: basically it's just a hint to HBase to store and replicate these items together for faster access.

If you put two column families in the same table and always use different keys to access them, then it's really the same thing as having them in two separate tables. You only gain from having two column families in the same table when they are accessed via the same keys.

For example: if I have columns for the total number of pageviews for a given web site, the number of unique views for the same site, the browser the user uses to view the site, and their internet connection, I can decide that I want the first two to be a column family and the last two to be another column family. Here all four are accessed by the same key, namely the web site in question, so I'm gaining by having them in the same table.

If they were in different tables, I would end up having to do a join-like operation across the two tables myself, since HBase is non-relational and, as far as I recall, has no built-in join. I don't know the numbers, so I can't tell you how slow that join-like operation is, or where the tipping point lies at which splitting them into separate tables outweighs keeping them in the same table (or vice versa).
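To make the comparison concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase client API: the `get` helper, the family names `counts` and `client`, and the sample values are all made up for illustration. It only shows the shape of the two access patterns, not their real cost.

```python
# Toy in-memory model, NOT the HBase client API.
# A "table" maps column family -> row key -> {qualifier: value}.
# All names and values below are hypothetical.

def get(table, row_key, families):
    """Fetch the named column families for one row key in a single call."""
    return {fam: dict(table[fam].get(row_key, {})) for fam in families}

# One table, two column families, both keyed by the web site.
site_stats = {
    "counts": {"example.com": {"pageviews": 1200, "uniques": 300}},
    "client": {"example.com": {"browser": "Firefox", "connection": "dsl"}},
}

# Same-table case: one get that names both families.
row = get(site_stats, "example.com", ["counts", "client"])

# Two-table case: each table holds one family, so the client has to
# issue two gets and merge the results itself (the "join-like" step).
counts_table = {"counts": site_stats["counts"]}
client_table = {"client": site_stats["client"]}

def client_side_join(row_key):
    """Combine the rows for one key from the two separate tables."""
    merged = {}
    merged.update(get(counts_table, row_key, ["counts"]))
    merged.update(get(client_table, row_key, ["client"]))
    return merged

joined = client_side_join("example.com")
```

Both paths end up with the same data; the difference is that the second one needs two lookups plus merge code that you write and maintain yourself.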

Of course, this all depends on the data you're trying to store, so if you would never need to join across the tables, you would want to keep them in separate tables since you could argue they're not that related to each other in the first place.

Chris Bunch
You say "Join is expensive". That seems to imply that a "join" between column groups within the same table is cheaper than a join of column groups across tables. Is that the case? The HBase docs do not make that clear, I think.
Thilo
I would think it's much cheaper to do a 'join' between columns in the same table since it's just a 'get' operation with the two columns named and is a primitive in the query language. 'Join', however, isn't a primitive and you'd need to implement it on your own (which takes more operations).
Chris Bunch
+1  A: 

Column families are a compromise between row-oriented vs. column-oriented access. To extend Chris's web page example, a row access would fetch all data (columns) for a single web site. An example of a column-oriented operation would be to sum the number of page views across all sites.

The latter operation does not require the browser and connection details, which are much larger than the numeric values for view counts and would significantly affect query performance. Therefore, HBase provides column families as an optimisation that supports column operations.
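A rough sketch of that column-oriented operation, again as a toy Python model rather than real HBase code. The family names `counts` and `client`, the `sum_column` helper, and the sample rows are assumptions for illustration. Because each family is kept in its own store, the sum only walks the small numeric family and never touches the bulky one.

```python
# Toy model, NOT the HBase client API: each column family is kept in
# its own store, keyed by row key. Names and data are hypothetical.
families = {
    "counts": {
        "a.example": {"pageviews": 100, "uniques": 40},
        "b.example": {"pageviews": 250, "uniques": 90},
    },
    "client": {
        "a.example": {"browser": "long user-agent string", "connection": "dsl"},
        "b.example": {"browser": "another long user-agent string", "connection": "cable"},
    },
}

def sum_column(store, family, qualifier):
    """Sum one qualifier across all rows, reading only the given family."""
    return sum(cols.get(qualifier, 0) for cols in store[family].values())

# Only the "counts" store is read; "client" is never accessed.
total_views = sum_column(families, "counts", "pageviews")
```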

As to whether or not the columns should be in the same table... I would just follow normal data modelling guidelines and put all the columns in the same table if they are attributes of the same entity. Column families are about performance, not schema.

Greg Cottman