Q:

HBase schema help

Coming from a SQL Server background, I'm a newbie with regard to HBase, but the technology looks to be a good fit for what we're doing and the cost is definitely right!

I need to maintain a list of log entries, which I would normally create in an RDBMS as:

create table Log ( UserID int, SiteID int, Page varchar(50), Date smalldatetime )

where one user may have anywhere from 0 to 1000 rows in this simple table. Typical queries would be to find all the rows for one user, or all the rows for one user on one site.

How does this translate into a "map" in HBase where there is no "row key" and the same (SiteID, Page) may appear many times? My first thought is that UserID is the row key, but I still don't understand "column families" and the other terminology well enough to see how to set up the table to hold this data, where one UserID can have many (SiteID, Page, Date) "rows".

Any direction is appreciated!

A: 

One approach would be to make compound row keys out of userid+siteid.

Configure the table (specifically, the column family) to keep however many log entries you want in the Page cell of each row, and store each page view as a new version of that cell (manually setting the timestamp if necessary).

Since HBase maintains timestamps for each cell, you don't need a separate column for the access time.

You would thus have a table with contents something like

Row             Page

user1:site1     www.example.com/index.html@1234567890
                www.example.com/somepage.html@123456800
                www.example.com/someotherpage.html@123456900
                www.example.com/index.html@123457123

user1:site2     blahblah

user2:site1     etc...
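The layout above can be pictured as a nested map: row key -> column -> {timestamp: value}. The following is only a sketch of that data model in plain Python, not the HBase client API; the `VersionedTable` class, the `log:page` column name, and the `max_versions` parameter are hypothetical names chosen for illustration.

```python
class VersionedTable:
    """Toy model of a versioned table: row -> column -> {timestamp: value}."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row, column, value, timestamp):
        cell = self.rows.setdefault(row, {}).setdefault(column, {})
        # Same row + column + timestamp overwrites the earlier value.
        cell[timestamp] = value
        # Trim eagerly to the newest max_versions entries (a simplification).
        for old in sorted(cell, reverse=True)[self.max_versions:]:
            del cell[old]

    def versions(self, row, column):
        """Return (timestamp, value) pairs for one cell, newest first."""
        cell = self.rows.get(row, {}).get(column, {})
        return sorted(cell.items(), reverse=True)


table = VersionedTable(max_versions=3)
table.put("user1:site1", "log:page", "www.example.com/index.html", 100)
table.put("user1:site1", "log:page", "www.example.com/somepage.html", 200)
table.put("user1:site1", "log:page", "www.example.com/index.html", 300)

for timestamp, page in table.versions("user1:site1", "log:page"):
    print(timestamp, page)
```

One consequence worth noticing in this model: two page views written with the same timestamp to the same row and column collapse into one entry, so the timestamp resolution has to be fine enough to keep visits apart.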

To deal with your two example requests:

To find all rows for a user, do a scan (be sure to set maxVersions) from userx:0 to userx+1:0, and then parse the site IDs out of each result row.

To get all pages for a specific user/site, just do a scan from userx:sitex to userx:sitex+1. Last I checked, you can't set maxVersions on a get, so that isn't an option.
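Both scans rely on row keys being sorted as byte strings, so "userx+1" only works as an upper bound if the IDs are encoded so that string order matches numeric order. A minimal sketch of the two ranges, using a sorted Python list in place of the table: the helper names and the 10-digit zero padding are assumptions for illustration (non-negative IDs assumed), and the stop key is treated as exclusive.

```python
from bisect import bisect_left


def row_key(user_id, site_id):
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{user_id:010d}:{site_id:010d}"


def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), like a start/stop-row scan."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def user_range(user_id):
    # userx:0 up to (but excluding) userx+1:0
    return row_key(user_id, 0), row_key(user_id + 1, 0)


def user_site_range(user_id, site_id):
    # userx:sitex up to (but excluding) userx:sitex+1
    return row_key(user_id, site_id), row_key(user_id, site_id + 1)


keys = sorted(row_key(u, s) for u, s in [(1, 1), (1, 2), (2, 1), (10, 1)])

user1_rows = scan(keys, *user_range(1))
# Parse the site IDs back out of each row key.
user1_sites = [int(key.split(":")[1]) for key in user1_rows]

user1_site2_rows = scan(keys, *user_site_range(1, 2))
```

Without the padding, a key such as `10:1` would sort before `2:1`, and the scan for user 1 would pick up rows belonging to user 10.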

To put it simply, column families represent groups of data that you want stored together, presumably because you will often be reading them at the same time. Placing columns in separate families results in the data being stored separately, so you get faster reads when you only want one column, but you need to read two different places to get both columns.
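As a rough mental model of that trade-off (plain Python, not HBase code), each column family can be pictured as its own store, and a read touches one store per family it asks for. The `stores` layout, the `read` helper, and the `meta:agent` column are hypothetical and exist only for this illustration.

```python
# Each column family is modelled as a separate store keyed by row.
stores = {
    "log": {"user1:site1": {"log:page": "www.example.com/index.html"}},
    "meta": {"user1:site1": {"meta:agent": "some-browser"}},
}


def read(row, columns):
    """Fetch columns for a row; return the values and the families touched."""
    touched = set()
    result = {}
    for column in columns:
        family = column.split(":", 1)[0]
        touched.add(family)
        result[column] = stores[family].get(row, {}).get(column)
    return result, sorted(touched)


_, one_family = read("user1:site1", ["log:page"])
_, two_families = read("user1:site1", ["log:page", "meta:agent"])
print(one_family)    # families touched when reading a single column
print(two_families)  # families touched when reading both columns
```

For the log table in the question there is only one kind of data, so a single column family is likely enough.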

Of course, depending on your other needs, you may want to take a different approach. I would strongly recommend reading the Bigtable paper to better understand the structure of HBase (since it is strongly based on Bigtable).

To better understand the internals of HBase, Lars George's blog is also great.

juhanic