ansaurus

Question

HBase schema help

Answer 1

A:

One approach would be to build compound row keys from your user ID and site ID, for example userid:siteid.
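As a rough sketch of the compound key idea (Python is used purely for illustration; the helper names make_row_key and parse_row_key and the ':' separator are my own assumptions, not part of any HBase API):

```python
SEP = ":"  # assumed separator; it must never appear inside a user or site id


def make_row_key(user_id, site_id):
    """Build a compound row key such as b'user1:site1'."""
    if SEP in user_id or SEP in site_id:
        raise ValueError("ids must not contain the separator")
    return (user_id + SEP + site_id).encode("utf-8")


def parse_row_key(row_key):
    """Split a compound row key back into (user_id, site_id)."""
    user_id, site_id = row_key.decode("utf-8").split(SEP, 1)
    return user_id, site_id
```

Because HBase sorts rows by the raw bytes of the key, putting the user ID first keeps all of one user's rows next to each other, which is what makes the range scans further down possible.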

Set the table to keep however many log entries (versions) you want for a given page, and store your data as a new version each time (manually setting the timestamp if necessary).

Since HBase maintains timestamps for each cell, you don't need a separate column for the access time.
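To illustrate how timestamped versions can replace a separate access-time column, here is a toy in-memory model of a single cell that keeps at most N versions. This is not HBase code: the class VersionedCell and its methods are hypothetical, and real HBase typically discards excess versions later (for example during compaction) rather than eagerly on every put.

```python
import time


class VersionedCell:
    """Toy model of one cell that keeps a bounded number of versions."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, value, timestamp=None):
        # Default to "now" in milliseconds when no timestamp is supplied.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions[timestamp] = value
        # Drop the oldest versions once the limit is exceeded.
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:max_versions]
```

Each stored page carries its own timestamp, so the access time comes back with the value for free. Note that in this model, as in HBase, two writes with the same timestamp to the same cell overwrite each other.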

You would thus have a table with contents something like this:

Row             Page

user1:site1     www.example.com/index.html@1234567890
                www.example.com/somepage.html@123456800
                www.example.com/someotherpage.html@123456900
                www.example.com/index.html@123457123

user1:site2     blahblah

user2:site1     etc...

To deal with your two example requests:

To find all rows for a user, do a scan (be sure to set maxVersions) from userx:0 to userx+1:0, and then parse the site IDs out of each result row.
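The per-user scan can be sketched with a sorted list of row keys standing in for the table. This is only a model of the key-range logic, not HBase client code: the functions prefix_stop, scan and sites_for_user are hypothetical, and I assume keys of the form userid:siteid with ':' never appearing inside an ID.

```python
from bisect import bisect_left


def prefix_stop(prefix):
    """Smallest key greater than every key starting with prefix.

    Assumes prefix is non-empty and its last byte is not 0xff.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), like a start/stop row scan."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def sites_for_user(sorted_keys, user_id):
    """Scan all rows of one user and parse the site id out of each row key."""
    start = user_id.encode("utf-8") + b":"
    rows = scan(sorted_keys, start, prefix_stop(start))
    return [row.split(b":", 1)[1].decode("utf-8") for row in rows]


keys = sorted([b"user1:site1", b"user1:site2", b"user10:site1", b"user2:site1"])
user1_sites = sites_for_user(keys, "user1")
```

Including the separator in the start key matters: scanning from just "user1" would also pick up "user10".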

To get all pages for a specific user/site, just do a scan from userx:sitex to userx:sitex+1. Last I checked, you can't set maxVersions on a get, so that isn't an option.
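Here is a small model of the single user/site scan, again as an in-memory sketch rather than real client code. The names single_row_stop and scan_versions, the sample pages and the timestamps are all made up for illustration. One assumption worth calling out: I build the stop key by appending a zero byte instead of incrementing the last character, because with variable-length site IDs a stop key of "user1:site2" would also sweep in "user1:site10".

```python
def single_row_stop(row_key):
    """Smallest key strictly greater than row_key, so the scan covers one row."""
    return row_key + b"\x00"


def scan_versions(table, start, stop, max_versions):
    """Return (row, timestamp, page) for rows in [start, stop), newest first."""
    results = []
    for row in sorted(table):
        if start <= row < stop:
            newest_first = sorted(table[row], reverse=True)[:max_versions]
            results.extend((row, ts, page) for ts, page in newest_first)
    return results


# Hypothetical data: row key -> list of (timestamp, page) versions.
table = {
    b"user1:site1": [(100, "/index.html"), (200, "/somepage.html"), (300, "/index.html")],
    b"user1:site10": [(150, "/other.html")],
    b"user1:site2": [(120, "/blah.html")],
}

start = b"user1:site1"
pages = scan_versions(table, start, single_row_stop(start), max_versions=10)
```

If your site IDs are fixed width, the simpler "sitex+1" stop key works just as well.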

To put it simply, column families represent groups of data that you want stored together, presumably because you will often read them at the same time. Placing columns in separate families means the data is stored separately, so you get faster reads when you only want one column, but you need to read two different places to get both columns.
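A toy model of that trade-off, with one dict per column family standing in for HBase's separate on-disk stores. The class ToyTable and the family and column names are hypothetical and only count how many stores a read has to touch:

```python
class ToyTable:
    """Each column family gets its own store (a plain dict here)."""

    def __init__(self, families):
        self.stores = {family: {} for family in families}

    def put(self, row, family, qualifier, value):
        self.stores[family][(row, qualifier)] = value

    def get(self, row, columns):
        """Read (family, qualifier) columns; return (values, stores_touched)."""
        values = {}
        touched = set()
        for family, qualifier in columns:
            touched.add(family)
            values[(family, qualifier)] = self.stores[family].get((row, qualifier))
        return values, len(touched)


t = ToyTable(["log", "meta"])
t.put(b"user1:site1", "log", "page", "/index.html")
t.put(b"user1:site1", "log", "referrer", "/home.html")
t.put(b"user1:site1", "meta", "agent", "browser")

_, same_family = t.get(b"user1:site1", [("log", "page"), ("log", "referrer")])
_, two_families = t.get(b"user1:site1", [("log", "page"), ("meta", "agent")])
```

Reading two columns from the same family touches one store, while splitting them across families touches two.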

Of course, depending on your other needs, you may want to take a different approach. I would strongly recommend reading the Bigtable paper to better understand the structure of HBase (since it is strongly based on Bigtable).

To better understand the internals of HBase, Lars George's blog is also great.

juhanic 2010-05-15 10:40:34
