Given the following HBase schema scenario (from the official FAQ)...

How would you design an HBase table for a many-to-many association between two entities, for example Student and Course?

I would define two tables:

Student:
    student id
    student data (name, address, ...)
    courses (use course ids as column qualifiers here)

Course:
    course id
    course data (name, syllabus, ...)
    students (use student ids as column qualifiers here)

This schema gives you fast access to both queries: show all classes for a student (Student table, courses family), or all students for a class (Course table, students family).

How would you satisfy the request: "Give me all the students that share at least two courses in common"? Can you build a "query" in HBase that will return that set, or do you have to retrieve all the pertinent data and crunch it yourself in code?

+2  A: 

This type of querying is not available through the 0.20.0 API. I'm not sure if there are any plans for it (I doubt it would appear anytime soon). You'll find some roadmap details on the HBase website that might answer that question.

You'll need to compute the answer in your own application (although I'd love to be proved wrong).
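As a rough illustration of "crunching it yourself", here is a minimal client-side sketch. It assumes the rows of the Course table have already been read into a plain map (course id to enrolled student ids). The class and method names are made up for this example and are not part of the HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: the input map stands in for rows of the Course table,
// where each row key is a course id and the "students" family holds student
// ids as column qualifiers. Nothing here calls HBase itself.
final class SharedCourses {

    // Returns "studentA|studentB" -> number of shared courses, keeping only
    // the pairs that share at least minShared courses.
    static Map<String, Integer> pairsSharing(Map<String, List<String>> courseToStudents,
                                             int minShared) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> enrolled : courseToStudents.values()) {
            // Sort and de-duplicate so each pair is always written the same way.
            List<String> students = new ArrayList<>(new TreeSet<>(enrolled));
            for (int i = 0; i < students.size(); i++) {
                for (int j = i + 1; j < students.size(); j++) {
                    counts.merge(students.get(i) + "|" + students.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs below the threshold.
        counts.values().removeIf(n -> n < minShared);
        return counts;
    }
}
```

Calling `SharedCourses.pairsSharing(courses, 2)` would answer the question as asked, at the cost of pulling every course row to the client and doing work quadratic in the size of each class.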

bradhouse
+1  A: 

The query as described is better suited to a relational database. You can answer the query quickly, however, by precomputing the result. For example, you might have a table where the key is the number of classes in common, and the cells are individual students that have key-many classes in common.

You could use a variant on this to answer questions like "which students are in class X and class Y": use the classes as pieces of the key (in alphabetical ordering, or something at least consistent), and again, each column is a student.
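A small sketch of that variant, under the assumption that Student rows are available as a map of student id to course ids. The row-key format (`courseA+courseB`, sorted alphabetically) and all names below are illustrative choices, not something HBase prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical precomputed table: row key = a pair of course ids in a
// consistent (alphabetical) order, columns = the students taking both.
final class CoursePairIndex {

    // Sorting the two ids makes (X, Y) and (Y, X) resolve to the same row.
    static String rowKey(String courseA, String courseB) {
        return courseA.compareTo(courseB) <= 0
                ? courseA + "+" + courseB
                : courseB + "+" + courseA;
    }

    // studentToCourses stands in for the Student table's "courses" family.
    static SortedMap<String, SortedSet<String>> build(Map<String, List<String>> studentToCourses) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> row : studentToCourses.entrySet()) {
            List<String> courses = new ArrayList<>(new TreeSet<>(row.getValue()));
            for (int i = 0; i < courses.size(); i++) {
                for (int j = i + 1; j < courses.size(); j++) {
                    String key = rowKey(courses.get(i), courses.get(j));
                    index.computeIfAbsent(key, k -> new TreeSet<>()).add(row.getKey());
                }
            }
        }
        return index;
    }
}
```

With such an index, "which students are in class X and class Y" becomes a single lookup of `rowKey(x, y)`. The trade-off is that the index has to be maintained whenever enrollment changes.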

jonathan-stafford
+1  A: 

Use a filter to achieve this.

SingleValueFilter filter = new SingleValueFilter( /* your arguments, based on the API */ );

Then add it to a Scan (org.apache.hadoop.hbase.client.Scan): Scan scan = new Scan(); scan.setFilter(filter);

WackoMax
Could you please expand on your example pseudo-code incorporating the students/courses from the question to demonstrate how a SingleValueFilter would accomplish the task?
Teflon Ted
A: 

@WackoMax: (not sure how to post reply to his answer :( ) I could find no SingleValueFiler (or SingleValueFilter for that matter), nor org.apache.hadoop.hbase.client.Scan, nor org.apache.hadoop.hbase.client.(Record)Scanner.setFil(t)er. Are you sure it can be done? The way I understand HBase architecture it can only be done by creating and maintaining a separate table or with an expensive on-the-fly joining of tables.

I would love to hear more answers, though - I'm still learning HBase too.

A: 

Seems like MapReduce could be one way to solve this; unfortunately it wouldn't give an instant result if it is done on the fly. Just thinking through it, you could, in the map phase, emit each pair of students that end up in the same class. During the reduce phase you could sum the counts per pair and write out (emit) the pairs that had a sum of 2 or more. This approach could be used to pre-generate an index (as suggested earlier) that indicates the pairs of students with "x" courses in common. The key to such an index could be something along the lines of "X/Student1_Key/Student2_Key", where X is the number of courses they have in common. A range scan over the index (e.g., X>=2) would give you your answer. Given HBase's native integration with MapReduce, a solution along these lines should be straightforward.
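To make the idea concrete, here is an in-memory sketch that imitates the map step, the reduce step, and the range scan with plain Java collections rather than an actual MapReduce job. The zero-padded count in the key is my own assumption: without padding, "10/..." would sort before "2/..." lexicographically and the range scan would miss rows. It also assumes student keys contain no "/" character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a MapReduce job over the Course table.
final class PairIndexJob {

    // "Map" step: for one course, emit every pair of enrolled students
    // as "student1/student2", with the two ids in sorted order.
    static List<String> mapCourse(List<String> enrolled) {
        List<String> students = new ArrayList<>(new TreeSet<>(enrolled));
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < students.size(); i++) {
            for (int j = i + 1; j < students.size(); j++) {
                pairs.add(students.get(i) + "/" + students.get(j));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the emitted pairs and write one index key per pair,
    // shaped like "0002/student1/student2" (count first, zero-padded).
    static NavigableSet<String> buildIndex(Map<String, List<String>> courseToStudents) {
        Map<String, Integer> sums = new HashMap<>();
        for (List<String> enrolled : courseToStudents.values()) {
            for (String pair : mapCourse(enrolled)) {
                sums.merge(pair, 1, Integer::sum);
            }
        }
        NavigableSet<String> index = new TreeSet<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            index.add(String.format(Locale.ROOT, "%04d/%s", e.getValue(), e.getKey()));
        }
        return index;
    }

    // Range scan: every index key whose count is at least minShared.
    static SortedSet<String> atLeast(NavigableSet<String> index, int minShared) {
        return index.tailSet(String.format(Locale.ROOT, "%04d/", minShared), true);
    }
}
```

Here `atLeast(index, 2)` plays the role of the X>=2 range scan. In a real table the sorted set would be the row keys of the index table, and the scan would start at the padded prefix.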

Also, following the BigTable model, you wouldn't even need to create two tables. Just precede each record key with a "kind" such as Course: or Student:. Since the rows are ordered lexicographically they are easily scanned by kind. Populate (or generate) the columns needed to support properties for each kind. Since HBase supports highly sparse tables this works well. See this excellent presentation on selecting keys and developing indices with BigTable: http://www.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html. This presentation really helped me understand how to store things in databases such as HBase for efficient retrieval.
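A toy model of the single-table layout, using a sorted map in place of the HBase table. The "Kind:id" key format follows the description above; the class name, method names, and the trick of scanning up to the next character after ":" are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical single table holding both kinds of rows. Row keys look like
// "Student:alice" or "Course:math"; the sorted map mimics lexicographic row order.
final class KindPrefixedTable {

    private final NavigableMap<String, Map<String, String>> rows = new TreeMap<>();

    // Writes one cell; rows are sparse, so each kind only fills its own columns.
    void put(String kind, String id, String column, String value) {
        rows.computeIfAbsent(kind + ":" + id, k -> new TreeMap<>()).put(column, value);
    }

    // Scans all rows of one kind: from "Kind:" (inclusive) to "Kind;" (exclusive),
    // since ';' is the character immediately after ':' in the ordering.
    SortedMap<String, Map<String, String>> scanKind(String kind) {
        return rows.subMap(kind + ":", true, kind + ";", false);
    }
}
```

Because all rows of one kind are contiguous in key order, `scanKind("Student")` touches only the Student rows and never has to skip over Course rows.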

But back to the original question, it seems that when working with HBase you really have to know how your data is to be used so appropriate indices can be developed beforehand to get quick answers. It doesn't appear that random ad-hoc queries will always work out with this model.

Anyway, I'm also new to this so seeing problems like these and possible solutions helps!

jesse-daniels