Searches (and general querying) with HBase and/or Cassandra (best practices?)

I have User model object with quite few fields (properties, if you wish) in it. Say "firstname", "lastname", "city" and "year-of-birth". Each user also gets "unique id".

I want to be able to search by them. How do I do that properly? How to do that at all?

My understanding (will work for pretty much any key-value storage -- first goes key, then value)

u:123456789 = serialized_json_object

("u" as a simple prefix for user's keys, 123456789 is "unique id").

Now, thinking that I want to be able to search by firstname and lastname, I can save in:

f:Steve = u:384734807,u:2398248764,u:23276263 f:Alex = u:12324355,u:121324334

so key is "f" - which is prefix for firstnames, and "Steve" is actual firstname. For "u:Steve" we save as value all user id's who are "Steve's".

That makes every search very-very easy. Querying by few fields (properties) -- say by firstname (i.e. "Steve") and lastname (i.e. "l:Anything") is still easy - first get list of user ids from "f:Steve", then list from "l:Anything", find crossing user ids, an here you go.

Problems (and there are quite a few):

Saving, updating, deleting user is a pain. It has to be atomic and consistent operation. Also, if we have size of value limited to some value - then we are in (potential) trouble. And really not of an answer here. Only zipping the list of user ids? Not too cool, though.
What id we want to add new field to search by. Eventually. Say by "city". We certainly can do the same way "c:Los Angeles" = ..., "c:Chicago" = ..., but if we didn't foresee all those "search choices" from the very beginning, then we will have to be able to create some night job or something to go by all existing User records and update those "c:CITY" for them... Quite a big job!
Problems with locking. User "u:123" updates his name "Alex", and user "u:456" updates his name "Alex". They both have to update "f:Alex" with their id's. That means either we get into overwriting problem, or one update will wait for another (and imaging if there are many of them?!).

What's the best way of doing that? Keeping in mind that I want to search by many fields?

P.S. Please, the question is about HBase/Cassandra/NoSQL/Key-Value storages. Please please - no advices to use MySQL and "read about" SELECTs; and worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)

Being able to query properties directly is one of the features you lose when moving away from SQL, so you need a way to maintain your own index to let you find records.

If your datastore does not have built in indexing or atomic list operations, you will need to deal with the locking issues you mention. However, indexing doesn't necessarily need to be synchronous - maintain a queue of updated records to be reindexed and you have a solution for 3 that can be reused to solve 2 also.

If the index list for a particular value becomes too large for the system to handle in a single list, you can replace the list of users with a list of lists. However, if you have that many records with the same value it probably isn't a particularly useful search criteria anyway.

Another option that is useful in some cases is to use a seperate system for the indexing - for example you could set up lucene to index the records in your main datastore.

Can you suggest best way of implementing the described functionality for Cassandra/HBase? My "own" way is just my guess, what I would like to find out, what's the best practices here are.

alexeypro 2010-04-12 15:16:07

All this is a bit new for real best practices to exist, but the method you describe is fairly close to standard. The main improvement you need is to use whatever support your chosen platform has for lists, so that you can add items without loading the whole list. With Cassandra you would probably use a supercolumn. If you have a pure key-value store like memcached, you can implement lists as values, but you need to implement locks and possibly queues also.

Tom Clarkson 2010-04-13 07:05:30

ansaurus

tags:

views:

answers:

Searches (and general querying) with HBase and/or Cassandra (best practices?)

related questions