views:

622

answers:

3

Hi,

I have gone through some articles regarding Bigtable and NOSQL. It is very interesting that they avoid JOIN operations.

As a basic example, let's take Employee and Department table and assume the data is spread across multiple tables / servers.

Just want to know, if data is spread across multiple servers, how do we do JOIN or UNION operations?

+3  A: 

You would have to do multiple selects, and join the data manually in your application. See this SO post for more information. From that post:

Bigtable datasets can be queried from services like AppEngine using a language called GQL ("gee-kwal") which is a based on a subset of SQL. Conspicuously missing from GQL is any sort of JOIN command. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient. Instead, the programmer has to implement such logic in his application, or design his application so as to not need it.

Kaleb Brasee
+2  A: 

Kaleb's right. You write custom code with a NoSQL solution if your data doesn't fit well into a key-value store. Map-reduce/async processing and custom view caches are common. Brian Aker gave a very funny (and satirical and biased) presentation at the Nov 2009 OpenSQLCamp http://www.youtube.com/watch?v=LhnGarRsKnA. Skip in 40 seconds to hear about joins.

Ken Fox
+1 for the funny video
Igor Brejc
+2  A: 

When you have extremely large data, you probably want to avoid joins. This is because the overhead of an individual key lookup is relatively large (the service needs to figure out which node(s) to query, and query them in parallel and wait for responses). By overhead, I mean latency, not throughput limitation.

This makes joins suck really badly as you'd need to do a lot of foreign key lookups, which would end up going to many,many different nodes (in many cases). So you'd want to avoid this as a pattern.

If it doesn't happen very often, you could probably take the hit, but if you're going to want to do a lot of them, it may be worth "denormalising" the data.

The kind of stuff which gets stored in NoSQL stores is typically pretty "abnormal" in the first place. It is not uncommon to duplicate the same data in all sorts of different places to make lookups easier.

Additionally most nosql don't (really) support secondary indexes either, which means you have to duplicate stuff if you want to query by any other criterion.

If you're storing data such as employees and departments, you're really better off with a conventional database.

MarkR
+1 well said...
gbn