I am starting to study map-reduce databases. How can one implement a reference in a map-reduce database, such as CouchDB or MongoDB? For example, suppose that I have drivers and cars, and I want to mark that some driver drives a car. In SQL it's something like:

SELECT person_id, car_id FROM driver, car WHERE driver.car = car.car_id

(That is, if my memory serves right - I haven't programmed in SQL for a while.)

In languages that have references it is even simpler: an instance of Person can point to instances of Car.

What is the map-reduce equivalent of this sort of relation?

+1  A: 

In document databases, you can embed the related objects in the document that owns the objects, e.g. the driver document also contains all the cars that belong to the driver. That's the power of document databases; they allow you to easily store denormalized data.

{
  "_id": "joe_the_driver",
  "name": "Joe",
  "cars": [
    { "_id": "123-AB", /* car properties */ },
    { "_id": "456-YZ", /* car properties */ }
  ]
}

This format only works for one-to-many relations. If the relation between driver and car is many-to-many, you'll have to store drivers and cars as separate documents that refer to each other by ID:

{
  "_id": "joe_the_driver",
  "car_ids": [ /* IDs that refer to car documents */ ]
}

{
  "_id": "123-AB",
  "driver_ids": [ /* IDs that refer to driver documents */ ]
}

It's important to note that most document databases have no way to enforce relations between documents in the way a SQL database does. Your application is responsible for enforcing and maintaining these relationships.
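
Since the application has to keep both sides of a many-to-many reference in sync itself, here is a minimal sketch of what that could look like in plain JavaScript. The in-memory Map only stands in for the database, and linkDriverToCar and carsOf are hypothetical helper names, not part of any CouchDB or MongoDB API.

```javascript
// In-memory stand-in for a document store, keyed by _id (an assumption for illustration).
const docStore = new Map([
  ["joe_the_driver", { _id: "joe_the_driver", car_ids: [] }],
  ["123-AB", { _id: "123-AB", driver_ids: [] }],
]);

// Hypothetical helper: record the relation on both documents, without duplicates.
function linkDriverToCar(store, driverId, carId) {
  const driver = store.get(driverId);
  const car = store.get(carId);
  if (!driver || !car) {
    // The database will not check this for us, so the application must.
    throw new Error("driver or car document not found");
  }
  if (!driver.car_ids.includes(carId)) driver.car_ids.push(carId);
  if (!car.driver_ids.includes(driverId)) car.driver_ids.push(driverId);
}

// Hypothetical helper: resolve a driver's car IDs to the car documents that still exist.
function carsOf(store, driverId) {
  const driver = store.get(driverId);
  if (!driver) return [];
  return driver.car_ids
    .map((id) => store.get(id))
    .filter((doc) => doc !== undefined);
}

linkDriverToCar(docStore, "joe_the_driver", "123-AB");
console.log(carsOf(docStore, "joe_the_driver").map((car) => car._id));
```

With a real database the two writes are separate updates, so the application also has to decide what happens if only one of them succeeds.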

Niels van der Rest
+1  A: 

In CouchDB you would write a view whose map function emits ALL of the cars and drivers with complex keys, and then use key ranges to pick out both. For example, let's assume your documents look like these two...

{
  "_id": "...",
  "_rev": "...",
  "docType": "driver"
}

{
  "_id": "...",
  "_rev": "...",
  "docType": "car",
  "driver": "driver's _id"
}

You could use duck typing instead of specifying the docType, but I like this method better.

Your map function:

function(doc)
{
  if (doc.docType == "driver")
    emit([doc._id, 0], doc);      // the driver itself, keyed by its own _id
  else if (doc.docType == "car")
    emit([doc.driver, 1], doc);   // a car, keyed by its driver's _id
}

Our complex key is an array, with the first item always being the driver's _id. The second item in the array prevents key collision, and allows us to reference the car or driver directly (more on this later).
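
To see which keys such a view would contain, the map function can be run locally against a few sample documents. This is only a sketch: inside a real view CouchDB supplies emit as a global, so the runMap harness, the emit parameter, and the sample documents are assumptions made for illustration. It reads the document ID from doc._id, which is where CouchDB stores it.

```javascript
// Hypothetical harness: call a map function on each document and collect what it emits.
// Unlike CouchDB, it does not sort the rows by key.
function runMap(mapFn, docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  for (const doc of docs) mapFn(doc, emit);
  return rows;
}

// Same logic as the view's map function, with emit passed in explicitly.
function driverAndCarMap(doc, emit) {
  if (doc.docType === "driver") emit([doc._id, 0], doc);
  else if (doc.docType === "car") emit([doc.driver, 1], doc);
}

const sampleDocs = [
  { _id: "car-1", docType: "car", driver: "joe" },
  { _id: "joe", docType: "driver" },
  { _id: "car-2", docType: "car", driver: "joe" },
];

const emittedRows = runMap(driverAndCarMap, sampleDocs);
console.log(emittedRows.map((row) => JSON.stringify(row.key)));
```

Every key starts with the driver's _id, so once CouchDB sorts the rows, a driver and its cars end up next to each other.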

We can now use the key range query parameters to grab both of the docs.

?startkey=["driver _id"]&endkey=["driver _id", {}]

This basically says "give me any array with the driver _id as the first item, and anything in the second." This works because objects (the second item in the endkey's array) are sorted as the highest. See http://wiki.apache.org/couchdb/View_collation?redirect=ViewCollation#Collation_Specification for more information about how items get sorted/weighed in keys.
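
The startkey and endkey values are JSON, so they need to be URL-encoded when the request is built in code. A minimal sketch, assuming a hypothetical helper named driverRangeQuery; only the two query parameters come from the answer above, the rest is illustration:

```javascript
// Hypothetical helper: build the key-range query string for one driver.
// The keys are JSON-encoded first, then percent-encoded for use in a URL.
function driverRangeQuery(driverId) {
  const startkey = JSON.stringify([driverId]);     // e.g. ["joe"]
  const endkey = JSON.stringify([driverId, {}]);   // e.g. ["joe",{}]
  return (
    "?startkey=" + encodeURIComponent(startkey) +
    "&endkey=" + encodeURIComponent(endkey)
  );
}

console.log(driverRangeQuery("joe"));
```

The resulting string is appended to the view's URL; the view's own path depends on your design document and is not shown here.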

This also scales quite nicely, because we can add more information into our map function without having to change our query in the client. Let's say we add a sponsor docType: we just add another else if for the docType field and then emit([doc.driver, 2], doc);. Now we can pull all three documents in one request with the same key range query from above.

Of course, you can also specify individual documents instead of pulling all of them. ?key=["driver's _id", 1] would pull just the car for the specified driver.
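
On the client, the rows that come back from the range query can be split by the second item of each key. A rough sketch, assuming rows shaped like the entries of a view response (each with key and value); splitDriverRows and the sample rows are made up for illustration:

```javascript
// Hypothetical helper: separate a driver from its cars using the second key item
// (0 for the driver, 1 for a car, as emitted by the map function).
function splitDriverRows(viewRows) {
  const result = { driver: null, cars: [] };
  for (const row of viewRows) {
    const kind = row.key[1];
    if (kind === 0) result.driver = row.value;
    else if (kind === 1) result.cars.push(row.value);
    // Other kinds (e.g. 2 for a sponsor) are ignored in this sketch.
  }
  return result;
}

// Made-up rows in the order the view would return them for driver "joe".
const sampleViewRows = [
  { key: ["joe", 0], value: { _id: "joe", docType: "driver" } },
  { key: ["joe", 1], value: { _id: "car-1", docType: "car", driver: "joe" } },
  { key: ["joe", 1], value: { _id: "car-2", docType: "car", driver: "joe" } },
];

const joeAndCars = splitDriverRows(sampleViewRows);
console.log(joeAndCars.driver._id, joeAndCars.cars.map((car) => car._id));
```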

Cheers.

Sam Bisbee