I have been looking at using MapReduce to build a parallelized record-combining system. The language doesn't matter, and I can use a pre-existing library such as Hadoop or build my own if necessary, so I'm not worried about that.

The problem that I keep running into, however, is that I need the records to be matched on multiple criteria. For example, I may need to match the records based on the person's name or the person's phone number, but not necessarily on both the name and the phone number.

For instance, given the following keys for each record:

  1. 'John Smith' and '555-555-5555'
  2. 'Jane Smith' and '555-555-5555'
  3. 'John Smith' and '555-555-1111'

I want the system to take all three records, figure out that they match on one of the keys, and combine them into a single combined record that has both names ('John Smith' and 'Jane Smith') as well as both phone numbers ('555-555-5555' and '555-555-1111').

Is this something that I can accomplish using MapReduce? If so, how would I go about matching the keys produced by the Map function so that all of the matched records can be passed into the Reduce function?* Alternatively, is there a different/better way I could be doing this? My only real requirement is that it needs to be parallelized.

[*] Please note: I am assuming that the Reduce function can be used in such a way that each call to it produces a single combined record, rather than the Reduce function producing a single result for the entire job.
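
To make that assumption concrete, here is a minimal sketch of the kind of Reduce step I have in mind, assuming Hadoop's org.apache.hadoop.mapreduce API (producing the grouping key is exactly the part I don't know how to do):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per key group; emits one combined record for that group,
    // not one result for the whole job.
    public class CombineReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder combined = new StringBuilder();
            for (Text value : values) {
                if (combined.length() > 0) combined.append(" | ");
                combined.append(value.toString());
            }
            context.write(key, new Text(combined.toString()));
        }
    }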

A: 

I don't think Map is useful here, because you can't really create a single meaningful key for each record that identifies which group it belongs to: a record can join a group through either of its fields.

It is not possible to implement this with Reduce alone, either. Consider the example you yourself gave: if you group on 'Jane Smith', you cannot detect at that point that the first record is related (it only matches through the shared phone number), and so you will ignore it. In fact, matches are transitive, so you could end up chaining names and numbers together until you've got every record in the file. The only way to pick up all the matches is to iteratively scan over the list until you stop finding new links.
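
To make that concrete, here is a rough single-threaded sketch (plain Java, all names made up) of the fixpoint scan I mean: keep merging any two records that share a key until a full pass finds no new links.

    import java.util.*;

    class RecordMerger {
        // Each record is represented simply as the set of its keys
        // (names, phone numbers, ...).
        static List<Set<String>> merge(List<Set<String>> records) {
            List<Set<String>> groups = new ArrayList<>();
            for (Set<String> record : records) {
                groups.add(new HashSet<>(record));
            }
            boolean changed = true;
            while (changed) { // repeat until a full pass finds no new links
                changed = false;
                outer:
                for (int i = 0; i < groups.size(); i++) {
                    for (int j = i + 1; j < groups.size(); j++) {
                        if (!Collections.disjoint(groups.get(i), groups.get(j))) {
                            groups.get(i).addAll(groups.remove(j)); // merge j into i
                            changed = true;
                            break outer; // rescan: the merge may enable new links
                        }
                    }
                }
            }
            return groups;
        }

        public static void main(String[] args) {
            List<Set<String>> records = List.of(
                    new HashSet<>(List.of("John Smith", "555-555-5555")),
                    new HashSet<>(List.of("Jane Smith", "555-555-5555")),
                    new HashSet<>(List.of("John Smith", "555-555-1111")));
            // All three collapse into one group: record 1 links to record 2
            // through the phone number and to record 3 through the name.
            System.out.println(merge(records));
        }
    }

Run on your three example records, this produces a single group containing both names and both phone numbers.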

This is very easy to parallelize, though: you can just share out the records amongst some number of threads, and each can search its own records for new links. I'd suggest treating these sets as rings of data, so that you can record the point at which the most recent link was found; you know you're finished once all threads have completed a full loop without finding anything new.
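
I haven't written the ring bookkeeping out, but here is a rough sketch (plain Java threads, made-up names) of a simpler synchronous variant of the same idea: each pass, the threads split the scan for new links between them, one reported link is applied, and the passes repeat until no thread finds anything.

    import java.util.*;
    import java.util.concurrent.*;

    class ParallelMerger {
        static List<Set<String>> merge(List<Set<String>> input, int threads)
                throws InterruptedException, ExecutionException {
            List<Set<String>> groups = new ArrayList<>();
            for (Set<String> record : input) groups.add(new HashSet<>(record));

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                boolean changed = true;
                while (changed) {
                    changed = false;
                    // Read-only snapshot for this pass; the threads never mutate it.
                    final List<Set<String>> snapshot = List.copyOf(groups);
                    List<Callable<int[]>> tasks = new ArrayList<>();
                    for (int t = 0; t < threads; t++) {
                        final int offset = t;
                        tasks.add(() -> {
                            // Each thread scans an interleaved slice of the outer loop.
                            for (int i = offset; i < snapshot.size(); i += threads) {
                                for (int j = i + 1; j < snapshot.size(); j++) {
                                    if (!Collections.disjoint(snapshot.get(i),
                                                              snapshot.get(j))) {
                                        return new int[] {i, j}; // found a new link
                                    }
                                }
                            }
                            return null; // this slice is link-free
                        });
                    }
                    // Apply at most one reported link per pass, sequentially, so
                    // the indices stay valid; repeat until nobody finds a link.
                    for (Future<int[]> result : pool.invokeAll(tasks)) {
                        int[] link = result.get();
                        if (link != null && !changed) {
                            groups.get(link[0]).addAll(groups.remove(link[1]));
                            changed = true;
                        }
                    }
                }
            } finally {
                pool.shutdown();
            }
            return groups;
        }
    }

Merging sequentially between passes keeps the shared sets free of locks; the ring scheme described above would avoid the per-pass barrier at the cost of more careful bookkeeping.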

KernelJ
