tags:

views: 65

answers: 2

I would love to get a sense of whether Hadoop is the right tool for the problem I have.

I'm building an offline process (run once a month or once a quarter) that matches two data sets, A and B. Dataset A lives in Oracle; dataset B is an XML file. Dataset A is about 20M records, dataset B is about 6M records.
Each record represents a musical song and has the following format:

song {
  songid:

  // strings; avg_num_of_titles_per_song = 1.4, std_deviation = 1.9
  titles: []

  // strings; avg_num_of_performers_per_song = 1.5, std_deviation = 0.9
  performers: []
}

Two records match if:

- at least one title matches, using either an exact match, a phonetic algorithm, or a distance algorithm
- at least one performer matches, using the same algorithms: exact, phonetic, distance, etc. (we're still evaluating matching algorithms)
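For concreteness, here is a minimal Java sketch of what such a predicate could look like. The Soundex and Levenshtein implementations come from Apache Commons Codec and Commons Text; the edit-distance threshold and the Song class are assumptions for illustration, not settled choices:

import org.apache.commons.codec.language.Soundex;
import org.apache.commons.text.similarity.LevenshteinDistance;
import java.util.List;

class Song {
    String songId;
    List<String> titles;
    List<String> performers;
}

public class SongMatcher {
    // One Soundex encoder and one Levenshtein scorer, reused across calls.
    private static final Soundex SOUNDEX = new Soundex();
    private static final LevenshteinDistance LEVENSHTEIN = new LevenshteinDistance();
    private static final int MAX_EDIT_DISTANCE = 2; // assumed threshold, not from the question

    // True if any value in a matches any value in b under any of the three tests.
    static boolean anyFieldMatch(List<String> a, List<String> b) {
        for (String x : a) {
            for (String y : b) {
                if (x.equalsIgnoreCase(y)) return true;                        // exact
                if (SOUNDEX.encode(x).equals(SOUNDEX.encode(y))) return true;  // phonetic (ASCII-safe input assumed)
                if (LEVENSHTEIN.apply(x, y) <= MAX_EDIT_DISTANCE) return true; // edit distance
            }
        }
        return false;
    }

    static boolean songsMatch(Song a, Song b) {
        return anyFieldMatch(a.titles, b.titles) && anyFieldMatch(a.performers, b.performers);
    }
}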

The output of this process is two data sets: (1) single matches, where a record in A matches exactly one record in B and that record in B matches only that record in A; (2) multiple matches. A sketch of that split is below.
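Splitting the results into those two buckets is cheap once the candidate pairs exist. A minimal sketch, assuming the matcher has already produced a list of (A-id, B-id) pairs:

import java.util.*;

public class MatchSplitter {
    // Count how often each A-id and each B-id occurs across all pairs;
    // a pair is a "single match" exactly when both endpoints occur once.
    public static void classify(List<String[]> pairs,     // each pair is {aId, bId}
                                List<String[]> single,
                                List<String[]> multiple) {
        Map<String, Integer> countA = new HashMap<>();
        Map<String, Integer> countB = new HashMap<>();
        for (String[] p : pairs) {
            countA.merge(p[0], 1, Integer::sum);
            countB.merge(p[1], 1, Integer::sum);
        }
        for (String[] p : pairs) {
            boolean oneToOne = countA.get(p[0]) == 1 && countB.get(p[1]) == 1;
            (oneToOne ? single : multiple).add(p);
        }
    }
}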

Would Hadoop be the right tool for the job?

Thank you.

+1  A: 

It should work, although your datasets are not really big enough to justify Hadoop; you could probably run the whole thing on a single beefy server. If you do go with Hadoop: first put the smaller dataset into the distributed cache so that every node gets a local copy, then pull the larger dataset out of the Oracle database and upload it into HDFS. Then launch a map job that matches the two datasets. Producing the output is just standard map-reduce programming.
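A rough sketch of that mapper, assuming dataset B fits in each task's memory and reusing the Song/SongMatcher sketch from the question (the cache file name and the Song.parse helper are hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: dataset B is shipped to every node via the distributed
// cache; each mapper loads it once in setup(), then streams dataset A
// from HDFS and emits (A-id, B-id) for every candidate match.
public class MatchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<Song> datasetB = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files added via job.addCacheFile(...) are localized into the
        // task's working directory; "datasetB.txt" is an assumed symlink name.
        try (BufferedReader reader = new BufferedReader(new FileReader("datasetB.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                datasetB.add(Song.parse(line)); // Song.parse is a hypothetical parser
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Song a = Song.parse(value.toString());
        for (Song b : datasetB) {
            if (SongMatcher.songsMatch(a, b)) {
                context.write(new Text(a.songId), new Text(b.songId));
            }
        }
    }
}

Note that comparing every A record against all 6M B records is O(|A| x |B|); in practice you would probably block on something like a phonetic key of the title to cut the number of comparisons down.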

Vlad
+1  A: 

I agree with Vlad's approach, but assuming your data were big enough, you could take a look at this excellent article on how to perform joins using Hive: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2855

Sidharth Shah