I'm new to Hadoop. I'd like to run a few approaches I came up with past you.
Problem:
Two datasets: A and B.
Both datasets represent songs: some top-level attributes, titles (one or more), performers (one or more).
I need to match these datasets using either equality or fuzzy algorithms (such as Levenshtein, Jaccard, Jaro-Winkler, etc.) based on titles and performers.
The dataset sizes: A is 20-30M records, B is roughly 1-6M records.
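To make the matching concrete, this is roughly the per-pair comparison I have in mind, just as a sketch. It assumes Apache Commons Text is on the classpath (any similarity library would do), and the thresholds are arbitrary placeholders I'd still have to tune:

```java
import org.apache.commons.text.similarity.JaccardSimilarity;
import org.apache.commons.text.similarity.JaroWinklerSimilarity;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class TitleMatcher {

    private static final JaroWinklerSimilarity JARO_WINKLER = new JaroWinklerSimilarity();
    private static final JaccardSimilarity JACCARD = new JaccardSimilarity();
    private static final LevenshteinDistance LEVENSHTEIN = LevenshteinDistance.getDefaultInstance();

    // Returns true if the two strings are "close enough" under any of the metrics.
    // The 0.85 / 0.8 / 2 thresholds are placeholders that would still need tuning.
    public static boolean matches(String a, String b) {
        String left = a.trim().toLowerCase();
        String right = b.trim().toLowerCase();
        if (left.equals(right)) {
            return true;                              // exact match first, it's cheapest
        }
        if (JARO_WINKLER.apply(left, right) >= 0.85) {
            return true;
        }
        if (JACCARD.apply(left, right) >= 0.8) {
            return true;
        }
        return LEVENSHTEIN.apply(left, right) <= 2;   // small absolute edit distance
    }
}
```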
So here are the approaches I came up with:
1. Load dataset B (the smaller one) into HDFS. Use MapReduce against dataset A (the bigger one), where:
   - map phase: for each record in A, access HDFS and pull the B records for matching (a rough mapper sketch is below, after the list);
   - reduce phase: write out the matched ID pairs.
2. Load dataset A into a distributed cache (e.g. JBoss Cache) in an optimized form to speed up searching. Use MapReduce against dataset B, where:
   - map phase: for each record in B, query the distributed cache for a match;
   - reduce phase: write out the matched ID pairs.
3. Use MapReduce to join both datasets, where:
   - map phase: gets a record from set A and a record from set B, does the matching;
   - reduce phase: same.
   (I'm fuzzy about this one. First, the join would be a Cartesian product with trillions of record pairs; second, I'm not sure how Hadoop can parallelize that across the cluster.)
4. Use Hive. I'm looking at it right now, trying to figure out how to plug in custom functions that will do the string matching (a skeleton of the UDF I have in mind is at the bottom of the post).
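For approach 1, this is roughly the mapper I'm picturing. Rather than going back to HDFS for every single record of A, I'd read dataset B once per mapper in setup() and keep it in memory keyed by normalized title, assuming the 1-6M B records fit in the heap. The "match.b.path" property and the tab-separated id/title/performer layout are just assumptions for the sketch, and it only catches exact normalized-title matches; the fuzzy comparison above would still have to be layered on top:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SongMatchMapper extends Mapper<LongWritable, Text, Text, Text> {

    // normalized title -> list of "idB<TAB>performer" entries from dataset B
    private final Map<String, List<String>> datasetB = new HashMap<String, List<String>>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the whole of dataset B from HDFS once per mapper.
        // "match.b.path" is a job property I would set when submitting the job.
        Path bPath = new Path(context.getConfiguration().get("match.b.path"));
        FileSystem fs = bPath.getFileSystem(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(bPath)));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");         // assumed layout: idB, title, performer
            String key = normalize(fields[1]);
            List<String> bucket = datasetB.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                datasetB.put(key, bucket);
            }
            bucket.add(fields[0] + "\t" + fields[2]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t"); // assumed layout: idA, title, performer
        List<String> candidates = datasetB.get(normalize(fields[1]));
        if (candidates == null) {
            return;                                     // no exact title match; fuzzy pass omitted here
        }
        for (String candidate : candidates) {
            // emit (idA, idB) pairs; the performer / fuzzy comparison would go here
            context.write(new Text(fields[0]), new Text(candidate.split("\t")[0]));
        }
    }

    private String normalize(String s) {
        return s.trim().toLowerCase();
    }
}
```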
I'm looking for pointers on which approach would be the best candidate, or maybe there are other approaches that I'm not seeing.
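And for approach 4, this is the kind of custom function I'm trying to plug into Hive: an untested skeleton using the simple UDF interface, where the function name and the 0.85 threshold are placeholders of my own:

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;

// Intended usage from Hive, roughly:
//   ADD JAR fuzzy-udf.jar;
//   CREATE TEMPORARY FUNCTION fuzzy_eq AS 'FuzzyEquals';
//   ... WHERE fuzzy_eq(a.title, b.title, 0.85)
public class FuzzyEquals extends UDF {

    private final JaroWinklerSimilarity jaroWinkler = new JaroWinklerSimilarity();

    // Returns true when the two strings score above the given similarity threshold.
    public BooleanWritable evaluate(Text left, Text right, double threshold) {
        if (left == null || right == null) {
            return new BooleanWritable(false);
        }
        double score = jaroWinkler.apply(
                left.toString().trim().toLowerCase(),
                right.toString().trim().toLowerCase());
        return new BooleanWritable(score >= threshold);
    }
}
```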