I would love to get a sense of whether Hadoop is the right tool for the problem I have.
I'm building an offline process (run once a month or once a quarter) that matches two data sets, A and B.
Dataset A lives in Oracle; dataset B is an XML file. Dataset A has about 20M records, dataset B about 6M.
Each record represents a song and has the following format:
song {
    songid:
    // strings; avg_num_of_titles_per_song = 1.4, std_deviation = 1.9
    titles: []
    // strings; avg_num_of_performers_per_song = 1.5, std_deviation = 0.9
    performers: []
}
Two records match if:
- at least one title matches, using either exact match, a phonetic algorithm, or a distance algorithm, and
- at least one performer matches, using the same kinds of algorithms: exact, phonetic, distance, etc. (we're still evaluating matching algorithms).
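To make the criteria concrete, here is roughly the kind of pairwise check we have in mind. This is only a sketch: Soundex stands in for the phonetic algorithm, Levenshtein distance for the distance algorithm, and the edit-distance threshold is an arbitrary placeholder, since the actual algorithms are still being evaluated.

import java.util.List;

import org.apache.commons.codec.language.Soundex;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class SongMatcher {

    private static final Soundex SOUNDEX = new Soundex();
    private static final LevenshteinDistance DISTANCE = LevenshteinDistance.getDefaultInstance();
    private static final int MAX_EDIT_DISTANCE = 2; // placeholder threshold, still to be tuned

    // Two songs match if at least one title pair AND at least one performer pair match.
    public static boolean songsMatch(List<String> titlesA, List<String> performersA,
                                     List<String> titlesB, List<String> performersB) {
        return anyPairMatches(titlesA, titlesB) && anyPairMatches(performersA, performersB);
    }

    private static boolean anyPairMatches(List<String> left, List<String> right) {
        for (String a : left) {
            for (String b : right) {
                if (stringsMatch(a, b)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static boolean stringsMatch(String a, String b) {
        String na = normalize(a);
        String nb = normalize(b);
        if (na.equals(nb)) {
            return true;                                    // exact match
        }
        if (!na.isEmpty() && !nb.isEmpty()
                && SOUNDEX.soundex(na).equals(SOUNDEX.soundex(nb))) {
            return true;                                    // phonetic match
        }
        return DISTANCE.apply(na, nb) <= MAX_EDIT_DISTANCE; // distance match
    }

    // Keep only ASCII letters so Soundex never sees characters it cannot map.
    private static String normalize(String s) {
        return s.toUpperCase().replaceAll("[^A-Z]", "");
    }
}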
The output of the process is two data sets: (1) single matches, where a record in A matches exactly one record in B and that record in B matches only that record in A, and (2) multiple matches.
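If we went the Hadoop route, my rough idea is a MapReduce job that blocks candidates by a phonetic key of the title, so records from A and B that could possibly match land in the same reducer, where the pairwise check above is applied. The sketch below assumes both datasets have first been exported to tab-separated text of the form source \t songid \t title1|title2 \t performer1|performer2; the file layout, field order, and the use of Soundex as the blocking key are all assumptions on my part.

import java.io.IOException;

import org.apache.commons.codec.language.Soundex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each record once per title, keyed by the title's Soundex code, so that
// records from A and B sharing a phonetic title code meet in the same reducer.
public class TitleBlockingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Soundex soundex = new Soundex();
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: source \t songid \t title1|title2 \t performer1|performer2
        String[] fields = line.toString().split("\t");
        if (fields.length < 4) {
            return; // skip malformed lines
        }
        String[] titles = fields[2].split("\\|");
        for (String title : titles) {
            // Strip to ASCII letters before encoding so Soundex cannot reject the input.
            String code = soundex.soundex(title.toUpperCase().replaceAll("[^A-Z]", ""));
            if (code.isEmpty()) {
                continue;
            }
            outKey.set(code);
            outValue.set(line); // keep the full record, including the A/B source tag
            context.write(outKey, outValue);
        }
    }
}

The reducer would then compare every A record against every B record within a key and emit (songidA, songidB) pairs, and a second job keyed by songid could separate the single matches from the multiple matches. Does that sound like a reasonable fit for Hadoop, or is it overkill at this data size?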
Would Hadoop be the right tool for the job?
Thank you.