tags:

views: 65

answers: 2

I would love to get a sense of whether Hadoop is the right tool for the problem I have.

I'm building an offline process (run once a month or once a quarter) that matches two data sets, A and B. Dataset A lives in Oracle; dataset B is an XML file. Dataset A is about 20M records, dataset B is about 6M records.
Each record represents a musical song and has the following format:

song {
  songid:

  // strings; avg_num_of_titles_per_song = 1.4, std_deviation = 1.9
  titles: []

  // strings; avg_num_of_performers_per_song = 1.5, std_deviation = 0.9
  performers: []
}

Two records match if:

- at least one title matches, using either an exact match, a phonetic algorithm, or a distance algorithm
- at least one performer matches, using the same algorithms: exact, phonetic, distance, etc. (we're still evaluating matching algorithms)
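For concreteness, here is a minimal Java sketch of what such a predicate could look like. The Soundex and Levenshtein implementations come from Apache Commons Codec and Commons Text; the edit-distance threshold and the Song class are assumptions for illustration, not settled choices:

import org.apache.commons.codec.language.Soundex;
import org.apache.commons.text.similarity.LevenshteinDistance;
import java.util.List;

class Song {
    String songId;
    List<String> titles;
    List<String> performers;
}

public class SongMatcher {
    // One Soundex encoder and one Levenshtein scorer, reused across calls.
    private static final Soundex SOUNDEX = new Soundex();
    private static final LevenshteinDistance LEVENSHTEIN = new LevenshteinDistance();
    private static final int MAX_EDIT_DISTANCE = 2; // assumed threshold, not from the question

    // True if any value in a matches any value in b under any of the three tests.
    static boolean anyFieldMatch(List<String> a, List<String> b) {
        for (String x : a) {
            for (String y : b) {
                if (x.equalsIgnoreCase(y)) return true;                        // exact
                if (SOUNDEX.encode(x).equals(SOUNDEX.encode(y))) return true;  // phonetic (ASCII-safe input assumed)
                if (LEVENSHTEIN.apply(x, y) <= MAX_EDIT_DISTANCE) return true; // edit distance
            }
        }
        return false;
    }

    static boolean songsMatch(Song a, Song b) {
        return anyFieldMatch(a.titles, b.titles) && anyFieldMatch(a.performers, b.performers);
    }
}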

The output of this process is two data sets: (1) single matches, where a record in A matches exactly one record in B and that record in B matches only that record in A; (2) multiple matches. A sketch of that split is below.
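Splitting the results into those two buckets is cheap once the candidate pairs exist. A minimal sketch, assuming the matcher has already produced a list of (A-id, B-id) pairs:

import java.util.*;

public class MatchSplitter {
    // Count how often each A-id and each B-id occurs across all pairs;
    // a pair is a "single match" exactly when both endpoints occur once.
    public static void classify(List<String[]> pairs,     // each pair is {aId, bId}
                                List<String[]> single,
                                List<String[]> multiple) {
        Map<String, Integer> countA = new HashMap<>();
        Map<String, Integer> countB = new HashMap<>();
        for (String[] p : pairs) {
            countA.merge(p[0], 1, Integer::sum);
            countB.merge(p[1], 1, Integer::sum);
        }
        for (String[] p : pairs) {
            boolean oneToOne = countA.get(p[0]) == 1 && countB.get(p[1]) == 1;
            (oneToOne ? single : multiple).add(p);
        }
    }
}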

Would Hadoop be the right tool for the job?

Thank you.

+1  A: 

It should work, although your datasets are not really big enough to justify Hadoop; you could probably run the whole thing on a single beefy server. If you do go with Hadoop: first put the smaller dataset into the distributed cache so that every node gets a local copy, then pull the larger dataset out of the Oracle database and upload it into HDFS. Then launch a map job that matches the two datasets. Producing the output is just standard map-reduce programming.
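A rough sketch of that mapper, assuming dataset B fits in each task's memory and reusing the Song/SongMatcher sketch from the question (the cache file name and the Song.parse helper are hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: dataset B is shipped to every node via the distributed
// cache; each mapper loads it once in setup(), then streams dataset A
// from HDFS and emits (A-id, B-id) for every candidate match.
public class MatchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<Song> datasetB = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files added via job.addCacheFile(...) are localized into the
        // task's working directory; "datasetB.txt" is an assumed symlink name.
        try (BufferedReader reader = new BufferedReader(new FileReader("datasetB.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                datasetB.add(Song.parse(line)); // Song.parse is a hypothetical parser
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Song a = Song.parse(value.toString());
        for (Song b : datasetB) {
            if (SongMatcher.songsMatch(a, b)) {
                context.write(new Text(a.songId), new Text(b.songId));
            }
        }
    }
}

Note that comparing every A record against all 6M B records is O(|A| x |B|); in practice you would probably block on something like a phonetic key of the title to cut the number of comparisons down.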

Vlad
+1  A: 

I agree with Vlad's approach, but assuming your data were big enough, you could take a look at this excellent article on how to perform joins using Hive: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2855

Sidharth Shah