views: 31
answers: 1
Hi All,

Loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, not with Amazon's Elastic MapReduce service. Does anybody have insights into running jobs defined with the toolkit on Elastic MapReduce servers? It isn't readily apparent from the web interface, and I'd love to avoid the headache of setting up a cluster by hand on EC2.

I've looked into uploading files under the 'streaming' option (as that's what MRToolkit uses), but Amazon expects separate files for the mapper and reducer -- typical MRToolkit style defines them in a single file as subclasses of predefined Base(Map|Reduce) classes.
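For context, the single-file style looks roughly like this (class and method names are illustrative, not the toolkit's exact API):

    # Illustrative sketch only -- both halves of the job live in one
    # script, so there is no separate mapper file and reducer file to
    # upload the way the streaming option expects.
    require 'mrtoolkit'

    class MyMap < BaseMap            # mapper subclass
      def process(input, output)
        # emit one key/value pair per record (fields illustrative)
        output
      end
    end

    class MyReduce < BaseReduce      # reducer subclass, same file
      def process(input, output)
        # aggregate values per key (fields illustrative)
        output
      end
    end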

Thanks much for any thoughts.

Isaac

+1  A: 

It's doable, but not through the web GUI.

  • Download and install the Elastic MapReduce Ruby client
  • Create your cluster: elastic-mapreduce --create --alive [params to size cluster]
  • Confirm your Elastic MapReduce master security group has port 22 open
  • SSH into your master node
  • Use git / scp to copy over your application code
  • Run your app (see the session sketch below)
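
Condensed into a session, it looks something like this (flag names per the client's --help; the key pair, sizes, and host names are placeholders):

    # Create a long-running ("alive") cluster; this prints a jobflow ID.
    elastic-mapreduce --create --alive --name "mrtoolkit" \
        --key-pair mykeypair --num-instances 4 --instance-type m1.small

    # Find the master node's public DNS name.
    elastic-mapreduce --list

    # EMR nodes use the 'hadoop' user; copy your code over and log in.
    scp -i ~/mykeypair.pem -r myapp/ hadoop@[master-public-dns]:~/
    ssh -i ~/mykeypair.pem hadoop@[master-public-dns]

    # On the master node: run your MRToolkit job as usual.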
Ryan Cox
Ryan, thanks for the pointers. I've noticed that EMR lets you specify input and output buckets/directories on S3 -- do you know if there's a way to leverage that functionality with MRToolkit instead of manually copying the data over (with something like s3cmd)? Again, thanks much. Isaac
isparling
Just use the syntax s3n://my-input-bucket/prod/logs... Hadoop understands the s3n protocol and will pull the data directly from S3.
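For example (bucket names below are hypothetical), from the master node:

    # Hadoop reads the bucket directly; no s3cmd copy step needed.
    hadoop fs -ls s3n://my-input-bucket/prod/logs

    # The same URLs work anywhere your job expects input/output paths,
    # e.g. s3n://my-input-bucket/prod/logs and s3n://my-output-bucket/out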
Ryan Cox