I'm about to start a MapReduce project which will run on AWS, and I am presented with a choice: use Java, or use C++.

I understand that writing the project in Java would make more functionality available to me; however, C++ could pull it off too, through Hadoop Streaming.

Mind you, I have little background in either language. A similar project has been done in C++, and that code is available to me.

So my question: is this extra functionality available through AWS, or is it only relevant if you have more control over the cloud? Is there anything else I should bear in mind in order to make a decision, like the availability of Hadoop plugins that work better with one language than the other?

Thanks in advance

A: 

It depends on your needs. What is your input/output? Is it simple text files? Records with newline delimiters? Do you need a special combiner? A partitioner?

What I mean is that if you need only the Hadoop basics, then streaming will be fine. But if you need a little more complexity (from the Hadoop framework, not from your own business logic), a regular hadoop jar job will be more flexible.
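For example, here is a minimal sketch (the class name and the prefix rule are purely illustrative) of the kind of hook that lives in the Java API; a streaming job falls back to the framework defaults unless you supply a Java class for hooks like this:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: route keys to reducers by their
// first character instead of the default hash, so keys sharing a
// prefix land in the same reduce task.
public class PrefixPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // send empty keys to the first reducer
        }
        // charAt returns the Unicode code point; the mask keeps it non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
```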

Sagie

sagie
Well, my input will be one big text sequence, I presume in the 1-100 GB region. I'll need to chop the sequence into pieces. I can't tell you whether I need a special combiner or partitioner, as I have yet to program Hadoop on my own; I'm still in the "reading tutorials" phase. Will all this added flexibility be available through AWS, or have they turned things off for security, etc.?
aeolist
I am only at the beginning of using AWS as well. As far as I can tell, if you use M/R to process text files with a well-known record format, it doesn't really matter whether you use hadoop jar or streaming; choose the one you are more comfortable with (Java vs. C++). If you'll need to create your own customized input/output formats (see the sketch after this thread), or if you'll need to start using HBase, etc., go for Java; you won't have that flexibility in streaming. BTW, what about Hadoop Pipes?
sagie
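As a concrete illustration of the customized input formats mentioned in the thread above, here is a minimal sketch with an invented class name. Real logic for chopping a long sequence would live in a custom RecordReader; this merely shows the hook the Java API exposes by disabling the default splitting:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative custom input format: tell the framework never to split
// input files, so each file is read start-to-finish by a single mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```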
+3  A: 

Hey Aeolist,

You have a few options for running Hadoop on AWS. The simplest is to run your MapReduce jobs via their Elastic MapReduce service: http://aws.amazon.com/elasticmapreduce. You could also run a Hadoop cluster on EC2, as described at http://archive.cloudera.com/docs/ec2.html.

If you suspect you'll need to write your own input/output formats, partitioners, and combiners, I'd recommend using Java with the latter system. If your job is relatively simple and you don't plan to use your Hadoop cluster for any other purpose, I'd recommend choosing the language with which you are most comfortable and using EMR.
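To make the Java route concrete, here is a sketch of a self-contained job (all class names are invented for the example) that wires in a combiner, one of the hooks discussed above, through the Java API:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCount {

    // Mapper: emit (token, 1) for every whitespace-separated token.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text token = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                token.set(it.nextToken());
                ctx.write(token, ONE);
            }
        }
    }

    // Reducer: sum the counts for each token. Reused as the combiner,
    // which is legal here because addition is associative.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "token-count");
        job.setJarByClass(TokenCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // the combiner hook
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```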

Either way, good luck!

Disclosure: I am a founder of Cloudera.

Regards, Jeff

Jeff Hammerbacher
Thanks for your answer; I've read some of Cloudera's presentations, and they were really helpful.
aeolist
+1  A: 

I decided that the flexibility of Java was more important than the effort of porting my current C++ code to Java.

Thanks for all your answers.

aeolist