I am starting a new Hadoop project that will have multiple Hadoop jobs (and hence multiple JAR files). Since I'm using Mercurial for source control, I was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them all in the same repo, broken down into folders?
A:
If you're pipelining the Hadoop jobs (the output of one is the input of another), I've found it's better to keep most of it in the same repository, since I tend to accumulate a lot of common methods that get reused across the various MR jobs.
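For what it's worth, a single-repo layout along these lines has worked for me (the directory names here are purely illustrative):

    hadoop-jobs/
      common/           shared parsing and I/O helpers reused across jobs
      job-sessionize/   one MR job (Mapper, Reducer, driver)
      job-aggregate/    another MR job that depends on common/
      lib/              third-party JARs
      build.xml         the build that packages the job JAR(s)

That way the shared code lives next to the jobs that use it, and refactoring a common method doesn't mean synchronizing changes across two repos.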
Personally, I keep streaming jobs in a separate repo from my more traditional jobs, since there are generally no shared dependencies between the two.
Are you planning on using the DistributedCache or streaming jobs? You might want a separate directory for files you distribute. Do you really need a JAR per Hadoop job? I've found I don't.
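For instance, Hadoop's ProgramDriver (the same class the bundled examples JAR uses for its entry point) lets a single JAR dispatch to several jobs by name. A minimal sketch, where WordCount and LogParse stand in for your own job classes:

    import org.apache.hadoop.util.ProgramDriver;

    // One entry point that dispatches to multiple jobs by name,
    // so a single JAR can carry every job in the project.
    public class JobDriver {
      public static void main(String[] args) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
          // WordCount and LogParse are hypothetical job classes from your project
          pgd.addClass("wordcount", WordCount.class, "Counts words in the input");
          pgd.addClass("logparse", LogParse.class, "Parses raw log files");
          pgd.driver(args); // runs the job named by args[0]
          exitCode = 0;
        } catch (Throwable t) {
          t.printStackTrace();
        }
        System.exit(exitCode);
      }
    }

Then "hadoop jar myjobs.jar wordcount <in> <out>" selects the job at run time, and adding a new job is just another addClass() line rather than another JAR.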
If you give more details about what you plan on doing with Hadoop, I can see what else I can suggest.
Eric Wendelin
2010-06-02 04:34:44
Thanks Eric. I'm not planning on doing any streaming jobs yet (I may get there in the future, but not yet). The project is very young and still growing, so I'm curious how to lay a good foundation that can accommodate further growth.
Alex N.
2010-06-02 22:51:02