Hi all, I had a quick Hadoop streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?

Thanks!

A: 

If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zip file, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input/foo \
    -output output \
    -file /tmp/foo.py \
    -file /tmp/lib.zip

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
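
In case the archive is not unpacked on your setup, here is a rough sketch of a fallback on the Python side. It assumes lib.zip shows up intact in the task's working directory (I have not checked this on every Hadoop version) and uses Python's standard ability to import modules from a zip file on sys.path. The names add_archive_to_path and run_mapper are made up for illustration, and the word-count body is only a placeholder for your real mapper logic.

```python
import os
import sys


def add_archive_to_path(archive_name="lib.zip"):
    """Put a zip archive on sys.path so its modules can be imported.

    Assumption: the archive shipped with -file is present, still zipped,
    in the task's current working directory.
    Returns True if the archive was found and added, False otherwise.
    """
    archive_path = os.path.abspath(archive_name)
    if not os.path.isfile(archive_path):
        return False
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    return True


def run_mapper(lines, out=sys.stdout):
    """Placeholder mapper: emit one 'word<TAB>1' record per token."""
    for line in lines:
        for word in line.split():
            out.write("%s\t1\n" % word)
```

In mapper.py you would call add_archive_to_path("lib.zip") before importing anything that lives inside the archive, then call run_mapper(sys.stdin). If the helper returns False, the archive was not found under that name, which is worth logging to stderr so it appears in the task logs.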

Karl Anderson