views: 111
answers: 1

Hi, I am running Hadoop 0.20.1 under SLES 10 (SUSE).

My map task takes a file and generates a few more; I then generate my results from those files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.

Right now I am using the temp folder and the task id to create a unique folder, and then working within subfolders of that folder:

reduceTaskId = job.get("mapred.task.id");
reduceTempDir = job.get("mapred.temp.dir");
// Build a unique per-task folder under the temp dir.
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);

The problem with this approach is that I am not sure it is optimal; I also have to delete each new folder myself or I start to run out of space. Thanks, akintayo

(edit) I found that the best place to keep files you don't want beyond the life of the map would be job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure whether the delete is done on a per-key basis or once per tasktracker.
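To make that concrete, here is a rough sketch of what I mean, written against the old org.apache.hadoop.mapred API in 0.20.x; the class name and the per-task-attempt subfolder are only illustrative.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ScratchDirMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    private File scratchDir;

    @Override
    public void configure(JobConf job) {
        // job.local.dir is a job-scoped directory on the tasktracker's local
        // disk that the framework cleans up for you.
        String jobLocalDir = job.get("job.local.dir");
        // Add the task attempt id so concurrent tasks on the same node do not collide.
        String taskId = job.get("mapred.task.id");
        scratchDir = new File(jobLocalDir, taskId);
        if (!scratchDir.exists() && !scratchDir.mkdirs()) {
            throw new RuntimeException("Could not create " + scratchDir);
        }
    }

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Intermediate work files for this record live under scratchDir, e.g.:
        File workFile = new File(scratchDir, key.toString() + ".tmp");
        // ... generate and use the intermediate files here, then emit the real result ...
        output.collect(key, value);
    }
}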

A: 

The problem with that approach is that the sort and shuffle are going to move your data away from where it was localized.

I do not know much about your data, but the distributed cache might work well for you.

${mapred.local.dir}/taskTracker/archive/: the distributed cache. This directory holds the localized distributed cache; thus the localized distributed cache is shared among all the tasks and jobs.

http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.

The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
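Roughly, the usage with the 0.20.x API looks like the sketch below; the HDFS path and the file name are made up for illustration only.

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {

    // At job submission time: register a file that already lives in HDFS.
    public static void addLookupTable(JobConf job) throws URISyntaxException {
        // "/user/akintayo/lookup.dat" is a made-up path for illustration.
        DistributedCache.addCacheFile(new URI("/user/akintayo/lookup.dat"), job);
    }

    // Inside a task (e.g. in Mapper.configure): find the localized copy,
    // which lives under ${mapred.local.dir}/taskTracker/archive/.
    public static File findLookupTable(JobConf job) throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
        if (localFiles != null) {
            for (Path p : localFiles) {
                if ("lookup.dat".equals(p.getName())) {
                    return new File(p.toString());
                }
            }
        }
        throw new IOException("lookup.dat was not localized");
    }
}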

Joe Stein
My understanding is that DistributedCache is used for read-only files, or rather files that are the same on all the nodes over a given run, e.g. a configuration file or a jar. My problem is that I am generating files during processing, which I may or may not keep, e.g. if I am taking a jpg and compressing it. Where would I place these files while I am working on them? Thanks
akintayo
How are you loading the files into HDFS, or do you already have them on S3 or in HDFS? You could use the mapper to stream each file in (so the file you want to pull from outside of HDFS would be referenced by a line in the input file that the mapper reads) and write out the compressed version to HDFS in the mapper, or write out from the job to some other store (e.g. Cassandra or MongoDB) and skip HDFS entirely. What are you doing with the files after you compress them? Is it just about storing the files in HDFS for backup and redundancy, and compressing to save space?
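By writing the compressed version out in the mapper, I mean roughly the following. This is only a sketch: it assumes the input SequenceFile holds (file name, file bytes) pairs, that gzip is an acceptable codec, and that /user/akintayo/compressed is a writable HDFS directory; all three are assumptions made up for the example.

import java.io.IOException;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CompressToHdfsMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, NullWritable, NullWritable> {

    private FileSystem fs;
    private Path outDir;

    @Override
    public void configure(JobConf job) {
        try {
            fs = FileSystem.get(job);
            // Made-up destination directory for the compressed copies.
            outDir = new Path("/user/akintayo/compressed");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(Text fileName, BytesWritable contents,
                    OutputCollector<NullWritable, NullWritable> output,
                    Reporter reporter) throws IOException {
        Path target = new Path(outDir, fileName.toString() + ".gz");
        GZIPOutputStream gz = new GZIPOutputStream(fs.create(target));
        try {
            // getBytes() can be padded, so only write getLength() bytes.
            gz.write(contents.getBytes(), 0, contents.getLength());
        } finally {
            gz.close();
        }
    }
}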
Joe Stein
The files are contained in my input sequence file; I am recreating each one and then processing it in stages. After completing the processing, I am copying the result to an output sequence file. I have to use this workflow; I am trying to figure out where I can place the files so they are available to the tasks without slowing performance. Thanks
akintayo