+3  Q: 

Hadoop MapReduce

I'm very new to MapReduce and have just completed the Hadoop wordcount example.

That example produces a file of word/count pairs that is not sorted by count. Is it possible to sort it by number of occurrences, most frequent first, by chaining another MapReduce job onto the first one?

Thanks in advance.

+1  A: 

The output from the Hadoop MapReduce wordcount example is sorted by the key. So the output should be in alphabetical order.
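One caveat, shown as a small sketch in plain Java (no Hadoop on the classpath): the wordcount keys are Text objects, which as far as I know are compared by their raw UTF-8 bytes, so "alphabetical" here means byte order and capitalised words sort before lowercase ones. The class name KeyOrderDemo and the sample words are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyOrderDemo {
    // Compare two strings by their unsigned UTF-8 bytes (an approximation of byte-order key sorting).
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(x, y);
    }

    // Return a sorted copy of the words, leaving the input list untouched.
    static List<String> sortedKeys(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(KeyOrderDemo::compareUtf8);
        return copy;
    }

    public static void main(String[] args) {
        // prints [Map, Zebra, apple, hadoop]
        System.out.println(sortedKeys(Arrays.asList("hadoop", "Zebra", "apple", "Map")));
    }
}
```

If the wordcount mapper lower-cases its tokens first, the two orders coincide for plain ASCII words.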

With Hadoop you can create your own key objects that implement the WritableComparable interface, which lets you define the compareTo method and so control the sort order.

To create output that is sorted by the number of occurrences, you would probably have to add another MapReduce job to process the output from the first, as you have said. This second job would be very simple: the sort happens in the shuffle between map and reduce, so it still needs a reduce phase, but an identity reducer is enough. You would just need to implement your own Writable key object to wrap the word and its frequency. A custom writable looks something like this:

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import org.apache.hadoop.io.WritableComparable;

 public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
       // Some data
       private int counter;
       private long timestamp;

       // Serialize the fields in a fixed order
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }

       // Deserialize the fields in the same order they were written
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }

       // Order keys by counter, ascending
       public int compareTo(MyWritableComparable other) {
         return Integer.compare(this.counter, other.counter);
       }
     }

I grabbed this example from here.

You should probably override hashCode, equals and toString as well.
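As a rough sketch of what a key wrapping the word and its frequency could look like with those overrides in place, here is a version written against plain java.io so that it compiles without Hadoop. The class name WordCountKey, its fields and the roundTrip helper are my own inventions; in a real job the class would implement WritableComparable<WordCountKey> instead of Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical key: mirrors the write/readFields/compareTo contract using only java.io.
final class WordCountKey implements Comparable<WordCountKey> {
    private String word = "";
    private int count;

    public WordCountKey() {}  // no-arg constructor, needed when keys are created for deserialization

    public WordCountKey(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    // Highest count first; ties are broken by word so the order is deterministic.
    @Override
    public int compareTo(WordCountKey other) {
        int byCount = Integer.compare(other.count, count);
        return byCount != 0 ? byCount : word.compareTo(other.word);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof WordCountKey)) return false;
        WordCountKey k = (WordCountKey) o;
        return count == k.count && word.equals(k.word);
    }

    @Override
    public int hashCode() {
        return Objects.hash(word, count);
    }

    @Override
    public String toString() {
        return word + "\t" + count;
    }

    // Serialize a key to bytes and read it back into a fresh instance.
    static WordCountKey roundTrip(WordCountKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordCountKey copy = new WordCountKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountKey a = new WordCountKey("hadoop", 7);
        WordCountKey b = new WordCountKey("map", 3);
        System.out.println(roundTrip(a).equals(a));  // prints true
        System.out.println(a.compareTo(b) < 0);      // prints true: 7 sorts before 3
    }
}
```

Keeping equals consistent with compareTo, and hashCode stable across JVMs, matters because keys are compared and partitioned on different machines.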

Binary Nerd
A: 

In Hadoop, sorting is done between the map and the reduce phases. One approach to sorting by word occurrence is to swap each pair in the mapper so that the count becomes the key, and to use a custom grouping comparator that never groups anything, so that every call to reduce receives just one key and one value.

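import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class Program {
   public static void main( String[] args) throws IOException {
      JobConf conf = new JobConf( Program.class);
      // Input/output paths and formats are omitted; the mapper below assumes
      // the first job's output is read back as (Text, IntWritable) pairs.
      conf.setOutputKeyClass( IntWritable.class);
      conf.setOutputValueClass( Text.class);
      conf.setMapperClass( Map.class);
      conf.setReducerClass( IdentityReducer.class);
      conf.setOutputValueGroupingComparator( GroupComparator.class);
      conf.setNumReduceTasks( 1);   // a single reducer gives one globally sorted file
      JobClient.runJob( conf);
   }
}

public class Map extends MapReduceBase implements Mapper<Text,IntWritable,IntWritable,Text> {

   // Swap (word, count) to (count, word) so the shuffle sorts by count
   public void map( Text key, IntWritable value, OutputCollector<IntWritable,Text> output, Reporter reporter) throws IOException {
       output.collect( value, key);
   }
}

public class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super( IntWritable.class, true);
    }

    // Never report two keys as equal, so no values are grouped together
    public int compare( WritableComparable w1, WritableComparable w2) {
        return -1;
    }
}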
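To check locally what this second pass should produce, here is a minimal sketch in plain Java that imitates the swap-and-sort on a few wordcount lines, with no cluster or Hadoop jars involved. The class SortByCountSketch and the sample data are hypothetical. It sorts largest count first, as the question asks; in the job above the IntWritable keys sort in ascending order by default, so the actual job would need a reversed sort comparator to get the same result.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class SortByCountSketch {
    // "Map" step: parse a wordcount line ("word<TAB>count") and swap it to (count, word).
    static Map.Entry<Integer, String> swap(String line) {
        String[] parts = line.split("\t");
        return new SimpleEntry<>(Integer.parseInt(parts[1].trim()), parts[0]);
    }

    // "Shuffle" step: sort the swapped pairs by count, largest first.
    // The sort is stable, so words with equal counts keep their input order.
    static List<String> sortByCount(List<String> wordCountLines) {
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        for (String line : wordCountLines) {
            pairs.add(swap(line));
        }
        pairs.sort(Map.Entry.<Integer, String>comparingByKey().reversed());

        // "Identity reduce" step: emit each pair unchanged as "count<TAB>word".
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> p : pairs) {
            out.add(p.getKey() + "\t" + p.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints hadoop (7), then map (3), then apple (2), one per line
        for (String line : sortByCount(Arrays.asList("apple\t2", "hadoop\t7", "map\t3"))) {
            System.out.println(line);
        }
    }
}
```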
Jon Snyder