I am trying out a small Hadoop setup (for experimentation) with just 2 machines. I am loading about 13 GB of data, a table of around 39 million rows, with a replication factor of 1 using Hive. My problem is that Hadoop always stores all of this data on a single datanode. Only if I change the replication factor to 2 using setrep does Hadoop copy data to the other node. I also tried the balancer ($HADOOP_HOME/bin/start-balancer.sh -threshold 0). The balancer recognizes that it needs to move around 5 GB to balance the cluster, but then reports "No block can be moved. Exiting..." and exits:
2010-07-05 08:27:54,974 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 0.0
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.252.130.177:1036
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes: 10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
No block can be moved. Exiting...
Balancing took 2.222 seconds
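For reference, these are roughly the commands I ran; the table path under the Hive warehouse is just an example from my setup:

# raise the replication factor on the already-loaded files;
# only after this does a copy of the data show up on the second datanode
$HADOOP_HOME/bin/hadoop fs -setrep -R -w 2 /user/hive/warehouse/my_table

# run the balancer with a 0% threshold
$HADOOP_HOME/bin/start-balancer.sh -threshold 0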
Can anybody suggest how to achieve an even distribution of data across the datanodes in Hadoop, without replication?