ansaurus

Question

Hadoop safemode recovery - taking lot of time

Answer 1

A:

The time spent in safe mode is usually proportional to the size of the cluster. That said, normal time is on the order of minutes at most, not hours. There are a few things to check.

Confirm all data nodes are firing up correctly. It's normal for data nodes to take a few seconds or minutes for a large number of blocks to report in. Check the data node logs to see what's happening during start up.
Ensure you have enough name node threads (dfs.namenode.handler.count in hdfs-site.xml) to be able to take care of the number of data nodes that want to check in. The default is 10 which should be fine for clusters up to 20 nodes or so. Beyond that, it may make sense to increase this. You may see retries occurring in the data node logs that would indicate this. This is what the retry messages seems to indicate to me (e.g. retry 21 times).

Hope this helps.

Eric Sammer 2010-07-11 05:23:02

ansaurus

tags:

views:

answers:

Hadoop safemode recovery - taking lot of time

related questions