I do a lot of work in the grid and HPC space, and one of the biggest challenges we have with a system distributed across hundreds (or in some cases thousands) of servers is analysing the log files.
Currently, log files are written locally to disk on each blade, but we could also publish logging information using, for example, a UDP Appender and collect it centrally.
Given that the objective is to be able to identify problems in as close to real time as possible, what should we do?
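
To make the centralised option concrete, here is a minimal sketch of what I have in mind, assuming plain-text log lines sent one per UDP datagram. It uses only the Python standard library rather than any particular logging framework's appender, and the names `start_collector`, `publish` and `wait_for` are hypothetical helpers made up for illustration. Note that UDP gives no delivery guarantee, so a real collector would have to tolerate lost or reordered lines.

```python
import socket
import socketserver
import threading
import time


class LogCollector(socketserver.BaseRequestHandler):
    """Handles one datagram: for UDP servers, self.request is (data, socket)."""

    def handle(self):
        data, _sock = self.request
        line = data.decode("utf-8", errors="replace")
        # Record the sender's address next to the line so the source blade
        # can be identified when everything is viewed in one place.
        self.server.received.append((self.client_address[0], line))


def start_collector(host="127.0.0.1", port=0):
    """Start a UDP collector in a background thread (port 0 = any free port)."""
    server = socketserver.UDPServer((host, port), LogCollector)
    server.received = []  # in-memory store, for illustration only
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def publish(line, address):
    """What each blade would do: send one log line as one datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), address)


def wait_for(server, count, timeout=2.0):
    """Poll until `count` lines have arrived or the timeout expires."""
    deadline = time.monotonic() + timeout
    while len(server.received) < count and time.monotonic() < deadline:
        time.sleep(0.01)
    return list(server.received)


if __name__ == "__main__":
    collector = start_collector()
    publish("ERROR job failed", collector.server_address)
    for source, line in wait_for(collector, 1):
        print(source, line)
    collector.shutdown()
    collector.server_close()
```

In practice the in-memory list would be replaced by whatever does the near-real-time analysis (alerting, indexing, and so on); the sketch only shows the publish-and-collect shape of the idea.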