Hi, I would like to know what methods exist to check the health of a process. Suppose 10000 processes are running on a system and we have to make sure that if any of them goes down, it is brought back up.

+2  A: 

Use the process ID (PID) to poll periodically whether the process is still alive; if it is dead, revive it.

However, if you have 10000 processes, you will probably hit the OS's process limit first. I suggest redesigning your program so you don't need that many processes in the first place.
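The polling idea above could be sketched roughly as follows. This is only a minimal illustration, assuming a POSIX system where sending signal 0 checks for a PID's existence without delivering anything; the helper names `is_alive` and `find_dead` are made up for this example.

```python
import os

def is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)          # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:   # ESRCH: no such process
        return False
    except PermissionError:      # EPERM: the process exists but belongs to another user
        return True
    return True

def find_dead(pids):
    """One polling pass over all monitored PIDs; O(n) in the number of PIDs."""
    return [pid for pid in pids if not is_alive(pid)]
```

Two caveats with this sketch: a zombie (exited but not yet reaped) still counts as existing, and a PID can be reused by an unrelated process, so a bare PID check can report a dead process as alive.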

Lie Ryan
Ryan, you are correct that we can poll the process IDs. But let's say I have PIDs from 1 to 1000000 (assuming we are allowed to create that many processes). If something happens to PID 2 while our polling is at PID 3, the poll has to go from 3 to 1000000 and wrap around to 2 before it detects that PID 2 no longer exists. That means more latency and worse performance. What if, instead of polling, I could register a callback that fires when an application goes down (event based)? Would that solution have any scalability problems?
Arpit
How is a callback going to be any faster than polling? They are both O(n) so unless the primitive used for the callback is significantly faster than that used for the PID poll there won't be much difference.
TomMD
+1  A: 

Re-spawning processes that go down is usually handled by having a dedicated launcher program exec() the program and wait for a SIGCHLD to indicate that the child process has ended.

For boot-time applications (servers, etc.), daemons like upstart can do this for you automatically.
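A launcher of this kind might look something like the sketch below. It is an illustration only: it uses a blocking `wait()` on the child rather than installing an explicit SIGCHLD handler, and the function name `supervise` and the `max_runs` cap are assumptions added so that the example terminates (a real launcher would normally loop forever).

```python
import subprocess
import time

def supervise(argv, max_runs=3, delay=0.0):
    """Start argv, block until it exits, then start it again.

    Returns the exit codes observed. max_runs is a hypothetical cap
    for demonstration purposes only.
    """
    exit_codes = []
    for _ in range(max_runs):
        child = subprocess.Popen(argv)    # spawn the child process
        exit_codes.append(child.wait())   # blocks until the child exits; no polling
        time.sleep(delay)                 # optional back-off before respawning
    return exit_codes
```

Because the parent is told by the OS when its own child exits, detection is immediate per process, at the cost of one launcher (or one launcher thread) per supervised program.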

stsquad
A: 

There is a utility called monit which does what you are looking for. But it is meant for certain important processes in Linux... all 10000 processes are important?!

thegeek
+1  A: 

While others are pointing out that applications for this already exist (which you really should use unless you have a clear reason not to), I'll throw out a random idea for a custom solution.

If you control all N processes, give them all one shared memory area N bits large (so 10000 processes need roughly 1.2KB, not bad). When starting each process, give it a number, i, ranging from 0 to N-1. Every T seconds, each process sets bit i in the shared memory to 1. A monitoring process checks every k*T seconds that all N bits are 1, resetting them all to 0 as it goes.

This is still O(n), which you won't avoid, but the primitives are all really fast and should scale fine up to the OS thread limit.

An alternate idea for obtaining i would be just to use the PID, but then the shared memory will have to be larger (probably will still be OK though; for example, the Linux PID range is small).
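As a rough sketch of the heartbeat scheme above, here is one possible shape in Python. Note one deliberate change: it uses one byte per process instead of one bit, so that two processes never do a read-modify-write on the same byte (N bytes instead of N/8). The names `beat`, `sweep`, and `demo` are made up for this example, and the demo runs in a single process just to show the mechanics; real workers would attach to the segment by name.

```python
from multiprocessing import shared_memory

def beat(buf, i):
    """Worker side: mark slot i as alive (would be called every T seconds)."""
    buf[i] = 1

def sweep(buf, n):
    """Monitor side (every k*T seconds): list slots with no heartbeat, then clear all."""
    missed = [i for i in range(n) if buf[i] == 0]
    buf[:n] = bytes(n)   # reset every slot to 0 for the next interval
    return missed

def demo(n=8):
    """Single-process walk-through: slots 0, 1 and 3 beat, the rest are reported."""
    shm = shared_memory.SharedMemory(create=True, size=n)
    try:
        shm.buf[:n] = bytes(n)       # start with all slots cleared
        for i in (0, 1, 3):
            beat(shm.buf, i)
        return sweep(shm.buf, n)
    finally:
        shm.close()
        shm.unlink()                 # remove the segment when done
```

In a real deployment each worker would open the segment with `shared_memory.SharedMemory(name=...)` and call `beat` on its own slot; the monitor would then restart whichever slots `sweep` reports.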

TomMD
Setting a bit in shared memory sounds good, but it will not work in all cases, for example with legacy Linux services.
Arpit