The work done to allow fault tolerance was started by Darius Buntinas and continued by Wesley Bland.
In order to use the fault tolerance features of MPICH, users need to enable a flag at configure time:
This will allow the MPICH implementation to correctly check the status of communicators when calling MPI operations.
Users will also need to enable a runtime flag for the Hydra process manager:
This will prevent the process manager from automatically killing all processes when any process exits abnormally.
Within the application, the minimum change needed to use any fault tolerance feature is to set the error handler of the user's communicators to at least MPI_ERRORS_RETURN. At the moment, fault tolerance is implemented only for the ch3:tcp device. The other devices will require changes before they can correctly return errors up through the stack.
Local failures are detected by Hydra, the process manager, via the usual Unix methods (a closed local socket). If a process terminates abnormally, the process manager detects it and uses SIGUSR1 to notify the MPI application of the failure. This notification is also sent to the PMI server, which broadcasts it to all other processes so they can also raise SIGUSR1.
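The signal-based notification pattern can be illustrated with plain POSIX calls. This sketch is not MPICH's internal code, and the names failure_flag, install_failure_notifier, and failure_pending are hypothetical. It only shows the general technique of receiving SIGUSR1 and recording it in a flag that is checked later. An application running under MPICH should probably not install its own SIGUSR1 handler, since that could interfere with the library's handling of the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t failure_flag = 0;

/* Handler: only record that a notification arrived. */
static void on_sigusr1(int signo)
{
    (void)signo;
    failure_flag = 1;
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on error. */
int install_failure_notifier(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 if a notification has been received, 0 otherwise. */
int failure_pending(void)
{
    return failure_flag != 0;
}
```

Keeping the handler to a single flag write avoids calling functions that are not async-signal-safe. Any real work, such as updating the set of known failed processes, would happen later in normal program flow.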