The work done to allow fault tolerance was started by Darius Buntinas and continued by Wesley Bland.
To use the fault tolerance features of MPICH, users need to enable a flag at configure time:
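As a sketch, assuming the flag meant here is MPICH's `--enable-error-checking=all` configure option (confirm with `./configure --help` for your release; the install prefix is a placeholder):

```
# Assumption: error checking is what must be enabled; the prefix is hypothetical.
./configure --enable-error-checking=all --prefix=/path/to/mpich-install
make
make install
```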
This will allow the MPICH implementation to correctly check the status of communicators when calling MPI operations.
Users will also need to enable a runtime flag for the Hydra process manager:
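As a sketch, assuming the flag meant here is Hydra's `-disable-auto-cleanup` option to `mpiexec` (confirm with `mpiexec -help`; the process count and the binary name `./a.out` are placeholders):

```
# Assumption: -disable-auto-cleanup is the runtime flag; 4 and ./a.out are hypothetical.
mpiexec -n 4 -disable-auto-cleanup ./a.out
```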
This will prevent the process manager from automatically killing all processes when any process exits abnormally.
Within the application, the minimum change needed to use any fault tolerance feature is to set the error handler of the user's communicators to at least MPI_ERRORS_RETURN, replacing the default MPI_ERRORS_ARE_FATAL. At the moment, fault tolerance is implemented only for the ch3:tcp device. The other devices will require some changes to return errors correctly up through the stack.