Fault Tolerance

Revision as of 22:01, 25 February 2015 by Wbland (Device)

Fault tolerance support in MPICH was started by Darius Buntinas and continued by Wesley Bland.


In order to use the fault tolerance features of MPICH, users need to enable a flag at configure time:


This will allow the MPICH implementation to correctly check the status of communicators when calling MPI operations.

Users will also need to enable a runtime flag for the Hydra process manager:


This will prevent the process manager from automatically killing all processes when any process exits abnormally.
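
Putting the two steps together, a build and launch might look like the sketch below. It assumes the configure flag is `--enable-error-checking=all` and the Hydra flag is `-disable-auto-cleanup`. Both options exist in MPICH and Hydra, but their use here is an assumption rather than something stated above. The install prefix, the process count, and `./my_app` are placeholders.

```shell
# Assumed configure flag: enable full error checking in the MPICH build
./configure --enable-error-checking=all --prefix="$HOME/mpich-ft"
make && make install

# Assumed Hydra flag: do not kill the surviving processes when one exits abnormally
"$HOME/mpich-ft/bin/mpiexec" -n 4 -disable-auto-cleanup ./my_app
```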

Within the application, the minimum change required to use any fault tolerance feature is to replace the default error handler (MPI_ERRORS_ARE_FATAL) on the user's communicators with MPI_ERRORS_RETURN or a custom error handler, so that failures are reported to the caller instead of aborting the job. At the moment, fault tolerance is only implemented for the ch3:tcp device. The other devices will require some changes in order to correctly return errors up through the stack.


Error Reporting


Local failures are detected by Hydra, the process manager, via the usual Unix methods (a closed local socket). If a process terminates abnormally, the process manager detects it and uses SIGUSR1 to notify the MPI application of the failure. This notification is also sent to the PMI server, which broadcasts it to all other processes so that they can also raise SIGUSR1.
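
The signal-based notification can be illustrated with plain POSIX calls. The sketch below is not MPICH's internal code. It only shows the general pattern of recording a SIGUSR1 in a flag and handling it later outside the handler. All names in it are hypothetical, and an application running under MPICH should not assume it can install its own SIGUSR1 handler without interfering with the library.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t failure_pending = 0;

static void on_sigusr1(int signo)
{
    (void)signo;
    failure_pending = 1;   /* only record the event; do the real work later */
}

/* Hypothetical helper: install the SIGUSR1 handler. Returns 0 on success. */
int install_failure_notifier(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical helper: returns 1 if a notification arrived since the
 * last call and clears the flag, otherwise returns 0. */
int consume_failure_notification(void)
{
    if (!failure_pending)
        return 0;
    failure_pending = 0;
    return 1;
}
```

Keeping the handler down to a single flag write avoids calling functions that are not async-signal-safe. The main progress loop can then poll `consume_failure_notification()` and ask the process manager which process failed.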


As mentioned previously, TCP is currently the only netmod that supports fault tolerance. It detects a failure when a socket is closed unexpectedly. When that happens, the netmod calls the TCP cleanup function (MPID_nem_tcp_cleanup_on_error) and returns an error (MPIX_ERR_PROC_FAILED) via the usual MPICH error handling methods.
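
To make "a socket is closed unexpectedly" concrete, the sketch below classifies the result of a single `read()` on a connected socket. It is an illustration using plain POSIX calls and is not taken from the TCP netmod. The `peer_status` enum and `poll_peer` function are hypothetical names. A real netmod would follow a PEER_FAILED result with its cleanup routine and an error return.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

enum peer_status { PEER_OK, PEER_WOULD_BLOCK, PEER_FAILED };

/* Hypothetical helper: read once from `fd` and classify the outcome.
 * The raw read() result is stored in *nread. */
enum peer_status poll_peer(int fd, void *buf, size_t len, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);
    *nread = n;

    if (n > 0)
        return PEER_OK;            /* data arrived normally */
    if (n == 0)
        return PEER_FAILED;        /* EOF: the peer closed the connection */
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
        return PEER_WOULD_BLOCK;   /* transient condition, try again later */
    return PEER_FAILED;            /* hard error such as ECONNRESET */
}
```

Treating end-of-file as a failure is only appropriate while the connection is still expected to be open. An orderly shutdown at finalize time would need to be distinguished from this case.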

TODO: Add more details here.