Fault Tolerance

From Mpich
Revision as of 21:51, 25 February 2015 by Wbland

The work done to allow fault tolerance was started by Darius Buntinas and continued by Wesley Bland.

Usage

In order to use the fault tolerance features of MPICH, users need to enable a flag at configure time:

   --enable-error-checking=all

This will allow the MPICH implementation to correctly check the status of communicators when calling MPI operations.

Users will also need to pass a runtime flag to the Hydra process manager (mpiexec):

   --disable-auto-cleanup

This will prevent the process manager from automatically killing all processes when any process exits abnormally.

Within the application, the minimum change needed to use any fault tolerance feature is to replace the default error handler (MPI_ERRORS_ARE_FATAL) on the user's communicators with MPI_ERRORS_RETURN or a user-defined error handler. At the moment, fault tolerance is implemented only for the ch3:tcp device. The other devices will require some changes in order to correctly return errors up through the stack.

Implementation

Error Reporting

Hydra

Local failures are detected by Hydra, the process manager, via the usual Unix methods (closed local socket). If a process terminates abnormally, the process manager detects it and uses SIGUSR1 to notify the MPI application of the failure. This notification is also sent to the PMI server, which broadcasts it to all other processes so they can also raise SIGUSR1.
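The signal-based notification pattern can be illustrated in plain POSIX C. The sketch below is not MPICH's internal handler. It only shows how a process can record an asynchronous SIGUSR1 notice and act on it later from normal code. All names (failure_pending, install_failure_notice_handler, consume_failure_notice) are hypothetical, and installing your own SIGUSR1 handler inside a real MPICH job may interfere with the library's own handling.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag: set asynchronously when SIGUSR1 arrives. */
static volatile sig_atomic_t failure_pending = 0;

/* The handler does only async-signal-safe work: it records the notice. */
static void on_failure_notice(int signo) {
    (void)signo;
    failure_pending = 1;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on error,
 * mirroring the return value of sigaction(). */
int install_failure_notice_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_failure_notice;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Poll and clear: returns 1 if at least one notice arrived since the
 * previous call, 0 otherwise. Multiple notices are coalesced into one. */
int consume_failure_notice(void) {
    if (failure_pending) {
        failure_pending = 0;
        return 1;
    }
    return 0;
}
```

Keeping the handler down to a single flag write is a deliberate choice: the signal can arrive at any point, so the actual failure processing is deferred to code that runs outside signal context.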

Device

API

MPI_COMM_FAILURE_ACK / MPI_COMM_FAILURE_GET_ACKED

MPI_COMM_SHRINK

MPI_COMM_AGREE

MPI_COMM_REVOKE