The work done to allow fault tolerance was started by Darius Buntinas and continued by Wesley Bland.
In order to use the fault tolerance features of MPICH, users need to enable a flag at configure time:
This will allow the MPICH implementation to correctly check the status of communicators when calling MPI operations.
Users will also need to enable a runtime flag for the Hydra process manager:
This will prevent the process manager from automatically killing all processes when any process exits abnormally.
Within the application, the most basic code required to take advantage of any fault tolerance features is to change the error handler of the user's communicators to at least MPI_ERRORS_RETURN. At the moment, fault tolerance is only implemented for the ch3:tcp device. The other devices will require some changes in order to correctly return errors up through the stack.
Local failures are detected by Hydra, the process manager via the usual Unix methods (closed local socket). If a process terminates abnormally, it is detected by the process manager in and SIGUSR1 is used to notify the MPI application of the failure. This notification is also sent to the PMI server to be broadcast to all other processes so they can also raise SIGUSR1.
As mentioned previously, TCP is currently the only netmod that supports fault tolerance. It is done by detecting that a socket is closed unexpectedly. When that happens, the netmod calls the tcp cleanup function (MPID_nem_tcp_cleanup_on_error) and returns an error (MPIX_ERR_PROC_FAILED) via the usual MPICH error handling methods.
TODO: Add more details here.
When a failure is detected and the signal SIGUSR1 is raised, it is caught by the CH3 progress engine. This is done by comparing the number of known SIGUSR1 signals to the current number. If it has changed, the progress engine calls into the cleanup functions. The first function is MPIDI_CH3U_Check_for_failed_procs. In this function, CH3 gets the most recent list of failed processes from the PMI server. Currently, value is replicated on all PMI instances, but in the future, it could do something more scalable and do a reduction to generate this value with local knowledge.
If some new failures are detected, they are handled by a series of calls, first MPIDI_CH3I_Comm_handle_failed_procs. This function loops over each communicator to determine if it contains a failed process. If it does, it is marked as not anysource_enabled. This means that future recv (or non-blocking receive) calls will probably return with a failure indicating that they could not complete. However, they will be given one chance through the progress engine to match so it's possible that they might still finish. Also during that function, the progress engine will be signalled to indicate that a request has completed. This will allow the progress engine to return control back to the MPI level where a decision can be made about whether to return an error or return to the progress engine to continue waiting for completion.
After cleaning up the communicator objects, the next function is terminate_failed_VCs. This function closes all of the VC connections for failed processes. This will also clean out the nemesis send queue to remove all messages to the failed processes.
Deep in the call stack, the function MPIDI_CH3U_Complete_posted_with_error is called which cleans up the posted receive queue. This is part of the process of cleaning up failed VCs so only requests which match the failed VCs are removed (when on is removed, it's printed to the log to make it easy to track).