Difference between revisions of "Checkpointing"

From Mpich

This page describes how to use the checkpointing capability of MPICH.

BLCR is no longer supported in the latest MPICH. The instructions below apply only to older MPICH versions that still include BLCR support.

== Configuration ==
 
 
First, you need to have BLCR version 0.8.2 or later installed on your machine. If it is installed in the default system location, add the following two options to your configure command:

  --enable-checkpointing
  --with-hydra-ckpointlib=blcr
 
 
 
If BLCR is not installed in the default system location, you'll need to tell MPICH's configure where to find it. You might also need to set the LD_LIBRARY_PATH environment variable so that BLCR's shared libraries can be found. In this case, add the following options to your configure command:

  --enable-checkpointing
  --with-hydra-ckpointlib=blcr
  --with-blcr=<BLCR_INSTALL_DIR>
  LD_LIBRARY_PATH=<BLCR_INSTALL_DIR>/lib

where <BLCR_INSTALL_DIR> is the directory where BLCR has been installed (whatever was specified in --prefix when BLCR was configured).
 
 
 
Once MPICH is configured, compile and install it as usual (e.g., make; make install).

Note that checkpointing is only supported with the Hydra process manager.
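
As a rough sketch of the configuration steps above, the following script assembles the configure arguments for both cases (BLCR in the default location or in a custom prefix). The variable BLCR_INSTALL_DIR and the helper blcr_configure_args are illustrative names, not part of MPICH. The script only prints the resulting commands; you would run them from the MPICH source directory of a release that still supports BLCR.

```shell
#!/bin/sh
# Sketch: print a configure command line for a BLCR-enabled MPICH build.
# BLCR_INSTALL_DIR is a hypothetical variable; leave it empty if BLCR is
# installed in the default system location.

blcr_configure_args() {
    # $1: BLCR install prefix (may be empty)
    args="--enable-checkpointing --with-hydra-ckpointlib=blcr"
    if [ -n "${1:-}" ]; then
        # Non-default location: point configure at BLCR and its libraries.
        args="$args --with-blcr=$1 LD_LIBRARY_PATH=$1/lib"
    fi
    printf '%s\n' "$args"
}

# Dry run: show the commands instead of executing them.
echo "./configure $(blcr_configure_args "${BLCR_INSTALL_DIR:-}")"
echo "make && make install"
```

Printing the commands first makes it easy to confirm the BLCR prefix before starting a long build.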
 
 
 
 
 
== Verifying Checkpointing Support ==

Make sure MPICH is correctly configured with BLCR. You can do this using:

  mpiexec -info

This should display 'BLCR' under 'Checkpointing libraries available'.
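
The check can also be scripted. The sketch below is one possible way to do it. It assumes that the library name appears on the line containing "Checkpointing libraries available" or on the line right after it, and that your grep supports the -A option (GNU and BSD grep do). The function name has_blcr_support is hypothetical.

```shell
#!/bin/sh
# Sketch: check whether `mpiexec -info` output lists BLCR.
# The function reads the output on stdin, so it can also be tried on
# saved text. The match is case-insensitive because the exact
# capitalization of the library name in the output is an assumption.

has_blcr_support() {
    grep -i -A1 'Checkpointing libraries available' | grep -qi 'blcr'
}

# Usage (requires mpiexec in PATH):
#   mpiexec -info | has_blcr_support && echo "BLCR checkpointing available"
```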
 
 
 
 
 
== Checkpointing the Application ==

There are two ways to cause the application to checkpoint. You can ask mpiexec to checkpoint the application periodically using the mpiexec option -ckpoint-interval (in seconds):

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
      -ckpoint-interval 3600 -f hosts -n 4 ./app

Alternatively, you can force a checkpoint manually by sending a SIGUSR1 signal to mpiexec.
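
A manual checkpoint request could look like the following sketch. The helper name request_checkpoint is hypothetical. The only MPICH-specific behavior it relies on is the SIGUSR1 handling described above, and it assumes you know the PID of the mpiexec process.

```shell
#!/bin/sh
# Sketch: ask a running mpiexec to checkpoint by sending it SIGUSR1.
# request_checkpoint is a hypothetical helper name.

request_checkpoint() {
    # $1: PID of the mpiexec process that launched the application
    if [ -z "${1:-}" ]; then
        echo "usage: request_checkpoint <mpiexec-pid>" >&2
        return 2
    fi
    kill -USR1 "$1"
}

# Usage, assuming mpiexec was started in the background from this shell:
#   mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 ./app &
#   request_checkpoint $!
```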
 
 
 
The checkpoint/restart parameters can also be controlled with the environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and HYDRA_CKPOINT_INTERVAL.
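
As an illustration, the periodic-checkpoint example could be expressed through the environment as sketched below. It assumes that each variable corresponds to the similarly named mpiexec option, which the variable names suggest but this page does not spell out.

```shell
#!/bin/sh
# Sketch: set the checkpoint parameters through the environment instead of
# mpiexec options. The values mirror the periodic-checkpoint example.

export HYDRA_CKPOINTLIB=blcr
export HYDRA_CKPOINT_PREFIX=/tmp/app.ckpoint
export HYDRA_CKPOINT_INTERVAL=3600   # seconds

# With these exported, the launch line would presumably no longer need the
# corresponding -ckpointlib, -ckpoint-prefix and -ckpoint-interval options:
#   mpiexec -f hosts -n 4 ./app
```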
 
 
 
To restart the application from a checkpoint:

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 -ckpoint-num <N>

where <N> is the number of the checkpoint you want to restart from.
 
 
 
 
 
----
 
  
 
Notes on the implementation of checkpointing on Nemesis can be found [[Checkpointing_implementation|here]].
 

Revision as of 01:04, 26 June 2016
