Line 1:
This page describes how to use the checkpointing capability of MPICH.

− == Configuration ==
+ BLCR is no longer supported in the latest MPICH.
−
− First, you need to have BLCR version 0.8.2 or later installed on your
− machine. If it's installed in the default system location, add the
− following two options to your configure command:
−
−   --enable-checkpointing
−   --with-hydra-ckpointlib=blcr
−
− If BLCR is not installed in the default system location, you'll need
− to tell MPICH's configure where to find it. You might also need to
− set the LD_LIBRARY_PATH environment variable so that BLCR's shared
− libraries can be found. In this case, add the following options to
− your configure command:
−
−   --enable-checkpointing
−   --with-hydra-ckpointlib=blcr
−   --with-blcr=<BLCR_INSTALL_DIR>
−   LD_LIBRARY_PATH=<BLCR_INSTALL_DIR>/lib
−
− where <BLCR_INSTALL_DIR> is the directory where BLCR has been
− installed (whatever was specified in --prefix when BLCR was
− configured).
−
− After it's configured, compile as usual (e.g., make; make install).
−
− Note that checkpointing is supported only with the Hydra process manager.
−
−
− == Verifying Checkpointing Support ==
−
− Make sure MPICH is correctly configured with BLCR. You can do this
− using:
−
−   mpiexec -info
−
− This should display 'BLCR' under 'Checkpointing libraries available'.
−
−
− == Checkpointing the Application ==
−
− There are two ways to cause the application to checkpoint. You can ask
− mpiexec to periodically checkpoint the application using the mpiexec
− option -ckpoint-interval (seconds):
−
−   mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
−       -ckpoint-interval 3600 -f hosts -n 4 ./app
−
− Alternatively, you can force a checkpoint manually by sending a
− SIGUSR1 signal to mpiexec.
−
− The checkpoint/restart parameters can also be controlled with the
− environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and
− HYDRA_CKPOINT_INTERVAL.
−
− To restart a process:
−
−   mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 -ckpoint-num <N>
−
− where <N> is the checkpoint number you want to restart from.
−
−
− ----

Notes on the implementation of checkpointing on Nemesis can be found [[Checkpointing_implementation|here]].
This page describes how to use the checkpointing capability of MPICH.
BLCR is no longer supported in the latest MPICH.
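
To find out whether an existing installation is an older build that still includes BLCR support, the output of mpiexec -info can be inspected. The following is a minimal sketch, not an official tool: the helper name has_blcr is hypothetical, and the script assumes that the 'Checkpointing libraries available' label and the library name appear on the same line of the output, which may not hold for every Hydra version.

```shell
#!/bin/sh
# has_blcr (hypothetical helper): reads `mpiexec -info` output on stdin and
# succeeds only if a "Checkpointing libraries available" line mentions BLCR.
# Assumption: the label and the library name are printed on one line.
has_blcr() {
    grep -i 'checkpointing libraries available' | grep -qi 'blcr'
}

if command -v mpiexec >/dev/null 2>&1; then
    if mpiexec -info 2>/dev/null | has_blcr; then
        echo "BLCR checkpointing is available in this build"
    else
        echo "BLCR checkpointing is not available in this build"
    fi
else
    echo "mpiexec not found in PATH"
fi
```

On a recent MPICH release the second message is the expected result, since BLCR support has been removed.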