Difference between revisions of "Checkpointing"


Revision as of 03:04, 4 January 2011

This page describes how to use the checkpointing capability of MPICH2.

Configuration

First, you need to have BLCR version 0.8.2 or later installed on your machine. If it's installed in the default system location, add the following two options to your configure command:

 --enable-checkpointing
 --with-hydra-ckpointlib=blcr

If BLCR is not installed in the default system location, you'll need to tell MPICH2's configure where to find it. You might also need to set the LD_LIBRARY_PATH environment variable so that BLCR's shared libraries can be found. In this case add the following options to your configure command:

 --enable-checkpointing
 --with-hydra-ckpointlib=blcr
 --with-blcr=<BLCR_INSTALL_DIR>
 LD_LIBRARY_PATH=<BLCR_INSTALL_DIR>/lib

where <BLCR_INSTALL_DIR> is the directory where BLCR has been installed (whatever was specified in --prefix when BLCR was configured).
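
As a rough illustration, the options above can be assembled into a single configure invocation. This is only a sketch: /opt/blcr is a made-up install prefix, and the variable CONFIGURE_ARGS is just a convenience for this example, not something MPICH2 reads.

```shell
# Hypothetical BLCR install prefix; substitute whatever was passed to
# --prefix when BLCR was configured.
BLCR_INSTALL_DIR=/opt/blcr

# Collect the four options listed above into one string.
CONFIGURE_ARGS="--enable-checkpointing --with-hydra-ckpointlib=blcr --with-blcr=$BLCR_INSTALL_DIR LD_LIBRARY_PATH=$BLCR_INSTALL_DIR/lib"

# Print the resulting command for review. To actually configure, run it
# from the top of the MPICH2 source tree.
echo "./configure $CONFIGURE_ARGS"
```

Printing the command first makes it easy to check the BLCR paths before starting a build.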

After configuring, compile and install as usual (e.g., make; make install).

Note that checkpointing is only supported with the Hydra process manager.


Verifying Checkpointing Support

Make sure MPICH2 is correctly configured with BLCR. You can do this using:

 mpiexec -info

This should display 'BLCR' under 'Checkpointing libraries available'.
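
If you want to script this check, something like the following could work. It is a sketch under assumptions: has_blcr is a hypothetical helper name, the exact layout of the mpiexec -info output is not specified here, so the match is case-insensitive and looks at the heading line plus the line after it, and the -A option of grep is a common extension rather than strict POSIX.

```shell
# Hypothetical helper: succeeds if the text on stdin lists BLCR on, or
# directly below, the 'Checkpointing libraries available' line.
has_blcr() {
    grep -i -A1 'checkpointing libraries available' | grep -qi 'blcr'
}

# Usage sketch (commented out because it needs an MPICH2 install on PATH):
#   mpiexec -info | has_blcr && echo "BLCR checkpointing is available"
```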


Checkpointing the Application

There are two ways to cause the application to checkpoint. You can ask mpiexec to periodically checkpoint the application using the mpiexec option -ckpoint-interval (seconds):

 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
     -ckpoint-interval 3600 -f hosts -n 4 ./app

Alternatively, you can force a checkpoint manually by sending a SIGUSR1 signal to mpiexec.
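
Sending the signal from a shell might look like the sketch below. The helper name force_checkpoint is invented for this example; it only wraps the standard kill utility and assumes you know the process ID of the running mpiexec.

```shell
# Hypothetical helper: send SIGUSR1 to the mpiexec process whose PID is
# given as the first argument.
force_checkpoint() {
    kill -USR1 "$1"
}

# Usage sketch (commented out because it needs MPICH2 and a hosts file):
#   mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 ./app &
#   force_checkpoint "$!"
```

Here $! is the shell's record of the most recently started background job, so it refers to mpiexec only if nothing else was backgrounded in between.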

The checkpoint/restart parameters can also be controlled with the environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and HYDRA_CKPOINT_INTERVAL.
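
For example, the periodic-checkpoint command shown earlier could presumably be expressed with environment variables instead. This sketch assumes each variable corresponds to the similarly named mpiexec option and reuses the values from that command.

```shell
# Assumed to mirror -ckpointlib, -ckpoint-prefix and -ckpoint-interval.
export HYDRA_CKPOINTLIB=blcr
export HYDRA_CKPOINT_PREFIX=/tmp/app.ckpoint
export HYDRA_CKPOINT_INTERVAL=3600   # seconds

# With these set, the launch line would shrink to (commented out because
# it needs MPICH2 and a hosts file):
#   mpiexec -f hosts -n 4 ./app
```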

To restart the application from a checkpoint:

 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 -ckpoint-num <N>

where <N> is the checkpoint number you want to restart from.



Notes on the implementation of checkpointing on Nemesis can be found here.