= Using the Hydra Process Manager =

__TOC__

'''This wiki page only provides information on the external usage of Hydra. If you are looking for the internal workings of Hydra, you can find it [[Hydra_Process_Management_Framework|here]].'''
== General ==

Hydra is a process management system for starting parallel jobs. Hydra is designed to natively work with multiple daemons such as ssh, rsh, pbs, slurm and sge.

== Getting Help ==

All of the runtime options can be listed on the command line using the flag:

<pre>
mpiexec -help
</pre>

Further information about any specific flag can be obtained by adding the option before the help flag:

<pre>
mpiexec -<option> -help
</pre>
== Quick Start ==

Once built and installed, the Hydra executables are in the bin subdirectory of the install directory. You should put this (bin) directory in your PATH in your .cshrc or .bashrc for convenience:

<pre>
Put in .cshrc:  setenv PATH /home/you/mpich/bin:$PATH

Put in .bashrc: export PATH=/home/you/mpich/bin:$PATH
</pre>
  
To compile your application use mpicc:

<pre>
shell$ mpicc app.c -o app
</pre>

Create a file with the names of the machines that you want to run your job on. This file may or may not include the local machine.

<pre>
shell$ cat hosts
   donner
   foo
   shakey
   terra
</pre>

To run your application on these nodes, use mpiexec:

<pre>
shell$ mpiexec -f hosts -n 4 ./app
</pre>
If the same application is run with 8 processes, it will use the hosts in a round-robin manner. For instance, with the above host file, ranks 0 and 4 will execute on donner, ranks 1 and 5 on foo, ranks 2 and 6 on shakey, and ranks 3 and 7 on terra.

The host file can also be specified as follows:

<pre>
shell$ cat hosts
   donner:2
   foo:2
   shakey:2
   terra:2
</pre>

In this case, ranks 0-1 will execute on donner, ranks 2-3 on foo, ranks 4-5 on shakey and ranks 6-7 on terra.

A more complex host file layout can be:

<pre>
shell$ cat hosts
   donner:2
   foo:3
   shakey:2
</pre>

In this case, the first 2 processes are scheduled on "donner", the next 3 on "foo" and the last 2 on "shakey". Comments in the host file start with a "#" character:

<pre>
shell$ cat hosts
   # This is a sample host file
   donner:2     # The first 2 procs are scheduled to run here
   foo:3        # The next 3 procs run on this host
   shakey:2     # The last 2 procs run on this host
</pre>
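The round-robin placement described above can be sketched with a small shell loop. This only prints the mapping Hydra would use for the 4-host example file (one slot per host); it does not launch anything:

```shell
# Print the host each rank lands on when 8 processes are launched
# round-robin over the example 4-host file (one process slot per host).
hosts="donner foo shakey terra"
nhosts=4
for rank in 0 1 2 3 4 5 6 7; do
  set -- $hosts              # split the host list into positional parameters
  shift $((rank % nhosts))   # rotate to this rank's host
  echo "rank $rank -> $1"
done
# Prints "rank 0 -> donner" ... "rank 4 -> donner" ... "rank 7 -> terra"
```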
You can spawn multiple different executables using:

<pre>
shell$ mpiexec -f hosts -n 4 ./app1 : -n 4 ./app2
</pre>

The first four ranks will run "app1" and the next four ranks will run "app2".
== Environment Settings ==

By default, Hydra inherits environment variables from the shell on which mpiexec is launched, and passes them to the launched processes. However, each launcher disables the automatic propagation of some environment variables. For example, the SSH launcher disables the propagation of "DISPLAY", the SLURM launcher disables the propagation of "SLURM_*" variables, etc.

Users can force an environment variable to be propagated using the -genv option.

Environment variables can also be used to control several settings in Hydra:

'''HYDRA_HOST_FILE:''' This variable points to the default host file to use when the "-f" option is not provided to mpiexec.

<pre>
  For bash:
    export HYDRA_HOST_FILE=<path_to_host_file>/hosts

  For csh/tcsh:
    setenv HYDRA_HOST_FILE <path_to_host_file>/hosts
</pre>

'''HYDRA_DEBUG:''' Setting this to "1" enables debug mode; set it to "0" to disable.

'''HYDRA_ENV:''' Setting this to "all" will pass all of the launching node's environment to the application processes. By default, if nothing is set, the launching node's environment is passed to the executables, as long as it does not overwrite any of the environment variables that have been preset by the remote shell.

'''HYDRA_LAUNCHER_EXTRA_ARGS:''' The value of this environment variable is appended to the launcher command line. This can be used, for example, to pass the "-q" argument to the launcher by setting HYDRA_LAUNCHER_EXTRA_ARGS=-q.

'''MPIEXEC_TIMEOUT:''' The value of this environment variable is the maximum number of seconds this job will be permitted to run. When time is up, the job is aborted.

'''MPIEXEC_PORT_RANGE:''' If this environment variable is defined, Hydra will restrict its usage of ports for connecting its various processes to ports in this range. If this variable is not assigned, but '''MPICH_PORT_RANGE''' is assigned, then it will use the range specified by '''MPICH_PORT_RANGE''' for its ports. Otherwise, it will use whatever ports are assigned to it by the system. Port ranges are given as a pair of integers separated by a colon.

'''HYDRA_PROXY_RETRY_COUNT:''' The value of this environment variable determines the number of retries a proxy makes to establish a connection to the main server.
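As a quick illustration of the colon-separated port range format (the range 10000:10100 below is an arbitrary example, not a recommended value):

```shell
# Restrict Hydra's ports to 10000-10100; the format is two integers
# separated by a colon.
export MPIEXEC_PORT_RANGE=10000:10100
echo "$MPIEXEC_PORT_RANGE"
# Prints: 10000:10100
```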
  
  
 
== Hydra with Non-Ethernet Networks ==

There are two ways to use Hydra with TCP/IP over a non-default network.

The first way is to use the -iface option to mpiexec to specify which network interface to use. For example, if your Myrinet network's IP emulation is configured on myri0, you can use:

<pre>
shell$ mpiexec -f hostfile -iface myri0 -n 4 ./app1
</pre>

Similarly, if your InfiniBand network's IP emulation is configured on ib0, you can use:

<pre>
shell$ mpiexec -f hostfile -iface ib0 -n 4 ./app1
</pre>

You can also control this using the '''HYDRA_IFACE''' environment variable.
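For example, setting the variable in the launching shell is equivalent to passing -iface on every mpiexec invocation ("ib0" here is an example interface name; substitute one that exists on your nodes):

```shell
# Equivalent to passing "-iface ib0" to each mpiexec invocation.
# "ib0" is an example; use an interface present on your system.
export HYDRA_IFACE=ib0
echo "$HYDRA_IFACE"
# Prints: ib0
```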
The second way is to specify the appropriate IP addresses in your hostfile:
  
 
<pre>
shell$ /sbin/ifconfig

eth0      Link encap:Ethernet  HWaddr 00:14:5E:57:C4:FA
          inet addr:192.148.3.182  Bcast:192.148.248.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fe57:c4fa/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

myri0     Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8
          inet addr:10.21.3.182  Bcast:10.21.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

ib0       Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8
</pre>

In this example, to run over the Myrinet network you would list the 10.21.* addresses of your nodes in the hostfile rather than their Ethernet addresses.
  
== Resource Managers and Launchers ==

Hydra natively interacts with a number of resource managers and launchers.

Resource managers provide information about the resources allocated by the user (e.g., slurm, pbs, loadleveler, lsf, sge, cobalt). Launchers allow mpiexec to launch processes on the system (e.g., ssh, rsh, fork, slurm, pbs, loadleveler, lsf, sge). Some tools act as both resource managers and launchers, while others play just one role.

'''In most cases, you don't need to do anything special for Hydra to work with your slurm, pbs, loadleveler, lsf, sge or cobalt environment. It will automatically detect these resource managers and interact with them. The following command will be sufficient to launch processes on all the allocated cores:'''

<pre>
shell$ mpiexec ./app
</pre>

This allows Hydra to detect the available resource manager and launcher, query the resources allocated by the resource manager, and launch the appropriate number of processes.

However, if you do not want Hydra to auto-detect the launcher, you can customize it using the -launcher flag to mpiexec or the '''HYDRA_LAUNCHER''' environment variable.
<pre>
shell$ mpiexec -launcher ssh -f hosts -n 4 ./app

  (or)

shell$ mpiexec -launcher fork -f hosts -n 4 ./app
</pre>

The executable to use as the launcher can also be customized using the option -launcher-exec or the '''HYDRA_LAUNCHER_EXEC''' environment variable:

<pre>
  $ mpiexec -launcher ssh -launcher-exec /usr/bin/ssh -f hosts -n 4 ./app
</pre>
  
Similarly, if you do not want Hydra to auto-detect the resource manager, you can customize it using the -rmk flag to mpiexec or the '''HYDRA_RMK''' environment variable.

<pre>
shell$ mpiexec -rmk pbs ./app
</pre>
  
Finally, if you do not want Hydra to auto-detect the number of processes or the machines allocated, you can customize these using the -n and -f flags:

<pre>
shell$ mpiexec -rmk pbs -n 4 -f ~/hosts ./app
</pre>
  
== Process-core Binding ==

On supported platforms, Hydra automatically configures available process-core binding capability using [http://www.open-mpi.org/projects/hwloc/ hwloc]. More information on how to use these options can be found in the help text:

<pre>
  shell$ mpiexec -bind-to -help
</pre>

* '''-bind-to <object[:num]>''' - Specify the hardware element to bind each process to.
* '''-map-by <object[:num]>''' - Skip over this many elements between bindings. If not specified, defaults to the same value as -bind-to.

Binding options can also be controlled with the environment variables '''HYDRA_TOPOLIB''', '''HYDRA_BINDING''', '''HYDRA_MAPPING''', and '''HYDRA_MEMBIND'''.
  
=== Examples ===

For the purposes of the process binding/mapping examples, we will assume the following machine topology:

[[File:Topo4.png]]

Setting the environment variable '''HYDRA_TOPO_DEBUG=1''' will force Hydra to print out bindings without actually calling the binding API. This is useful for testing binding options using hwloc's arbitrary topology loading feature, as demonstrated below.
 
  
To bind processes to single hardware threads:

  shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to hwthread /bin/true | sort -k 2 -n
  process 0 binding: 10000000
  process 1 binding: 00100000
  process 2 binding: 00001000
  process 3 binding: 00000010
  process 4 binding: 01000000
  process 5 binding: 00010000
  process 6 binding: 00000100
  process 7 binding: 00000001
  
To bind processes to 2 cores at a time:

  shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to core:2 /bin/true | sort -k 2 -n
  process 0 binding: 10100000
  process 1 binding: 00001010
  process 2 binding: 01010000
  process 3 binding: 00000101
  process 4 binding: 10100000
  process 5 binding: 00001010
  process 6 binding: 01010000
  process 7 binding: 00000101
  
A common use-case is to bind processes to a socket, and map by hwthread to group sequential ranks together:

  shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to socket -map-by hwthread /bin/true | sort -k 2 -n
  process 0 binding: 10101010
  process 1 binding: 10101010
  process 2 binding: 10101010
  process 3 binding: 10101010
  process 4 binding: 01010101
  process 5 binding: 01010101
  process 6 binding: 01010101
  process 7 binding: 01010101
  
Also included in the MPICH source is a program for printing out the affinity of a process ([https://github.com/pmodels/mpich/blob/main/src/pm/hydra/examples/print_cpus_allowed.c src/pm/hydra/examples/print_cpus_allowed.c]) according to the OS. This can be used on Linux systems to test that bindings are working correctly:

  shell$ mpiexec -n 8 -bind-to socket ./print_cpus_allowed | sort
  crush[0]: Cpus_allowed_list:    0,2,4,6
  crush[1]: Cpus_allowed_list:    1,3,5,7
  crush[2]: Cpus_allowed_list:    0,2,4,6
  crush[3]: Cpus_allowed_list:    1,3,5,7
  crush[4]: Cpus_allowed_list:    0,2,4,6
  crush[5]: Cpus_allowed_list:    1,3,5,7
  crush[6]: Cpus_allowed_list:    0,2,4,6
  crush[7]: Cpus_allowed_list:    1,3,5,7
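Outside of MPI, the same per-process information can be read directly from procfs on Linux; this is the field the example program reports:

```shell
# Print this shell's own allowed-CPU list (Linux-specific; the
# Cpus_allowed_list field of /proc/self/status).
grep Cpus_allowed_list /proc/self/status
```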
  
== X Forwarding ==

X forwarding is specific to each launcher. Some launchers enable it by default, while others don't. For ssh, it is disabled by default. To force-enable it, use the option -enable-x to mpiexec:

<pre>
shell$ mpiexec -enable-x -f hosts -n 4 ./app
</pre>
== Checkpoint/Restart Support ==

Hydra provides checkpoint/restart capability. Currently, only BLCR is supported. To use checkpointing, include the -ckpointlib option for mpiexec to specify the checkpointing library to use, and -ckpoint-prefix to specify the directory where the checkpoint images should be written:

<pre>
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -f hosts -n 4 ./app
</pre>

While the application is running, the user can request a checkpoint at any time by sending a SIGUSR1 signal to mpiexec.

You can also automatically checkpoint the application at regular intervals using the mpiexec option -ckpoint-interval to specify the number of seconds between checkpoints:

<pre>
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-interval 3600 -f hosts -n 4 ./app
</pre>

The checkpoint/restart parameters can be controlled with the environment variables '''HYDRA_CKPOINTLIB''', '''HYDRA_CKPOINT_PREFIX''' and '''HYDRA_CKPOINT_INTERVAL'''.

Each checkpoint generates one file per node. Note that checkpoints for all processes on a node will be stored in the same file. Each time a new checkpoint is taken, an additional set of files is created. The files are numbered by the checkpoint number. This allows the application to be restarted from checkpoints other than the most recent. The checkpoint number can be specified with the -ckpoint-num parameter. To restart a process:

<pre>
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-num 5 -f hosts -n 4
</pre>

Note that by default, the process will be restarted from the first checkpoint, so in most cases, the checkpoint number should be specified.
== Demux Engines ==

Hydra supports multiple I/O demux engines, including poll and select. The default is "poll". You can pick one through the mpiexec option -demux:

<pre>
shell$ mpiexec -demux select -f hosts -n 4 ./app
</pre>

This can also be controlled by using the '''HYDRA_DEMUX''' environment variable.
  
== Debugging ==

Hydra natively supports parallel debuggers such as TotalView and DDT.

You can debug an MPI application with TotalView by launching it as:

<pre>
shell$ totalview mpiexec -a -f hosts -n 4 ./app
</pre>

The "-a" option is a TotalView parameter which tells it that the arguments after it need to be passed to mpiexec.
  
You can debug an MPI application with DDT by launching it as:

<pre>
shell$ ddt mpiexec -a -f hosts -n 4 ./app
</pre>

The "-a" option is a DDT parameter which tells it that the arguments after it need to be passed to mpiexec.
  
  
For serial debuggers such as 'ddd', you can launch each process under a separate debugger window as:

<pre>
shell$ mpiexec -f hosts -n 4 ddd ./app
</pre>

This will spawn four copies of "./app", each launched under a separate 'ddd' instance.
  
You can do something similar for serial debuggers that do not have a graphical interface, such as 'gdb', by launching them over 'xterm':

<pre>
shell$ mpiexec -f hosts -n 4 xterm -e gdb ./app
</pre>

If you want only the third process to run in a debugger, and the remaining processes to run normally, you can do:

<pre>
shell$ mpiexec -f hosts -n 2 ./app : -n 1 xterm -e gdb ./app : -n 1 ./app
</pre>
  
== Using Hydra on Machines with Different User Names ==

Hydra only supports different usernames with some launchers (such as ssh and rsh). For this, the host file should contain a "user=<foo>" entry:

<pre>
shell$ cat hosts
  donner  user=foo
  foo     user=bar
  shakey  user=bar
  terra   user=foo
</pre>
  
 +
== Managing Output ==

For a moderate amount of output, <tt>-prepend-rank</tt> will add a <tt>[$N]</tt> prefix (where $N is a process's global rank) to each line of output.

For larger amounts of output, <tt>-outfile-pattern</tt> (for stdout) and <tt>-errfile-pattern</tt> (for stderr) can direct output to specific files. For example:

<pre>
$ mpiexec -outfile-pattern=output.log-%r-%h -errfile-pattern=output.err-%r-%h
</pre>

will create files with the global MPI rank (%r) and hostname (%h) embedded in the name.
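As an illustration of the resulting file names (the rank and host values below are hypothetical):

```shell
# What output.log-%r-%h would expand to for, e.g., rank 3 on host "donner"
# (illustrative values only):
rank=3
host=donner
echo "output.log-${rank}-${host}"
echo "output.err-${rank}-${host}"
# Prints: output.log-3-donner
#         output.err-3-donner
```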
  
== Name Publishing ==

In MPICH 3.1, Hydra is the default name service for MPI's name publishing features (MPI_PUBLISH_NAME/MPI_LOOKUP_NAME). For multihost programs to use this feature, you must first start a hydra_nameserver process and tell mpiexec which host it is running on. For example:

<pre>
  shell@myhost1$ hydra_nameserver &

  shell@myhost1$ mpiexec -hosts myhost1,myhost2 -n 4 -nameserver myhost1 ./a.out
</pre>
 

Latest revision as of 14:41, 4 October 2021

This wiki page only provides information on the external usage of Hydra. If you are looking for the internal workings of Hydra, you can find it here.


General

Hydra is a process management system for starting parallel jobs. Hydra is designed to natively work with multiple daemons such as ssh, rsh, pbs, slurm and sge.


Getting Help

All of the runtime options can be listed on the command line using the flag:

mpiexec -help

For further information about any specific flag, the information can be obtained by adding the option before the help flag:

mpiexec -<option> -help

Quick Start

Once built, the Hydra executables are in the bin subdirectory of the install directory if you have done an install. You should put this (bin) directory in your PATH in your .cshrc or .bashrc for usage convenience:

Put in .cshrc:  setenv PATH /home/you/mpich/bin:$PATH

Put in .bashrc: export PATH=/home/you/mpich/bin:$PATH

To compile your application use mpicc:

shell$ mpicc app.c -o app

Create a file with the names of the machines that you want to run your job on. This file may or may not include the local machine.

shell$ cat hosts
   donner
   foo
   shakey
   terra

To run your application on these nodes, use mpiexec:

shell$ mpiexec -f hosts -n 4 ./app

If the same application is run with 8 processes, it will use the hosts in a round robin manner. For instance, with the above host file, ranks 0 and 4 will execute on donner, ranks 1 and 5 on foo, ranks 2 and 6 on shakey and ranks 3 and 7 on terra.

The host file can also be specified as follows:

shell$ cat hosts
   donner:2
   foo:2
   shakey:2
   terra:2

In this case, ranks 0-1 will execute on donner, ranks 2-3 on foo, ranks 4-5 on shakey and ranks 6-7 on terra.

A more complex host file layout can be:

shell$ cat hosts
   donner:2
   foo:3
   shakey:2

In this case, the first 2 processes are scheduled on "donner", the next 3 on "foo" and the last 2 on "shakey". Comments in the host file start with a "#" character.

shell$ cat hosts
   # This is a sample host file
   donner:2     # The first 2 procs are scheduled to run here
   foo:3        # The next 3 procs run on this host
   shakey:2     # The last 2 procs run on this host

You can spawn multiple different executables using:

shell$ mpiexec -f hosts -n 4 ./app1 : -n 4 ./app2

The first four ranks will be of "app1" and the next 4 ranks will be of "app2".


Environment Settings

By default, hydra passes inherits environment variables from the shell on which mpiexec is launched, and passes it to the launched processes. However, each launcher disables the automatic propagation of some environment variables. For example, the SSH launcher disables the propagation of "DISPLAY", the SLURM launcher disables the propagation of "SLURM_*" variables, etc.

Users can force an environment variable to be propagated using the -genv option.


Environment variables can also be used to control several settings in Hydra:

HYDRA_HOST_FILE: This variable points to the default host file to use, when the "-f" option is not provided to mpiexec.

  For bash:
    export HYDRA_HOST_FILE=<path_to_host_file>/hosts

  For csh/tcsh:
    setenv HYDRA_HOST_FILE <path_to_host_file>/hosts

HYDRA_DEBUG: Setting this to "1" enables debug mode; set it to "0" to disable.

HYDRA_ENV: Setting this to "all" will pass all of the launching node's environment to the application processes. By default, if nothing is set, the launching node's environment is passed to the executables, as long as it does not overwrite any of the environment variables that have been preset by the remote shell.

HYDRA_LAUNCHER_EXTRA_ARGS: The value of this environment variable is appended to the launcher. This can be used, for example, to pass the "-q" argument to the launcher by setting HYDRA_LAUNCH_EXTRA_ARGS=-q.

MPIEXEC_TIMEOUT: The value of this environment variable is the maximum number of seconds this job will be permitted to run. When time is up, the job is aborted.

MPIEXEC_PORT_RANGE: If this environment variable is defined then Hydra will restrict its usage of ports for connecting its various processes to ports in this range. If this variable is not assigned, but MPICH_PORT_RANGE is assigned, then it will use the range specified by MPICH_PORT_RANGE for its ports. Otherwise, it will use whatever ports are assigned to it by the system. Port ranges are given as a pair of integers separated by a colon.

HYDRA_PROXY_RETRY_COUNT: The value of this environment variable determines the number of times a proxy retries establishing a connection to the main server.
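As an illustration of the MPIEXEC_PORT_RANGE format described above (a pair of integers separated by a colon), here is a minimal sketch. It is not part of Hydra, and the parse_port_range helper name is hypothetical:

```python
# Hypothetical sketch (not Hydra code): parse a MPIEXEC_PORT_RANGE-style
# value such as "10000:10100" into an inclusive (low, high) pair.
def parse_port_range(value):
    low_s, sep, high_s = value.partition(":")
    if sep != ":":
        raise ValueError("expected <low>:<high>, got %r" % value)
    low, high = int(low_s), int(high_s)
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range %r" % value)
    return low, high

print(parse_port_range("10000:10100"))  # -> (10000, 10100)
```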


== Hydra with Non-Ethernet Networks ==

There are two ways to use Hydra with TCP/IP over a non-default network.

The first way is using the -iface option to mpiexec to specify which network interface to use. For example, if your Myrinet network's IP emulation is configured on myri0, you can use:

shell$ mpiexec -f hostfile -iface myri0 -n 4 ./app1

Similarly, if your InfiniBand network's IP emulation is configured on ib0, you can use:

shell$ mpiexec -f hostfile -iface ib0 -n 4 ./app1

You can also control this using the HYDRA_IFACE environment variable.


The second way is to specify the appropriate IP addresses in your hostfile.

shell$ /sbin/ifconfig

eth0      Link encap:Ethernet  HWaddr 00:14:5E:57:C4:FA  
          inet addr:192.148.3.182  Bcast:192.148.248.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fe57:c4fa/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:989925894 errors:0 dropped:7186 overruns:0 frame:0
          TX packets:1480277023 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:441568994866 (411.2 GiB)  TX bytes:1864173370054 (1.6 TiB)
          Interrupt:185 Memory:e2000000-e2012100 

myri0     Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8  
          inet addr:10.21.3.182  Bcast:10.21.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100 

ib0       Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8  
          inet addr:10.31.3.182  Bcast:10.31.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100 

In the above case, the 192.148.x.x addresses belong to the standard Ethernet (or Gigabit Ethernet) network, the 10.21.x.x addresses to Myrinet, and the 10.31.x.x addresses to InfiniBand.

shell$ cat hostfile-eth
   192.148.3.180
   192.148.3.181
   192.148.3.182
   192.148.3.183

shell$ cat hostfile-myri
   10.21.3.180
   10.21.3.181
   10.21.3.182
   10.21.3.183

shell$ cat hostfile-ib
   10.31.3.180
   10.31.3.181
   10.31.3.182
   10.31.3.183

To run over the Ethernet interface, use:

shell$ mpiexec -f hostfile-eth -n 4 ./app1

To run over the Myrinet interface, use:

shell$ mpiexec -f hostfile-myri -n 4 ./app1


== Resource Managers and Launchers ==

Hydra natively interacts with a number of resource managers and launchers.

Resource managers provide information about the resources allocated by the user (e.g., slurm, pbs, loadleveler, lsf, sge, cobalt). Launchers allow mpiexec to launch processes on the system (e.g., ssh, rsh, fork, slurm, pbs, loadleveler, lsf, sge). Some tools act as both resource managers and launchers, while others play just one role.

In most cases, you don't need to do anything special for Hydra to work with your slurm, pbs, loadleveler, lsf, sge or cobalt environment. It'll automatically detect these resource managers and interact with them. The following command will be sufficient to launch processes on all the allocated cores:

shell$ mpiexec ./app

The above command allows Hydra to detect the available resource manager and launcher, obtain the resource allocation from the resource manager, and launch the appropriate number of processes.

However, if you do not want Hydra to auto-detect the launcher, you can customize it using the -launcher flag to mpiexec or the HYDRA_LAUNCHER environment variable.

shell$ mpiexec -launcher ssh -f hosts -n 4 ./app

 (or)

shell$ mpiexec -launcher fork -f hosts -n 4 ./app

The executable to use as the launcher can also be customized using the option -launcher-exec or the HYDRA_LAUNCHER_EXEC environment variable:

 $ mpiexec -launcher ssh -launcher-exec /usr/bin/ssh -f hosts -n 4 ./app


Similarly, if you do not want Hydra to auto-detect the resource manager, you can customize it using the -rmk flag to mpiexec or the HYDRA_RMK environment variable.

shell$ mpiexec -rmk pbs ./app

Finally, if you do not want Hydra to auto-detect the number of processes allocated or the machines allocated, you can customize that using the -n and -f flags.

shell$ mpiexec -rmk pbs -n 4 -f ~/hosts ./app

== Process-core Binding ==

On supported platforms, Hydra automatically configures available process-core binding capability using hwloc. More info on how to use these options can be found in the help text:

 shell$ mpiexec -bind-to -help
   -bind-to <object[:num]>  Specify the hardware element to bind each process to.
   -map-by <object[:num]>   Skip over this many elements between bindings. If not specified, defaults to the same value as -bind-to.

Binding options can also be controlled with the environment variables HYDRA_TOPOLIB, HYDRA_BINDING, HYDRA_MAPPING, and HYDRA_MEMBIND.

=== Examples ===

For the purposes of process binding/mapping examples, we will assume the following machine topology.

[[Image:Topo4.png]]

Setting the env variable HYDRA_TOPO_DEBUG=1 will force hydra to print out bindings, but not call the binding API. This is useful for testing binding options using hwloc's arbitrary topology loading feature, as is demonstrated below.

To bind processes to single hardware threads:

 shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to hwthread /bin/true | sort -k 2 -n
 process 0 binding: 10000000 
 process 1 binding: 00100000 
 process 2 binding: 00001000 
 process 3 binding: 00000010 
 process 4 binding: 01000000 
 process 5 binding: 00010000 
 process 6 binding: 00000100 
 process 7 binding: 00000001 

To bind processes to 2 cores at a time:

 shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to core:2 /bin/true | sort -k 2 -n
 process 0 binding: 10100000 
 process 1 binding: 00001010 
 process 2 binding: 01010000 
 process 3 binding: 00000101 
 process 4 binding: 10100000 
 process 5 binding: 00001010 
 process 6 binding: 01010000 
 process 7 binding: 00000101

A common use-case is to bind processes to a socket, and map by hwthread to group sequential ranks together.

 shell$ HYDRA_TOPO_DEBUG=1 mpiexec -n 8 -bind-to socket -map-by hwthread /bin/true | sort -k 2 -n
 process 0 binding: 10101010 
 process 1 binding: 10101010 
 process 2 binding: 10101010 
 process 3 binding: 10101010 
 process 4 binding: 01010101 
 process 5 binding: 01010101 
 process 6 binding: 01010101 
 process 7 binding: 01010101
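The mask patterns above can be reproduced with a short sketch. It assumes the same topology as the examples (socket 0 owns OS processing units 0, 2, 4, 6; socket 1 owns 1, 3, 5, 7; elements are enumerated socket by socket). The helper names are hypothetical, and this is an illustration of the mapping, not how Hydra itself computes bindings:

```python
# Hypothetical sketch (not Hydra code): reproduce the "-bind-to core:2"
# masks shown above. Assumed topology: socket 0 holds OS PUs 0,2,4,6 and
# socket 1 holds PUs 1,3,5,7, enumerated socket by socket.
SOCKETS = [[0, 2, 4, 6], [1, 3, 5, 7]]
ORDER = [pu for socket in SOCKETS for pu in socket]  # 0,2,4,6,1,3,5,7

def mask(pus, width=8):
    # Render a set of PUs as a left-to-right bitmask string.
    bits = ["0"] * width
    for pu in pus:
        bits[pu] = "1"
    return "".join(bits)

def bind_to(rank, num):
    # Bind each rank to `num` consecutive elements, wrapping around
    # once the enumeration is exhausted.
    start = (rank * num) % len(ORDER)
    return mask([ORDER[(start + i) % len(ORDER)] for i in range(num)])

for rank in range(8):
    print("process %d binding: %s" % (rank, bind_to(rank, 2)))
```

With num=1 the same sketch reproduces the "-bind-to hwthread" masks from the first example.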

Also included in the MPICH source is a program for printing out the affinity of a process (src/pm/hydra/examples/print_cpus_allowed.c) according to the OS. This can be used on Linux systems to test that bindings are working correctly.

 shell$ mpiexec -n 8 -bind-to socket ./print_cpus_allowed | sort
 crush[0]: Cpus_allowed_list:    0,2,4,6
 crush[1]: Cpus_allowed_list:    1,3,5,7
 crush[2]: Cpus_allowed_list:    0,2,4,6
 crush[3]: Cpus_allowed_list:    1,3,5,7
 crush[4]: Cpus_allowed_list:    0,2,4,6
 crush[5]: Cpus_allowed_list:    1,3,5,7
 crush[6]: Cpus_allowed_list:    0,2,4,6
 crush[7]: Cpus_allowed_list:    1,3,5,7

== X Forwarding ==

X forwarding is specific to each bootstrap server. Some servers enable it by default, while others do not. For ssh, it is disabled by default. To force-enable it, use the -enable-x option to mpiexec.

shell$ mpiexec -enable-x -f hosts -n 4 ./app


== Checkpoint/Restart Support ==

Hydra provides checkpoint/restart capability. Currently, only BLCR is supported. To use checkpointing, include the -ckpointlib option to mpiexec to specify the checkpointing library to use, and -ckpoint-prefix to specify the directory where the checkpoint images should be written:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -f hosts -n 4 ./app

While the application is running, the user can request a checkpoint at any time by sending a SIGUSR1 signal to mpiexec.

You can also automatically checkpoint the application at regular intervals using the mpiexec option -ckpoint-interval to specify the number of seconds between checkpoints:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-interval 3600 -f hosts -n 4 ./app

The checkpoint/restart parameters can be controlled with the environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and HYDRA_CKPOINT_INTERVAL.

Each checkpoint generates one file per node. Note that checkpoints for all processes on a node will be stored in the same file. Each time a new checkpoint is taken, an additional set of files is created. The files are numbered by the checkpoint number. This allows the application to be restarted from checkpoints other than the most recent. The checkpoint number can be specified with the -ckpoint-num parameter. To restart the application:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-num 5 -f hosts -n 4

Note that by default, the process will be restarted from the first checkpoint, so in most cases, the checkpoint number should be specified.


== Demux Engines ==

Hydra supports multiple I/O demux engines, including poll and select. The default is "poll". You can select one with the mpiexec option -demux:

shell$ mpiexec -demux select -f hosts -n 4 ./app

This can also be controlled by using the HYDRA_DEMUX environment variable.


== Debugging ==

Hydra natively supports parallel debuggers such as TotalView and DDT.

You can debug the MPI application with TotalView by launching it as:

shell$ totalview mpiexec -a -f hosts -n 4 ./app

The "-a" option is a TotalView parameter that tells it that the arguments after it need to be passed to mpiexec.


You can debug the MPI application with DDT by launching it as:

shell$ ddt mpiexec -a -f hosts -n 4 ./app

The "-a" option is a DDT parameter that tells it that the arguments after it need to be passed to mpiexec.


For serial debuggers such as 'ddd', you can launch each process with a separate debugger window as:

shell$ mpiexec -f hosts -n 4 ddd ./app

This will spawn four copies of "./app" each one launched under a separate 'ddd' instance.


You can do something similar to 'ddd' for serial debuggers that do not have a graphical interface, such as 'gdb', by launching them over 'xterm':

shell$ mpiexec -f hosts -n 4 xterm -e gdb ./app


If you want only the third process to run in a debugger, and the remaining processes to run normally, you can do:

shell$ mpiexec -f hosts -n 2 ./app : -n 1 xterm -e gdb ./app : -n 1 ./app


== Using Hydra on Machines with Different User Names ==

Hydra only supports using different usernames on some launchers (such as ssh and rsh). For this, the host file should contain a "user=<foo>" entry.


shell$ cat hosts
   donner  user=foo
   foo     user=bar
   shakey  user=bar
   terra   user=foo

== Managing Output ==

For a moderate amount of output, -prepend-rank will add a [$N] (where $N is a process's global rank) to each line of output.

For larger amounts of output, -outfile-pattern (for stdout) and -errfile-pattern (for stderr) can direct output to specific files.

For example:

$ mpiexec -outfile-pattern=output.log-%r-%h -errfile-pattern=output.err-%r-%h

will create files with the global MPI rank (%r) and hostname (%h) embedded in the file name.
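The substitution of the %r and %h escapes shown above can be sketched as follows; the expand_pattern helper is hypothetical and only illustrates the naming scheme, not Hydra's actual implementation:

```python
# Hypothetical sketch (not Hydra code): expand the %r (global rank) and
# %h (hostname) escapes in an -outfile-pattern style file name.
def expand_pattern(pattern, rank, host):
    return pattern.replace("%r", str(rank)).replace("%h", host)

print(expand_pattern("output.log-%r-%h", 3, "node17"))  # -> output.log-3-node17
```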

== Name Publishing ==

In MPICH 3.1, Hydra is the default name service for MPI's name publishing features (MPI_PUBLISH_NAME/MPI_LOOKUP_NAME). For multihost programs to use this feature, you must first start a hydra_nameserver process and tell mpiexec which host it is running on.

For example:

 user@myhost1$ hydra_nameserver &
 user@myhost1$ mpiexec -hosts myhost1,myhost2 -n 4 -nameserver myhost1 ./a.out