Using the Hydra Process Manager
This wiki page only provides information on the external usage of Hydra. If you are looking for the internal workings of Hydra, you can find it here.
Hydra is a process management system for starting parallel jobs. Hydra is designed to natively work with multiple daemons such as ssh, rsh, pbs, slurm and sge. However, in the current release, only ssh, rsh and fork are supported, with a preliminary version of slurm available.
Starting with MPICH2-1.1, Hydra is compiled into MPICH2 releases by default as an alternate process manager. You can invoke it as mpiexec.hydra.
Once built, the Hydra executables are in mpich2/bin, or the bin subdirectory of the install directory if you have done an install. You should put this (bin) directory in your PATH in your .cshrc or .bashrc for usage convenience:
Put in .cshrc:
setenv PATH /home/you/mpich2/bin:$PATH
Put in .bashrc:
export PATH=/home/you/mpich2/bin:$PATH
To compile your application use mpicc:
shell$ mpicc app.c -o app
Create a file with the names of the machines that you want to run your job on. This file may or may not include the local machine.
shell$ cat hosts
donner
foo
shakey
terra
To run your application on these nodes, use mpiexec:
shell$ mpiexec -f hosts -n 4 ./app
The host file can also be specified as follows:
shell$ cat hosts
donner:2
foo:3
shakey:2
In this case, the first 2 processes are scheduled on "donner", the next 3 on "foo" and the last 2 on "shakey". Comments in the host file start with a "#" character.
shell$ cat hosts
# This is a sample host file
donner:2   # The first 2 procs are scheduled to run here
foo:3      # The next 3 procs run on this host
shakey:2   # The last 2 procs run on this host
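The placement rule described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Hydra's own parser: the function names are hypothetical, and it assumes that a host listed without a ":N" suffix takes one process.

```python
def parse_host_file(text):
    """Return a list of (hostname, process_count) pairs from host-file text."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        entry = line.split()[0]  # ignore any extra per-host arguments
        name, _, count = entry.partition(":")
        # Assumption: a host listed without ":N" takes a single process.
        hosts.append((name, int(count) if count else 1))
    return hosts


def rank_placement(hosts):
    """Expand (host, count) pairs into one host name per process, in file order."""
    return [name for name, count in hosts for _ in range(count)]


sample = """\
# This is a sample host file
donner:2   # The first 2 procs are scheduled to run here
foo:3      # The next 3 procs run on this host
shakey:2   # The last 2 procs run on this host
"""

# Print which host each of the 7 processes lands on.
for rank, host in enumerate(rank_placement(parse_host_file(sample))):
    print(rank, host)
```

With the sample file, processes 0 and 1 land on donner, 2 through 4 on foo, and 5 and 6 on shakey.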
Hydra with Non-Ethernet Networks
If you want to use Hydra with TCP/IP on a non-default network, you just need to specify the IP addresses of that network in your host file.
shell$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:14:5E:57:C4:FA
          inet addr:192.148.3.182  Bcast:192.148.3.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fe57:c4fa/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:989925894 errors:0 dropped:7186 overruns:0 frame:0
          TX packets:1480277023 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:441568994866 (411.2 GiB)  TX bytes:1864173370054 (1.6 TiB)
          Interrupt:185 Memory:e2000000-e2012100

myri0     Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8
          inet addr:10.21.3.182  Bcast:10.21.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100

ib0       Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8
          inet addr:10.31.3.182  Bcast:10.21.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100
In the above case the 192.148.x.x IP series refers to the standard Ethernet (or Gigabit Ethernet) network, the 10.21.x.x series refers to Myrinet and the 10.31.x.x refers to InfiniBand.
shell$ cat hostfile-eth
192.148.3.180
192.148.3.181
192.148.3.182
192.148.3.183

shell$ cat hostfile-myri
10.21.3.180
10.21.3.181
10.21.3.182
10.21.3.183

shell$ cat hostfile-ib
10.31.3.180
10.31.3.181
10.31.3.182
10.31.3.183
To run over the Ethernet interface use:
shell$ mpiexec -f hostfile-eth -n 4 ./app1
To run over the Myrinet interface use:
shell$ mpiexec -f hostfile-myri -n 4 ./app1
HYDRA_HOST_FILE: This variable points to the default host file to use, when the "-f" option is not provided to mpiexec.
For bash: export HYDRA_HOST_FILE=<path_to_host_file>/hosts
For csh/tcsh: setenv HYDRA_HOST_FILE <path_to_host_file>/hosts
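As a rough illustration of the fallback just described, the following Python sketch picks the host file the way the text says mpiexec does: the "-f" value if one is given, otherwise HYDRA_HOST_FILE. The function name and its arguments are hypothetical and only model the documented rule; this is not Hydra's code.

```python
import os


def resolve_host_file(cli_host_file=None, environ=None):
    """Return the host file to use, or None if neither source provides one."""
    environ = os.environ if environ is None else environ
    if cli_host_file is not None:
        # A value passed with "-f" is used as given.
        return cli_host_file
    # Otherwise fall back to the default named by HYDRA_HOST_FILE.
    return environ.get("HYDRA_HOST_FILE")


print(resolve_host_file(None, {"HYDRA_HOST_FILE": "/home/you/hosts"}))
```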
HYDRA_DEBUG: Setting this to "1" enables debug mode; set it to "0" to disable.
HYDRA_ENV: Setting this to "all" will pass all the environment to the application processes.
HYDRA_PROXY_PORT: The port to use for the proxies.
A bootstrap server is the basic remote node access mechanism that is provided on any system. Hydra supports multiple bootstrap servers including ssh, rsh, fork, and slurm to launch processes. All of these are compiled in by default, so you can pick any one of them at runtime using the mpiexec option -bootstrap:
shell$ mpiexec -bootstrap ssh -f hosts -n 4 ./app
(or)
shell$ mpiexec -bootstrap fork -f hosts -n 4 ./app
This can also be controlled by using the HYDRA_BOOTSTRAP environment variable.
The default bootstrap server is ssh.
The executable to use as the bootstrap server can be specified using the option -bootstrap-exec:
shell$ mpiexec -bootstrap ssh -bootstrap-exec /usr/bin/ssh -f hosts -n 4 ./app
This can also be specified using the HYDRA_BOOTSTRAP_EXEC environment variable. If the bootstrap executable is not specified, Hydra will automatically look for it in your path and other known locations.
On supported platforms, Hydra automatically configures available process-core binding capability (currently using PLPA). We support multiple levels of allocation strategies:
- Basic allocation strategies: There are two forms of basic allocation: (i) a round-robin mechanism that uses the processor IDs specified by the OS, and (ii) a user-defined mapping. For the user-defined mapping, two schemes are provided: command-line based and host-file based. The command-line scheme lets the user specify one common mapping for all physical nodes on the command line. The host-file scheme is the most general and lets the user specify the mapping for each node separately.
The modes of process binding in the basic allocation are: round-robin ("rr") and user-defined ("user").
shell$ mpiexec -binding rr -f hosts -n 8 ./app
Within the user-defined binding, two modes are supported: command-line and host-file based. The command-line based mode can be used as follows:
shell$ mpiexec -binding user:0,3 -f hosts -n 4 ./app
If a machine has 4 processing elements, and only two bindings are provided (as in the above example), the rest are padded with (-1), which refers to no binding. Also, the mapping is the same for all machines; so if the application is run with 8 processes, the first 2 processes on "each machine" are bound to processing elements as specified.
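The padding rule can be sketched in Python as follows. The helper names are hypothetical and the code only models the behavior described above; it is not taken from Hydra.

```python
def pad_binding(user_map, num_elements):
    """Pad a user-defined binding list with -1 (no binding) up to num_elements."""
    return list(user_map) + [-1] * max(0, num_elements - len(user_map))


def bindings_per_host(user_map, elements_per_host):
    """Apply the same command-line mapping to every host."""
    return {host: pad_binding(user_map, n)
            for host, n in elements_per_host.items()}


# "-binding user:0,3" on machines with 4 processing elements each.
print(pad_binding([0, 3], 4))
print(bindings_per_host([0, 3], {"donner": 4, "foo": 4}))
```

Here the first two processes on each machine are bound to processing elements 0 and 3, and the remaining two slots carry -1, meaning no binding.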
The host-file based mode for user-defined binding can be used by the "map=" argument on each host line. E.g.:
shell$ cat hosts
donner:4 map=0,-1,-1,3
foo:4 map=3,2
shakey:2
Using this method, each host can be given a different mapping. Any unspecified mappings are treated as (-1), referring to no binding.
Command-line based mappings are given a higher priority than the host-file based mappings. So, if a mapping is given at both places, the host-file mappings are ignored.
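The interaction between the two schemes can be modeled with a short Python sketch. It is an illustration under stated assumptions, not Hydra's implementation: the function names are hypothetical, a host without ":N" is assumed to take one process, and the mapping is padded to the number of processes listed for the host.

```python
def parse_host_line(line):
    """Parse 'name[:count] [map=a,b,...]' into (name, count, mapping)."""
    fields = line.split("#", 1)[0].split()
    name, _, count = fields[0].partition(":")
    count = int(count) if count else 1  # assumption: default of one process
    mapping = []
    for field in fields[1:]:
        if field.startswith("map="):
            mapping = [int(x) for x in field[len("map="):].split(",")]
    return name, count, mapping


def effective_binding(host_line, cli_map=None):
    """Pick the command-line mapping if given, else the host-file one, then pad."""
    name, count, host_map = parse_host_line(host_line)
    chosen = cli_map if cli_map is not None else host_map
    # Unspecified entries become -1, which means "no binding".
    padded = list(chosen) + [-1] * max(0, count - len(chosen))
    return name, padded


for line in ["donner:4 map=0,-1,-1,3", "foo:4 map=3,2", "shakey:2"]:
    print(effective_binding(line))

# A command-line mapping overrides whatever the host file says.
print(effective_binding("donner:4 map=0,-1,-1,3", cli_map=[0, 3]))
```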
- Topology-aware allocation strategies: These are more sophisticated in that they take the topology of the system's processing units into account and assign processes in that order. Currently, "dense" and "sparse" schemes are provided. The "dense" scheme packs processes as closely as it can; this maximizes resource sharing, in the hope that the communication library can take advantage of the packing for better performance. The "sparse" scheme cycles through all the available processing units to minimize resource sharing (on the assumption that the closer two processes are, the more resources they share).
Different modes of process binding in the topology-aware allocation are:
CPU based options:
- sparse:sockets,cores,threads -- use all CPU resources
- sparse:sockets,cores -- avoid using multiple threads on a core
- sparse:sockets -- avoid using multiple cores on a socket
- dense:sockets,cores,threads -- use all CPU resources
- dense:sockets,cores -- avoid using multiple threads on a core
- dense:sockets -- avoid using multiple cores on a socket
Memory based options:
- sparse:l1,l2,l3,mem -- use all memory resources
- sparse:l2,l3,mem -- avoid sharing l1 cache
- sparse:l3,mem -- avoid sharing l2 cache
- sparse:mem -- avoid sharing l3 cache
- dense:l1,l2,l3,mem -- use all memory resources
- dense:l2,l3,mem -- avoid sharing l1 cache
- dense:l3,mem -- avoid sharing l2 cache
- dense:mem -- avoid sharing l3 cache
shell$ mpiexec -binding dense -f hosts -n 8 ./app
(or)
shell$ mpiexec -binding dense:sockets,cores -f hosts -n 8 ./app
(or)
shell$ mpiexec -binding sparse:l1,l2,l3,mem -f hosts -n 6 ./app
Consider the following layout of processing elements in the system (e.g., two nodes, each with two processors, and each processor with two cores). Suppose the processor IDs that the operating system assigns to these processing elements are as shown below:
 __________________________________________     __________________________________________
|  _________________    _________________  |   |  _________________    _________________  |
| |  _____   _____  |  |  _____   _____  | |   | |  _____   _____  |  |  _____   _____  | |
| | |     | |     | |  | |     | |     | | |   | | |     | |     | |  | |     | |     | | |
| | |  0  | |  2  | |  | |  1  | |  3  | | |   | | |  0  | |  2  | |  | |  1  | |  3  | | |
| | |     | |     | |  | |     | |     | | |   | | |     | |     | |  | |     | |     | | |
| | |_____| |_____| |  | |_____| |_____| | |   | | |_____| |_____| |  | |_____| |_____| | |
| |_________________|  |_________________| |   | |_________________|  |_________________| |
|__________________________________________|   |__________________________________________|
In this case, the binding options are as follows:
- RR: 0, 1, 2, 3 (use the order provided by the OS)
- Sparse: 0, 1, 2, 3 (increasing sharing of resources)
- Dense: 0, 2, 1, 3 (closest packing)
- User: as defined by the user
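The three automatic orderings listed above can be reproduced with a small Python model of one node. This is a sketch for intuition only, with hypothetical function names; it handles just sockets and cores and is not how Hydra (or PLPA) computes bindings.

```python
def rr_order(sockets):
    """Round-robin: use the processor IDs in the order the OS numbers them."""
    return sorted(pid for socket in sockets for pid in socket)


def dense_order(sockets):
    """Dense: fill every core of one socket before moving to the next socket."""
    return [pid for socket in sockets for pid in socket]


def sparse_order(sockets):
    """Sparse: take one core from each socket in turn before reusing a socket."""
    order = []
    depth = max(len(socket) for socket in sockets)
    for i in range(depth):
        for socket in sockets:
            if i < len(socket):
                order.append(socket[i])
    return order


# One node from the layout above: socket 0 holds OS IDs 0 and 2,
# socket 1 holds OS IDs 1 and 3.
node = [[0, 2], [1, 3]]

print("rr:    ", rr_order(node))
print("sparse:", sparse_order(node))
print("dense: ", dense_order(node))
```

For this layout the round-robin and sparse orders coincide only because the OS happens to alternate IDs between the two sockets; with a different numbering they would differ.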
Binding options can also be controlled with the environment variable HYDRA_BINDING.
X-forwarding is specific to each bootstrap server. Some servers do it by default, while some don't. For ssh, this is disabled by default. To force-enable it, you should use the option -enable-x to mpiexec.
shell$ mpiexec -enable-x -f hosts -n 4 ./app
Hydra also supports launching proxies in persistent mode on the system (e.g., by a system administrator). To launch in persistent mode, use:
shell$ mpiexec -boot-proxies -f hosts
shell$ mpiexec -use-persistent -f hosts -n 4 ./app1
shell$ mpiexec -use-persistent -f hosts -n 4 ./app2
shell$ mpiexec -use-persistent -f hosts -n 4 ./app3
shell$ mpiexec -shutdown-proxies -f hosts
Persistent mode can also be picked using the environment setting HYDRA_LAUNCH_MODE=persistent.
The option "-boot-foreground-proxies" can be used to prevent persistent proxies from spawning a child process and exiting. This option is useful for debugging. This option can also be picked using the environment setting HYDRA_BOOT_FOREGROUND_PROXIES=1.
shell$ mpiexec -boot-foreground-proxies -f hosts
shell$ mpiexec -use-persistent -f hosts -n 4 ./app1
shell$ mpiexec -shutdown-proxies -f hosts
Hydra supports different communication sub-systems to connect proxies in the persistent mode. The default is "none", which means that the proxies are not connected. You can pick these through the mpiexec option -css:
shell$ mpiexec -css ib -f hosts -n 4 ./app
(or)
shell$ mpiexec -css mx -f hosts -n 4 ./app
This can also be controlled by using the HYDRA_CSS environment variable.
Resource Manager Integration
Hydra can integrate with different resource managers. The default is "dummy", which means no resource manager. You can pick one through the mpiexec option -rmk:
shell$ mpiexec -rmk pbs -f hosts -n 4 ./app
This can also be controlled by using the HYDRA_RMK environment variable.
Hydra (experimentally) provides checkpoint/restart capability. Currently, only BLCR is being experimented with. Use the mpiexec option -ckpointlib to specify the checkpointing library and -ckpoint-prefix to specify the prefix of the file to which the checkpoint image is written:
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 ./app
While the application is running, the user can request a checkpoint at any time by sending a SIGTSTP signal (Ctrl+Z) to mpiexec.
You can also automatically checkpoint the application at regular intervals using the mpiexec option -ckpoint-interval (seconds):
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -ckpoint-interval 3600 -f hosts -n 4 ./app
The checkpoint/restart parameters can be controlled with the environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and HYDRA_CKPOINT_INTERVAL.
To restart a process:
shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4
Hydra in Hybrid Environments
Hydra can also be used to launch other process managers, such as a UPC launcher. For example:
shell$ mpiexec -n 2 -ranks-per-proc=4 upcrun -n 4 ./app
This launches two instances of upcrun, each of which is expected to launch 4 application processes (two subgroups of processes). Hydra needs the -ranks-per-proc argument to tell it how many MPI ranks it needs to allocate to each group of processes.
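The arithmetic behind -ranks-per-proc can be sketched as below. The function name is hypothetical, and the assumption that each launcher receives a contiguous block of ranks is made only for illustration; the text above states just the number of ranks per group.

```python
def rank_groups(num_launchers, ranks_per_proc):
    """Split num_launchers * ranks_per_proc ranks into one block per launcher.

    Assumption: each launcher gets a contiguous block of rank numbers.
    """
    return [list(range(i * ranks_per_proc, (i + 1) * ranks_per_proc))
            for i in range(num_launchers)]


# "mpiexec -n 2 -ranks-per-proc=4 ..." gives two groups of four ranks.
groups = rank_groups(2, 4)
print(groups)
print("total ranks:", sum(len(g) for g in groups))
```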
If the internal nested environment also needs to use Hydra as a launcher, but not as a process manager, this can be set using:
shell$ mpiexec -n 2 -ranks-per-proc=4 mpiexec -n 4 -disable-pm-env ./app
(or)
shell$ mpiexec -n 2 -ranks-per-proc=4 HYDRA_PM_ENV=0 mpiexec -n 4 ./app