Using the Hydra Process Manager

From Mpich
Revision as of 18:04, 29 September 2010 by Balaji (talk | contribs)

Jump to: navigation, search

This wiki page only provides information on the external usage of Hydra. If you are looking for the internal workings of Hydra, you can find it here.


Hydra is a process management system for starting parallel jobs. Hydra is designed to natively work with multiple daemons such as ssh, rsh, pbs, slurm and sge.

Quick Start

Starting MPICH2-1.1, hydra is compiled into MPICH2 releases by default as a alternate process manager. You can use it as mpiexec.hydra.

Starting MPICH2-1.3, hydra is the default process manager, which is automatically used when you use mpiexec.

Once built, the Hydra executables are in mpich2/bin, or the bin subdirectory of the install directory if you have done an install. You should put this (bin) directory in your PATH in your .cshrc or .bashrc for usage convenience:

Put in .cshrc:  setenv PATH /home/you/mpich2/bin:$PATH

Put in .bashrc: export PATH=/home/you/mpich2/bin:$PATH

To compile your application use mpicc:

shell$ mpicc app.c -o app

Create a file with the names of the machines that you want to run your job on. This file may or may not include the local machine.

shell$ cat hosts

To run your application on these nodes, use mpiexec:

shell$ mpiexec -f hosts -n 4 ./app

The host file can also be specified as follows:

shell$ cat hosts

In this case, the first 2 processes are scheduled on "donner", the next 3 on "foo" and the last 2 on "shakey". Comments in the host file start with a "#" character.

shell$ cat hosts
   # This is a sample host file
   donner:2     # The first 2 procs are scheduled to run here
   foo:3        # The next 3 procs run on this host
   shakey:2     # The last 2 procs run on this host

Environment Settings

HYDRA_HOST_FILE: This variable points to the default host file to use, when the "-f" option is not provided to mpiexec.

  For bash:
    export HYDRA_HOST_FILE=<path_to_host_file>/hosts

  For csh/tcsh:
    setenv HYDRA_HOST_FILE <path_to_host_file>/hosts

HYDRA_DEBUG: Setting this to "1" enables debug mode; set it to "0" to disable.

HYDRA_ENV: Setting this to "all" will pass all of the launching node's environment to the application processes. By default, if nothing is set, the launching node's environment is passed to the executables, as long as it does not overwrite any of the environment variables that have been preset by the remote shell.

MPIEXEC_TIMEOUT: The value of this environment variable is the maximum number of seconds this job will be permitted to run. When time is up, the job is aborted.

MPIEXEC_PORT_RANGE: If this environment variable is defined then Hydra will restrict its usage of ports for connecting its various processes to ports in this range. If this variable is not assigned, but MPICH_PORT_RANGE is assigned, then it will use the range specified by MPICH_PORT_RANGE for its ports. Otherwise, it will use whatever ports are assigned to it by the system. Port ranges are given as a pair of integers separated by a colon.

Hydra with Non-Ethernet Networks

There are two ways to use Hydra with TCP/IP over the non-default network.

The first way is using the -iface option to mpiexec to specify which network interface to use. For example, if your Myrinet network's IP emulation is configured on myri0, you can use:

shell$ mpiexec -f hostfile -iface myri0 -n 4 ./app1

Similarly, if your InfiniBand network's IP emulation is configured on ib0, you can use:

shell$ mpiexec -f hostfile -iface ib0 -n 4 ./app1

You can also control this using the HYDRA_IFACE environment variable.

The second way is to specify the appropriate IP addresses in your hostfile.

shell$ /sbin/ifconfig

eth0      Link encap:Ethernet  HWaddr 00:14:5E:57:C4:FA  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::214:5eff:fe57:c4fa/64 Scope:Link
          RX packets:989925894 errors:0 dropped:7186 overruns:0 frame:0
          TX packets:1480277023 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:441568994866 (411.2 GiB)  TX bytes:1864173370054 (1.6 TiB)
          Interrupt:185 Memory:e2000000-e2012100 

myri0     Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100 

ib0       Link encap:Ethernet  HWaddr 00:14:5E:57:C4:F8  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::214:5eff:fe57:c4f8/64 Scope:Link
          RX packets:3068986439 errors:0 dropped:7841 overruns:0 frame:0
          TX packets:2288060450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3598751494497 (3.2 TiB)  TX bytes:1744058613150 (1.5 TiB)
          Interrupt:185 Memory:e4000000-e4012100 

In the above case the 192.148.x.x IP series refers to the standard Ethernet (or Gigabit Ethernet) network, the 10.21.x.x series refers to Myrinet and the 10.31.x.x refers to InfiniBand.

shell$ cat hostfile-eth

shell$ cat hostfile-myri

shell$ cat hostfile-ib

To run over the Ethernet interface use:

shell$ mpiexec -f hostfile-eth -n 4 ./app1

To run over the Myrinet interface use:

shell$ mpiexec -f hostfile-myri -n 4 ./app1

Bootstrap Servers

A bootstrap server is the basic remote node access mechanism that is provided on any system. Hydra supports multiple bootstrap servers including ssh, rsh, fork, and slurm to launch processes. All of these are compiled in by default, so you can pick any one of them at runtime using the mpiexec option -bootstrap:

shell$ mpiexec -bootstrap ssh -f hosts -n 4 ./app


shell$ mpiexec -bootstrap fork -f hosts -n 4 ./app

This can also be controlled by using the HYDRA_BOOTSTRAP environment variable.

The default bootstrap server is ssh.

The executable to use as the bootstrap server can be specified using the option -bootstrap-exec:

 $ mpiexec -bootstrap ssh -bootstrap-exec /usr/bin/ssh -f hosts -n 4 ./app

This can also be specified using the HYDRA_BOOTSTRAP_EXEC environment variable. If the bootstrap executable is not specified, Hydra will automatically look for it in your path and other known locations.

Hydra must have been configured with the bootstrap server you specify. By default, hydra is configured with a set of standard and popular bootstrap servers. The configure option --with-hydra-bss may be used to specify a set of bootstrap servers; for example

 $ configure --with-hydra-bss=ssh,rsh,fork,slurm

Process-core Binding

On supported platforms, Hydra automatically configures available process-core binding capability (currently using PLPA or hwloc). We support multiple levels of allocation strategies:

  • Basic allocation strategies: There are two forms of basic allocation: (i) based on a round-robin mechanism using the OS specified processor IDs, and (ii) based on a user-defined mapping. Further, for the user-defined mapping, two schemes are provided---command-line and host-file based. The command-line scheme lets the user specify a common-mapping for all physical nodes on the command line. The host-file scheme is the most general and lets the user specify the mapping for each node separately.

The modes of process binding in the basic allocation are: round-robin ("rr") and user-defined ("user").

shell$ mpiexec -binding rr -f hosts -n 8 ./app

Within the user-defined binding, two modes are supported: command-line and host-file based. The command-line based mode can be used as follows:

shell$ mpiexec -binding user:0,3 -f hosts -n 4 ./app

If a machine has 4 processing elements, and only two bindings are provided (as in the above example), the rest are padded with (-1), which refers to no binding. Also, the mapping is the same for all machines; so if the application is run with 8 processes, the first 2 processes on "each machine" are bound to processing elements as specified.

The host-file based mode for user-defined binding can be used by the "map=" argument on each host line. E.g.:

shell$ cat hosts
   donner:4    binding=user:0,-1,-1,3
   foo:4       binding=rr

Using this method, each host can be given a different mapping. Any unspecified mappings are treated as (-1), referring to no binding.

Command-line based mappings are given a higher priority than the host-file based mappings. So, if a mapping is given at both places, the host-file mappings are ignored.

  • Topology-aware allocation strategies: These are a bit more intelligent in that they try to understand the system processing unit topology and assign processes in that order.

Different modes of process binding in the topology-aware allocation are:

CPU based options:

  1. cpu -- use all CPU resources
  2. cpu:sockets,cores,threads -- use all CPU resources
  3. cpu:sockets,cores -- avoid sharing a core
  4. cpu:sockets -- avoid sharing a socket

Memory based options:

  1. cache:l3,l2,l1 -- use all cache resources
  2. cache:l3,l2 -- avoid sharing l1 cache
  3. cache:l3 -- avoid sharing l2 cache
  4. cache:none -- avoid sharing l3 cache
shell$ mpiexec -binding cpu -f hosts -n 8 ./app


shell$ mpiexec -binding cpu:sockets,cores -f hosts -n 8 ./app


shell$ mpiexec -binding cache:l3,l2,l1 -f hosts -n 6 ./app

Consider the following layout of processing elements in the system (e.g., two nodes, each with two processors, and each processor with two cores). Suppose the Operating System assigned processor IDs for each of these processing elements are as shown below:

__________________________________________      __________________________________________
|  _________________    _________________  |    |  _________________    _________________  | 
| |  _____   _____  |  |  _____   _____  | |    | |  _____   _____  |  |  _____   _____  | |
| | |     | |     | |  | |     | |     | | |    | | |     | |     | |  | |     | |     | | |
| | |     | |     | |  | |     | |     | | |    | | |     | |     | |  | |     | |     | | | 
| | |  0  | |  2  | |  | |  1  | |  3  | | |    | | |  0  | |  2  | |  | |  1  | |  3  | | |
| | |     | |     | |  | |     | |     | | |    | | |     | |     | |  | |     | |     | | |
| | |_____| |_____| |  | |_____| |_____| | |    | | |_____| |_____| |  | |_____| |_____| | |
| |_________________|  |_________________| |    | |_________________|  |_________________| |
|__________________________________________|    |__________________________________________|

In this case, the binding options are as follows:

  1. rr 0, 1, 2, 3 (use the order provided by the OS)
  2. cpu 0, 2, 1, 3 (using all resources)
  3. cpu:sockets 0, 1 (not sharing the socket)
  4. user as defined by the user

Binding options can also be controlled with the environment variable HYDRA_BINDING.

X Forwarding

X-forwarding is specific to each bootstrap server. Some servers do it by default, while some don't. For ssh, this is disabled by default. To force-enable it, you should use the option -enable-x to mpiexec.

shell$ mpiexec -enable-x -f hosts -n 4 ./app

Resource Manager Integration

Hydra provides capability to integrate with different resource managers. The default is "pbs". You can pick these through the mpiexec option -rmk:

shell$ mpiexec -rmk pbs ./app

By default, mpiexec will coordinate with PBS to use all the allocated nodes to run the app. You can also force it to run the application using a different number of processes or on a different set of nodes using:

shell$ mpiexec -rmk pbs -n 4 -f ~/hosts ./app

This can also be controlled by using the HYDRA_RMK environment variable.

Checkpoint/Restart Support

Hydra provides checkpoint/restart capability. Currently, only BLCR is supported. To use checkpointing include the -ckpointlib option for mpiexec to specify the checkpointing library to use and -ckpoint-prefix to specify the directory where the checkpoint images should be written:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -f hosts -n 4 ./app

While the application is running, the user can request for a checkpoint at any time by sending a SIGUSR1 signal to mpiexec.

You can also automatically checkpoint the application at regular intervals using the mpiexec option -ckpoint-interval to specify the number of seconds between checkpoints:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-interval 3600 -f hosts -n 4 ./app

The checkpoint/restart parameters can be controlled with the environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and HYDRA_CKPOINT_INTERVAL.

Each checkpoint generates one file per node. Note that checkpoints for all processes on a node will be stored in the same file. Each time a new checkpoint is taken an additional set of files are created. The files are numbered by the checkpoint number. This allows the application to be restarted from checkpoints other than the most recent. The checkpoint number can be specified with the -ckpoint-num parameter. To restart a process:

shell$ mpiexec -ckpointlib blcr -ckpoint-prefix /home/buntinas/ckpts/app.ckpoint -ckpoint-num 5 -f hosts -n 4

Note that by default, the process will be restarted from the first checkpoint, so in most cases, the checkpoint number should be specified.

Demux Engines

Hydra supports multiple I/O demux engines including poll and select. The default is "poll". You can pick these through the mpiexec option -demux:

shell$ mpiexec -demux select -f hosts -n 4 ./app

This can also be controlled by using the HYDRA_DEMUX environment variable.


Hydra supports parallel debugging with totalview. You can debug the MPI application with totalview by launching it as:

shell$ totalview mpiexec -a -f hosts -n 4 ./app

The "-a" option is a totalview parameter which tells it that the arguments after that need to be passed to mpiexec.

Hydra in hybrid environments

Hydra can be used to launch other process managers as well, such as a UPC launcher, for example:

shell$ mpiexec -n 2 -ranks-per-proc=4 upcrun -n 4 ./app

This launches two instances of upcrun, each of which is expected to launch 4 application processes (two subgroups of processes). Hydra needs the -ranks-per-proc argument to tell it how many MPI ranks it needs to allocate to each group of processes.

If the internal nested environment also needs to use Hydra as a launcher, but not as a process manager, this can be set using:

shell$ mpiexec -n 2 -ranks-per-proc=4 mpiexec -n 4 ./app