Hydra Process Management Framework
NOTE: This is a rough first draft of the Hydra design as it stands on 03/15/2009. The design may change as the project matures, and there may be periods when this document is out of sync with those changes, but we will try to keep it up to date.
The Hydra framework has the following basic components:
- Job Launcher (e.g., mpiexec)
- Control system
- Process manager
- Bootstrap server (e.g., ssh, fork, pbs, slurm, sge)
- Process Management proxy
- I/O demux engine
Architecturally, their relationship is shown in the figure below:
TODO: This figure needs to be added.
Job Launcher: The main responsibility of this layer is to collect information from the user about the application, where to launch the processes, and how to map processes to cores. It also reads stdin and forwards it to the process with rank 0 in COMM_WORLD, and it reads stdout/stderr from the different processes and directs them appropriately.
Control System: The control system provides plug-in capabilities for resource manager extensions. For example, if the application needs to allocate nodes on a system before launching a job, the control system plays this part. Similarly, the control system can allow for decoupled job launching in cases where a single system reservation is to be used for multiple jobs. In the current implementation, the control system is very simple and provides none of this functionality.
Process Manager: The process manager provides processes with the necessary environment setup as well as the MPICH2 Process Management Interface (PMI) functionality. Currently, only PMI-1 is supported; we plan to add PMI-2, which is still being drafted.
Bootstrap Server: The bootstrap server mainly functions as a pre-configured daemon system that allows the upper layers to launch processes throughout the system. For example, the ssh bootstrap server forks off processes, each of which performs an ssh to a single machine to launch one process.
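The launching behavior described above can be sketched as follows. This is a minimal illustration rather than Hydra's implementation: the function names (build_command, launch, wait_all) are hypothetical, and the only assumption about ssh is the standard "ssh host command" invocation form.

```python
import subprocess
import sys


def build_command(kind, host, argv):
    # "fork" runs argv on the local node; "ssh" prefixes it with an ssh
    # invocation so that the same argv runs on the remote host.
    if kind == "fork":
        return list(argv)
    if kind == "ssh":
        return ["ssh", host] + list(argv)
    raise ValueError("unknown bootstrap server: " + kind)


def launch(kind, hosts, argv):
    # Start one process per host entry. The bootstrap layer does not
    # interpret argv; it could be a proxy or the application itself.
    # The returned Popen handles carry the stdin/stdout/stderr end points.
    procs = []
    for host in hosts:
        procs.append(subprocess.Popen(
            build_command(kind, host, argv),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        ))
    return procs


def wait_all(procs):
    # Wait for every launched process and collect (exit status, stdout).
    results = []
    for proc in procs:
        out, _err = proc.communicate()
        results.append((proc.returncode, out))
    return results
```

For example, launch("fork", ["localhost", "localhost"], [sys.executable, "-c", "print('hi')"]) starts two local processes, and wait_all returns their exit statuses together with the captured output.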
Process Management Proxy: The process management proxy is a helper agent that is spawned at each node on the system to assist the process manager with process spawning, process cleanup, signal forwarding, I/O forwarding, PMI forwarding, etc. It can perform any task the process manager can. Thus, even a hierarchy of process management proxies can be created, where each proxy acts as a process manager for its sub-tree.
I/O Demux Engine: This is a convenience component, where different components can "register" their file descriptors and the demux engine waits for events on any of these descriptors. This gives a centralized event management mechanism, so we do not need different threads blocking for different events. The I/O demux engine uses a synchronous callback mechanism. That is, with each file descriptor, the calling component provides a pointer to the function that has to be called when an event occurs on the descriptor. The demux engine blocks for events on all its registered file descriptors and calls the appropriate callback whenever there is an event. This component is very useful in the implementation, but does not play a crucial role in the architecture itself.
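The register-and-callback pattern can be illustrated with a small sketch built on Python's standard selectors module. The DemuxEngine class and its method names are hypothetical and are not Hydra's actual C interface; the sketch only shows how one blocking wait can dispatch to per-descriptor callbacks.

```python
import os
import selectors


class DemuxEngine:
    """Hypothetical stand-in for the I/O demux engine described above."""

    def __init__(self):
        self._sel = selectors.DefaultSelector()

    def register(self, fd, callback):
        # Associate a callback with a file descriptor (read events only).
        self._sel.register(fd, selectors.EVENT_READ, callback)

    def deregister(self, fd):
        self._sel.unregister(fd)

    def pending(self):
        # True while at least one descriptor is still registered.
        return bool(self._sel.get_map())

    def wait_for_event(self, timeout=None):
        # Block until a registered descriptor is readable, then call the
        # matching callbacks synchronously in the calling thread.
        for key, _events in self._sel.select(timeout):
            key.data(key.fd)


def demo():
    engine = DemuxEngine()
    received = []
    read_end, write_end = os.pipe()

    def on_readable(fd):
        data = os.read(fd, 4096)
        if data:
            received.append(data)
        else:
            # End of file: stop watching this descriptor.
            engine.deregister(fd)
            os.close(fd)

    engine.register(read_end, on_readable)
    os.write(write_end, b"hello from rank 0\n")
    os.close(write_end)
    while engine.pending():
        engine.wait_for_event()
    return received
```

Because the callbacks run in the thread that called wait_for_event, no locking is needed between handlers, which matches the single-threaded event model described above.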
Implementation and Control Flow
There are three basic classes of executables in the current model: the job launcher process, the process management proxy, and the application process (for simplicity, we consider only one application executable).
Job Launcher Process: The job launcher process (e.g., mpiexec) takes the user arguments and adds them to the HYD_Handle structure (defined in src/pm/hydra/include/hydra.h). It also creates callback handlers for stdin, stdout and stderr and adds them to the HYD_Handle structure, before handing control to the control system layer and eventually to the process manager layer.
The process manager creates an executable argument list to launch the proxy. The argument list contains information about the application and its arguments, the environment that needs to be set up for the application, and the partition of processing elements that this proxy is responsible for. The environment also contains additional information that allows the PMI client library in the application to connect to the process manager. The executable argument list is passed down to the bootstrap server along with information on where to launch the executable. In this case, the executable handed to the bootstrap server is the proxy, but the bootstrap server does not interpret this information and can be used transparently to launch either the proxy or the application executable directly.
The bootstrap server launches the processes on the appropriate nodes and keeps track of the stdout, stderr and stdin end points for these processes. This information is passed back to the upper layers. Once the processes have been launched, the job launcher executable registers the stdout, stderr and stdin file descriptors with the demux engine and waits for events on them. Once all these sockets have been closed, it passes control back to the bootstrap server, which waits for each of the spawned processes to terminate and returns their exit statuses.
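As a rough illustration of how a proxy argument list might be assembled, the sketch below divides the ranks among the nodes and builds one argument list per proxy. Everything in it is an assumption made for illustration: the block distribution, the option names (--env, --ranks), the proxy path and the environment variable name are all invented and do not reflect Hydra's actual proxy command line.

```python
def partition_ranks(nprocs, hosts):
    # Block distribution: the first (nprocs % len(hosts)) hosts each get
    # one extra rank. This policy is an assumption, not Hydra's.
    base, extra = divmod(nprocs, len(hosts))
    parts = {}
    start = 0
    for index, host in enumerate(hosts):
        count = base + (1 if index < extra else 0)
        parts[host] = list(range(start, start + count))
        start += count
    return parts


def proxy_argv(proxy_path, app_argv, env, ranks):
    # All option names below are invented for illustration.
    argv = [proxy_path]
    for name, value in sorted(env.items()):
        argv += ["--env", f"{name}={value}"]
    argv += ["--ranks", ",".join(str(rank) for rank in ranks)]
    # Everything after "--" is the application and its arguments.
    argv += ["--"] + list(app_argv)
    return argv


def build_launch_plan(proxy_path, app_argv, env, nprocs, hosts):
    # One (host, argv) pair per proxy; this is what would be handed to
    # the bootstrap server, which does not interpret argv.
    parts = partition_ranks(nprocs, hosts)
    return [(host, proxy_argv(proxy_path, app_argv, env, parts[host]))
            for host in hosts]
```

Calling build_launch_plan("./proxy", ["./app"], {"EXAMPLE_PM_ADDR": "host0:9999"}, 5, ["n0", "n1"]) yields two entries, with ranks 0 to 2 assigned to n0 and ranks 3 and 4 assigned to n1.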
Process Management Proxy: The process management proxy reads the environment that needs to be set up for each process, forks off the processes and sets up the required environment. The current proxy implementation does not allow for hierarchical proxies because of an incorrect interface design, but in principle it can use the bootstrap server itself to launch processes in a tree-like fashion. The proxy forwards I/O between the application processes and its parent proxy or the main process manager. Currently, the proxy does not handle any PMI functionality, but this will be added soon.
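A minimal sketch of the proxy's spawning step follows, assuming the proxy has already received the environment and the list of ranks it is responsible for. The function names and the EXAMPLE_RANK variable are hypothetical; a real proxy would forward output through the demux engine as it arrives rather than collecting it at the end.

```python
import os
import subprocess
import sys


def spawn_local(app_argv, base_env, ranks):
    # Start one application process per rank, each with its own copy of
    # the environment. EXAMPLE_RANK is a made-up variable name.
    procs = {}
    for rank in ranks:
        env = dict(os.environ)
        env.update(base_env)
        env["EXAMPLE_RANK"] = str(rank)
        procs[rank] = subprocess.Popen(
            app_argv, env=env, stdout=subprocess.PIPE)
    return procs


def forward_output(procs, sink):
    # Collect each child's stdout and pass it upward (here, to a list
    # that stands in for the parent proxy or process manager).
    for rank in sorted(procs):
        out, _ = procs[rank].communicate()
        sink.append((rank, out))
    # Report the exit status of every child, keyed by rank.
    return {rank: proc.returncode for rank, proc in procs.items()}
```

For example, spawning [sys.executable, "-c", "import os; print(os.environ['EXAMPLE_RANK'])"] for ranks 0 and 1 lets each child print the rank it was given.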
Process Cleanup During Faults: There are two types of faults as far as the Hydra framework is concerned. In the first case, one of the application processes dies or calls abort. In the second case, the timeout specified by the user expires. In both cases, Hydra tries to clean up all the remaining application processes. Unfortunately, not all bootstrap servers reliably forward signals to processes (e.g., many ssh implementations do not always forward SIGINT correctly). Thus, the cleanup has to be done by the framework itself. For this, each proxy keeps track of the PIDs of the processes that it spawns. In addition to handling stdout/stderr/stdin, the proxies listen for connection requests from the parent proxy or process manager. If the process manager wants to clean up the application processes, it can connect to the proxy and send it a command to clean up the processes associated with it. The proxy does this and passes back the appropriate exit code. The current implementation makes these cleanup connections sequentially, but this can be restructured as a tree.
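The cleanup handshake can be sketched as below. This is an illustration under stated assumptions, not Hydra's wire protocol: the Proxy class, the CLEANUP command string, the use of SIGTERM and the comma-separated reply format are all invented. The sketch assumes a POSIX system, where a child terminated by a signal reports the negated signal number as its return code.

```python
import signal
import socket
import subprocess
import sys


class Proxy:
    """Hypothetical proxy that tracks its children and accepts cleanup requests."""

    def __init__(self):
        self.children = []  # Popen handles; each one records the child's PID
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind(("127.0.0.1", 0))  # let the OS choose a free port
        self.listener.listen(1)
        self.port = self.listener.getsockname()[1]

    def spawn(self, argv):
        self.children.append(subprocess.Popen(argv))

    def serve_one_request(self):
        # Accept one connection from the parent and act on its command.
        conn, _addr = self.listener.accept()
        with conn:
            command = conn.makefile("rb").readline().strip()
            if command == b"CLEANUP":
                for child in self.children:
                    if child.poll() is None:  # still running
                        child.send_signal(signal.SIGTERM)
                codes = [child.wait() for child in self.children]
                reply = ",".join(str(code) for code in codes) + "\n"
                conn.sendall(reply.encode())


def demo():
    proxy = Proxy()
    proxy.spawn([sys.executable, "-c", "import time; time.sleep(60)"])
    # Process manager side: connect to the proxy and request cleanup.
    pm = socket.create_connection(("127.0.0.1", proxy.port))
    pm.sendall(b"CLEANUP\n")
    proxy.serve_one_request()
    reply = pm.makefile("rb").readline().decode().strip()
    pm.close()
    proxy.listener.close()
    return reply
```

A process manager following the sequential scheme described above would repeat the connect, send and read steps once per proxy. In a tree arrangement, each proxy would instead relay the command to its own children.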
Short Term Future Plans
- Bootstrap Servers: We currently support ssh and fork. We plan to add SLURM, PBS and SGE soon. This is being tracked in this ticket: .
- Dynamic Processes: Dynamic processes are not currently supported. This is being tracked in this ticket: .
- Process-core Mapping Ability: Currently, Hydra leaves the scheduling of processes to cores to the OS. Allowing applications to specify mappings needs to be added. This is being tracked in this ticket: .
- Hierarchical Proxies: Currently the proxies only follow a single-level hierarchy as the interface between the proxies and the bootstrap servers is broken. This is being tracked in this ticket: .
- Offloading PMI Functionality to the Proxies: Though the proxies handle process launching, the PMI functionality is still based on a single master, which can be a bottleneck for large systems. Some of the PMI functionality needs to be offloaded to the proxies. This is being tracked in this ticket: .
- Pre-launch Capability for Proxies: In some environments (e.g., Windows), no bootstrap server capability is set up by default. In such environments, it is useful to pre-launch the proxies, either manually or as persistent daemons. This is being tracked in this ticket: .
- Pre-connect Capability for Proxies: Most current PMI server implementations either use a centralized PMI server or go to great lengths to minimize the number of connections. With a pre-connect capability (applicable only to pre-launched proxies), we can not only eliminate the connection time at launch, but also connect the proxies all-to-all on networks that are resource scalable (e.g., non-connection oriented). This lets us use the high-speed communication capabilities of these networks and also get any PMI key value in a single hop instead of multiple hops. This is being tracked in this ticket: .