'''Hydra Process Management Framework'''

''From Mpich''
'''{{color|red|NOTE: this is a rough first draft of the Hydra design as it stands right now (03/15/2009). The design may change as the project matures, and there may be periods where this document is out of sync with those changes, but we will try to keep it up to date.}}'''
__TOC__
  
 
== Overall Framework ==
 
 
The Hydra framework has the following basic components:
 
# User Interface (UI, e.g., mpiexec)
# Resource Management Kernel (RMK)
# Process manager
# Bootstrap server (e.g., ssh, fork, pbs, slurm, sge)
# Process Binding (e.g., plpa)
# Communication Subsystem (e.g., IB, MX)
# Process Management proxy
# I/O demux engine
Architecturally, their relationship is shown in the figure below:
[[Image:hydra_arch.png|280px|thumb|left|Source: [[Image:hydra_arch.fig.txt]]]]
'''User Interface:''' The main responsibility of this layer is to collect information from the user about the application: where to launch the processes and how to map processes to cores. It also reads stdin and forwards it to the appropriate process(es), and reads stdout/stderr from the different processes and directs them appropriately.

'''Resource Management Kernel (RMK):''' The RMK provides plug-in capabilities to interact with resource managers (e.g., Torque, Moab or Cobalt). For example, if the application needs to allocate nodes on a system before launching a job, the RMK plays this part. Similarly, the RMK can also allow for decoupled job launching in cases where a single system reservation is to be used for multiple jobs. In the current implementation, the RMK is very simple and does not provide any of these functionalities.

'''Process Manager:''' The process manager deals with providing processes with the necessary environment setup as well as the primary process management functionality. For example, the "pmiserv" process manager provides the MPICH PMI (Process Management Interface) functionality. Currently, only PMI-1 is supported, but we plan to add PMI-2 as well (which is currently being drafted). Other process managers could support other interfaces.
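As background for the PMI functionality mentioned above: PMI-1 requests are line-oriented messages of space-separated <code>key=value</code> fields. The following is a minimal, illustrative sketch of a process manager servicing put/get requests against its key-value space; it is not the actual pmiserv code, and the command and field names are simplified.

```python
# Toy sketch of a PMI-1-style key-value service (illustrative only).
# Each request is one line of space-separated key=value fields,
# e.g. "cmd=put key=k value=v".

def parse_pmi(line):
    """Parse a PMI-style line into a dict of its key=value fields."""
    return dict(field.split("=", 1) for field in line.split())

def handle_pmi(kvs, line):
    """Service one request against the KVS; return the reply line."""
    msg = parse_pmi(line)
    if msg["cmd"] == "put":
        kvs[msg["key"]] = msg["value"]
        return "cmd=put_result rc=0"
    if msg["cmd"] == "get":
        value = kvs.get(msg["key"])
        if value is None:
            return "cmd=get_result rc=-1"       # key not published yet
        return f"cmd=get_result rc=0 value={value}"
    return "cmd=unknown rc=-1"

kvs = {}
print(handle_pmi(kvs, "cmd=put key=shmem_addr value=0xdead"))
print(handle_pmi(kvs, "cmd=get key=shmem_addr"))
```

In the real implementation the KVS is per-job and requests arrive over sockets from the PMI client library linked into each application process.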
  
'''Process Management Proxy:''' The process management proxy is a helper agent spawned on each node of the system to assist the process manager with process spawning, process cleanup, signal forwarding, I/O forwarding, and any process-manager-specific functionality. It can perform essentially any task the process manager can, so even a hierarchy of process management proxies can be created, where each proxy acts as a process manager for its sub-tree.
  
'''Bootstrap Server:''' The bootstrap server mainly functions as a pre-configured daemon system that allows the upper layers to launch processes throughout the system. For example, the ssh bootstrap server forks off processes, each of which performs an ssh to a single machine to launch one process.
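The key property of the bootstrap layer is that the upper layers pick a backend by name and never interpret what is being launched. A minimal sketch of that abstraction (the function and table names here are illustrative, not Hydra's internal C API):

```python
# Sketch of a bootstrap-server abstraction: each backend only knows how
# to start a command "on a node" and hand back the process handle.
import subprocess

def fork_launch(node, argv):
    """'fork' bootstrap: launch locally, ignoring the node name."""
    return subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

def ssh_launch(node, argv):
    """'ssh' bootstrap: one ssh per node, each launching one process."""
    return subprocess.Popen(["ssh", node] + argv,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

BOOTSTRAP_SERVERS = {"fork": fork_launch, "ssh": ssh_launch}

# The caller does not care whether argv is a proxy or the application.
proc = BOOTSTRAP_SERVERS["fork"]("localhost", ["echo", "hello"])
out, _ = proc.communicate()
print(out.decode().strip())
```

Because both backends return the same kind of handle (with stdout/stderr pipes attached), the launched executable can transparently be the proxy or the application itself, exactly as described above.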
  
'''Process Binding:''' The process binding component deals with extracting system architecture information (such as the number of processors, cores and SMT threads available, their topology, shared caches, etc.) and binding processes to different cores in a portable manner. PLPA is one such library already used in Hydra, but it provides only limited information. We are currently working on a more enhanced implementation of this component.
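To illustrate the idea (not the PLPA-based C implementation): discover which cores the process may run on, then pin it to one of them. This sketch uses <code>os.sched_getaffinity</code>/<code>os.sched_setaffinity</code>, which are Linux-only stand-ins for the portable binding layer described above.

```python
# Sketch of core discovery and binding (Linux-only stand-in for PLPA).
import os

def available_cores():
    """Return the sorted set of cores this process may run on."""
    return sorted(os.sched_getaffinity(0))

def bind_to_core(core):
    """Pin the current process to a single core."""
    os.sched_setaffinity(0, {core})

cores = available_cores()
print(f"{len(cores)} cores available: {cores}")
# A launcher might bind local rank r to core r % len(cores):
# bind_to_core(rank % len(cores))
```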
  
'''Communication Subsystem:''' The communication subsystem is a way for different proxies to communicate with each other in a scalable manner. This is only relevant for the pre-launched and pre-connected proxies described below. This component provides a scalable communication mechanism irrespective of the system scale (e.g., InfiniBand UD based or Myrinet MX based).

'''I/O Demux Engine:''' This is basically a convenience component, where different components can "register" their file descriptors and the demux engine can wait for events on any of these descriptors. This gives a centralized event management mechanism, so we do not need different threads blocking for different events. The I/O demux engine uses a synchronous callback mechanism: with each file descriptor, the calling process provides a function pointer to the function that has to be called when an event occurs on the descriptor. The demux engine blocks for events on all its registered file descriptors and calls the appropriate callbacks whenever there is an event. This component is very useful in the implementation, but does not play a crucial role in the architecture itself.
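The register-then-wait pattern with synchronous callbacks can be sketched in a few lines using Python's <code>selectors</code> module (the class and method names are illustrative, not Hydra's actual demux API):

```python
# Sketch of a synchronous-callback demux engine: components register
# (fd, callback) pairs; the engine blocks on all registered descriptors
# and invokes the matching callback for each event.
import os
import selectors

class DemuxEngine:
    def __init__(self):
        self.sel = selectors.DefaultSelector()

    def register(self, fd, callback):
        self.sel.register(fd, selectors.EVENT_READ, callback)

    def deregister(self, fd):
        self.sel.unregister(fd)

    def wait_for_event(self, timeout=None):
        for key, _ in self.sel.select(timeout):
            key.data(key.fileobj)        # synchronous callback

# Usage: forward whatever becomes readable on a pipe to a collector,
# the way stdout from a launched process would be forwarded.
r, w = os.pipe()
received = []
engine = DemuxEngine()
engine.register(r, lambda fd: received.append(os.read(fd, 1024)))
os.write(w, b"stdout data")
engine.wait_for_event(timeout=1)
print(received)   # [b'stdout data']
```

A single loop over <code>wait_for_event</code> replaces one blocking thread per descriptor, which is exactly the centralization benefit described above.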
  
  
 
== Implementation and Control Flow ==
 
There are three basic executable classes in the current model: the UI process, the process management proxy, and the application process.

[[Image:hydra.png|border|700px|left]]

'''UI Process:''' The UI process (e.g., mpiexec) takes the user arguments, environment settings, information about the nodes on which processes are to be launched, and other such details provided by the user, and passes them to the RMK. The RMK, if appropriate, interacts with the resource manager to either verify access to the nodes on which the processes need to be launched or to allocate the nodes, and passes this information to the process manager. The process manager creates an executable argument list to launch the proxy, containing information about the application itself and its arguments, the environment that needs to be set up for the application, and the partition of processing elements that this proxy will be responsible for. The environment also contains additional information that allows the PMI client library in the application to connect to the process manager. The executable argument list is passed down to the bootstrap server together with information on where to launch this executable; in this case, the executable handed over to the bootstrap server is the proxy, but the bootstrap server itself does not interpret this information and can be used transparently for the proxy or the application executable directly. The bootstrap server launches the processes on the appropriate nodes and keeps track of the stdout, stderr and stdin end points for these processes. This information is passed back to the upper layers. Once the processes have been launched, the UI executable registers the stdout, stderr and stdin file descriptors with the demux engine and waits for events on them. Once all these sockets have been closed, it passes control back to the bootstrap server, which waits for each of the spawned processes to terminate and returns their exit statuses.
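The argument-list construction step can be sketched as follows. The flag names, paths and helper below are purely illustrative (Hydra's actual proxy options are not specified here); the point is that everything the proxy needs — the application command line, its environment, its partition, and how to reach the process manager — travels as one flat argument list.

```python
# Illustrative sketch of assembling the proxy's launch command.
# Flag names and paths are hypothetical, not Hydra's actual options.

def build_proxy_args(proxy_path, app_argv, env, partition, pm_port):
    """Flatten everything the proxy needs into one argument list."""
    args = [proxy_path,
            "--pm-port", str(pm_port),           # lets PMI clients call home
            "--partition", ",".join(partition)]  # elements this proxy owns
    for key, value in env.items():
        args += ["--env", f"{key}={value}"]      # environment to set up
    return args + ["--exec"] + app_argv          # the application itself

args = build_proxy_args("/usr/bin/hydra_proxy",
                        ["./a.out", "-n", "4"],
                        {"PATH": "/usr/bin"},
                        ["node0:0-3"], 9899)
print(args)
```

This list is exactly what would be handed to the bootstrap server, which launches it on each node without interpreting it.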
  
'''Process Management Proxy:''' The process management proxy sets up the required environment and forks off the application processes. Proxies can be set up in a hierarchical fashion. They forward I/O to and from the application processes to a parent proxy or the main process manager. The proxy implements the Process Management Interface (PMI) API and keeps a local cache of the PMI KVS space. PMI calls made by the application that cannot be completed locally are forwarded upstream to the main process manager.
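The local-cache-with-upstream-forwarding behavior can be sketched as below; the class and the <code>upstream_get</code> callable are illustrative stand-ins for the proxy's real cache and its connection to the parent proxy or process manager.

```python
# Sketch of the proxy's local PMI KVS cache: lookups are satisfied
# locally when possible and forwarded upstream otherwise.

class ProxyKVS:
    def __init__(self, upstream_get):
        self.cache = {}                  # local slice of the KVS space
        self.upstream_get = upstream_get # stand-in for a real connection

    def put(self, key, value):
        self.cache[key] = value          # local puts are cached immediately

    def get(self, key):
        if key not in self.cache:        # miss: one hop upstream, then cache
            self.cache[key] = self.upstream_get(key)
        return self.cache[key]

# The process manager's KVS, reachable "upstream" from the proxy.
pm_kvs = {"rank0_addr": "10.0.0.1:5000"}
proxy = ProxyKVS(upstream_get=pm_kvs.__getitem__)
proxy.put("rank1_addr", "10.0.0.2:5000")
print(proxy.get("rank1_addr"))  # served from the local cache
print(proxy.get("rank0_addr"))  # forwarded upstream once, then cached
```

Caching keys after the first upstream fetch is what keeps repeated PMI gets from hammering the single main process manager.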
  
'''Process Cleanup During Faults:''' There are two types of faults as far as the Hydra framework is concerned: in the first case, one of the application processes dies or calls an abort; in the second case, the timeout specified by the user expires. In both cases, Hydra tries to clean up all the remaining application processes. Unfortunately, not all bootstrap servers reliably forward signals to processes (e.g., many ssh implementations do not forward SIGINT correctly all the time), so the cleanup has to be done manually by the framework. For this, each proxy keeps track of the PIDs of the processes that it spawns. Together with stdout/stderr/stdin, the proxies also listen for connection requests from the parent proxy or process manager. If the process manager wants to clean up the application processes, it can connect to a proxy and send it a command to clean up the processes associated with it. The proxy does this and passes back the appropriate exit code. The current implementation performs this cleanup connection sequentially, but it can be made into a tree.
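The PID-tracking side of this can be sketched as follows; the functions are illustrative (a real proxy would receive the cleanup command over its listening connection rather than call it directly):

```python
# Sketch of fault-time cleanup: since bootstrap servers may not forward
# signals reliably, the proxy records every PID it spawns and kills the
# children itself when told to clean up.
import os
import signal
import subprocess

spawned_pids = []

def spawn(argv):
    """Fork off an application process and remember its PID."""
    proc = subprocess.Popen(argv)
    spawned_pids.append(proc.pid)
    return proc

def cleanup():
    """Handle a 'cleanup' command from the process manager."""
    for pid in spawned_pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass                 # process already exited on its own

proc = spawn(["sleep", "60"])
cleanup()
print(proc.wait())               # negative: killed by signal, not exited
```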
  
== Ongoing Work ==
  
There are several pieces of ongoing work as described below:

'''Integration with other programming models:''' A lot of the Hydra framework is independent of any particular parallel programming model. Several of its tools and utilities are as usable by MPI as by any other programming model. For example, tools such as process launching, bootstrap servers, process-core binding, debugger support and many others are common to all parallel programming models and should not be reimplemented for each one. As a proof of concept, we are working on using Hydra with Global Arrays applications, and we are considering extending it to work with other models as well.

'''Pre-launched and Pre-connected Proxies:''' One of the issues in process management on large-scale systems is the time taken to launch processes and to initiate their interaction. For example, in the case of MPICH, even after the application processes have been launched, each process does not know anything about the other processes; it has to interact with the process manager, which exchanges this information. This typically happens at start time, so it is not noticed in benchmark performance numbers, but as system sizes grow larger, it can become a major concern. In our pre-launched and pre-connected proxy model, the proxies are launched by a system administrator (or the user) at "initialization time", i.e., one initialization is sufficient to run any number of jobs. Once launched, the proxies connect with each other. For scalable network stacks such as Myrinet MX and InfiniBand UD, the amount of resources used does not grow significantly with the system size, so an all-to-all connection is preferred. Once connected, each proxy can interact with any other proxy to exchange information in a single hop, largely insulating applications from the scale of the system.

'''Hierarchical Proxies:''' Non-pre-launched proxies are currently launched in a one-level hierarchy: each proxy handles the processes on its node and is connected to the main mpiexec.
  
 
== Short Term Future Plans ==
 
# '''Executable Staging:''' Currently, it is expected that the executable required to launch a process is available on all systems. This is not true for systems that do not share a common file-system. Executable staging ships the executable binary to the compute node at run-time.

[[Category:Design Documents]]
 

Latest revision as of 18:42, 24 June 2014
