This page is for the formal specification of PMI-2.
See PMI v2 API for discussion of possible designs and open issues.
(WDG - I find it valuable to list objectives and requirements first. Here's an initial list. They can and should be expanded, and the consequences of each understood)
- Scalable - Semantics of operations must permit scalable implementation
- Efficient - Must provide MPI implementation with the information that it needs without requiring potentially expensive steps.
- Complete - Must support all of MPI, including dynamic processes
- Robust - Must handle failures and aborts, including any resources acquired by the MPI application.
- Correct - Must avoid race conditions in the design
- Portable - Must not assume a particular environment such as POSIX
The basic idea is that all interaction with external process and resource managers, as well as the exchange of any information required to contact other processes in the same parallel job, takes place through the process management interface or PMI.
There are four separate sets of functionality:
- Creating, connecting with, and exiting parallel jobs
- Accessing information about the parallel job or the node on which a process is running
- Exchanging information used to connect processes together
- Exchanging information related to the MPI Name publishing interface
While these can be combined within a single, full-featured process manager, in many cases, each set of services may be provided by a different actor. For example, creating processes may be managed by a system such as PBS or LoadLeveler. The Name publishing service may be accomplished by reading and writing files in a shared directory. Information about the parallel job and the node may be provided by mpiexec, and the connection information may be handled with a scalable, distributed tuple-space system.
There are three groupings of processes that are important in understanding the process manager interface.
- Process
- An MPI process; this is usually an OS process (but need not be; an example would be threads in a language that keeps named globals thread-private by default).
- Job
- This is a collection of processes managed together by a process manager that understands parallel applications. A job contains all of the processes in a single MPI_COMM_WORLD and no more. That is, two processes are in the same job if and only if they are in the same MPI_COMM_WORLD.
- Connected Jobs
- This is a collection of jobs that have established a connection through the use of PMI_Job_Spawn or PMI_Job_Connect. If any process in a job establishes a connection with any process in another job, then all processes in both jobs are connected. That is, connections are established between jobs, not processes. This is necessary to implement the MPI notion of connected processes.
In addition, it is desirable to allow the PMI client interface to be implemented with a dynamically loadable library. This allows an executable to load a version of PMI that is compatible with whatever process management system will be running the application, without requiring the process management systems to implement the same communication (or wire) protocol. The consequence of this is that the pmi.h header file is standardized across all PMI client implementations (in PMI v1, each PMI client implementation could, like MPI, define its own header file).
The PMI interface represents most data as printable characters rather than as raw binary. This simplifies support for systems with heterogeneous data representations and also simplifies the "wire" protocol. The character set for PMI v2 is UTF-8; this is a variable-length representation that contains ASCII as a subset and for which the null byte is always a string terminator. All character data in the PMI v2 interface is in the UTF-8 character set. The rationale for using UTF-8 over ASCII is to avoid problems with internationalization in the case where commands return user-defined error strings.
Commands are exchanged between the PMI client and server in a simple key=value format. There are a number of predefined keys that are used consistently in the commands; additional keys may be defined as necessary to provide new services (though the expectation is that few if any further changes will be required to this specification).
The PMI server indicates whether an error was detected by setting the value of the key rc to a non-negative integer (defined below). If the value is positive (indicating an error), the command may optionally include the key errmsg, which provides an error message string that the client may choose to print. This allows the client side of the code to give the user more detailed information about a problem.
In version 1 of the PMI wire protocol, PUT/GET operations were blocking and were implemented as request-response transactions over the wire. The server could assume that a second request would not arrive over the wire, from the same client, before the first request was satisfied. This is not appropriate for expensive operations such as spawn (or even GET on some implementations), since it may take a great deal of time (particularly on a batch scheduled system) for the new processes to be launched. This requires that spawn be a split operation.
To allow for this, in PMI-2, PUT/GET operations are nonblocking and are complete only after a FENCE call. The PMI wire protocol therefore needs to associate a request id with each request and provide the same id in its response. The id is carried in the field "reqid" (short for "request id"), which identifies the operation associated with the request; responses include this field with the same value. The id need only be unique within the issuing process, since the data comes back to that particular process. The client must use unique values for reqid to distinguish concurrent operations.
Note that FENCE is locally blocking, and collective, but need not collectively synchronize.
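As an illustration of the reqid matching described above, here is a minimal Python sketch of how a client might track outstanding nonblocking requests. The class name `PendingRequests`, the counter-based id scheme, and the command names `kvs-put` and `kvs-get` are assumptions made for the example; this specification only requires that reqid values be unique within the process and be echoed in the response.

```python
import itertools


class PendingRequests:
    """Tracks outstanding nonblocking requests by reqid (hypothetical helper)."""

    def __init__(self):
        # A per-process counter is enough: ids need only be unique locally.
        self._next_id = itertools.count(1)
        self._pending = {}  # reqid -> command still waiting for its response

    def issue(self, command):
        """Attach a fresh reqid to an outgoing command and remember it."""
        reqid = str(next(self._next_id))
        self._pending[reqid] = command
        return dict(command, reqid=reqid)

    def complete(self, response):
        """Match a response to its request using the echoed reqid."""
        return self._pending.pop(response["reqid"])

    def outstanding(self):
        return len(self._pending)


reqs = PendingRequests()
# "kvs-put" and "kvs-get" are placeholder command names for this sketch.
put = reqs.issue({"cmd": "kvs-put", "key": "foo", "value": "bar"})
get = reqs.issue({"cmd": "kvs-get", "key": "foo"})

# Responses may arrive in any order; the reqid says which request each answers.
answered = reqs.complete(
    {"cmd": "kvs-get-response", "reqid": get["reqid"], "rc": "0"}
)
```

Keeping the table keyed by reqid means the client never has to assume that responses arrive in the order the requests were sent.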
PMI Wire Protocol Version 2
- Boolean values are always one of the strings "true" or "false".
- As mentioned previously, return codes (rc) are non-negative integers that may be represented by an int32_t integer in C (using a signed integer gives more flexibility in the calling code).
The client establishes a connection to the server. The client may choose to, and/or the server may require that the client, use a secure connection, such as OpenSSL. This part of the wire protocol is not defined. It is recommended that clients and servers either use OpenSSL or use plain sockets, making use of the authentication extensions defined below on the fullinit command.
In order to initialize the connection between the client and the server, the client needs to know how to establish the connection. This may be specified by the following environment variables:
- Use this FD (encoded as a character)
- Use this host and port, encoded as hostname:port-number, e.g., myhost.edu:12674
Once the socket connection is established, the client sends commands to the server in the following format:
where length is exactly 6 ASCII digits (blank padded on the left if necessary) giving the length, and the command is exactly that many bytes (not characters, since PMI-2 uses UTF-8 and there may be some multi-byte characters). The commandline is of the form
If a semicolon needs to be part of a key or value, it needs to be escaped by being doubled. No other characters (not even new lines) are special. The semicolon was chosen because it is often used to terminate commands, and hence is rarely used in the sort of values that are likely to be communicated with PMI.
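The framing and escaping rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the length field is taken to be followed immediately by the command with no separator, and the command body is taken to be a run of `key=value;` items as in the init example. The function names `escape`, `encode_command`, and `decode_command` are hypothetical.

```python
def escape(text):
    """Double every semicolon so it is not read as a terminator."""
    return text.replace(";", ";;")


def encode_command(pairs):
    """Frame an ordered list of (key, value) pairs as one message."""
    body = "".join(f"{escape(k)}={escape(v)};" for k, v in pairs)
    payload = body.encode("utf-8")
    if len(payload) > 999999:
        raise ValueError("command does not fit in a 6-digit length field")
    # The length counts bytes, not characters, and is blank padded on the left.
    # Assumption: no separator between the length field and the command.
    return f"{len(payload):>6d}".encode("ascii") + payload


def decode_command(message):
    """Parse one framed message back into (key, value) pairs."""
    length = int(message[:6].decode("ascii"))
    body = message[6:6 + length].decode("utf-8")
    pairs, field, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == ";" and body[i + 1:i + 2] == ";":
            field.append(";")          # doubled semicolon: literal character
            i += 2
        elif ch == ";":
            # Single semicolon: end of this key=value item.
            key, _, value = "".join(field).partition("=")
            pairs.append((key, value))
            field = []
            i += 1
        else:
            field.append(ch)
            i += 1
    return pairs


request = [("cmd", "init"), ("pmi-version", "2"), ("note", "a;b=c")]
framed = encode_command(request)
roundtrip = decode_command(framed)
```

Splitting on the first `=` only is deliberate: key names cannot contain `=`, but values can.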
The first command the client can send to the server is the init command
cmd=init;pmi-version=n;pmi-subversion=m;maxkeylen=kl;maxvallen=vl;
cmd=fullinit-response;pmi-version=n;pmi-subversion=m;size=s;appnum=a;
    spawner-jobid=id;verbose=bool;pmi-jobid=j;
    rc=code;errmsg=string;
Command names are more limited and must be letters, digits, and hyphens only. Key names are limited to letters, digits, hyphens, and underscores. To be precise, the regular expression for keys is
Note that values are not limited to those characters, and may contain spaces, equal signs, newlines, and even nulls (but not semicolons, unless they are escaped by doubling as described above).
Implementations must be prepared to accept values that contain the = character. See notes on the character set below.
Note that the semicolon is the command terminator, not separator. This slightly simplifies processing of commands.
There are many predefined commands. For most commands, there is a response; in that case, the name of the command adds -response. Responses normally include a return code; that is given by a key of rc. A typical sequence is to send
For any commands that occur after MPI initialization, there is an additional thrid field on both the command and the response. This is used to support thread safety, and was described in more detail above. This field should be the first field after the command.
In some cases, a value or a command may require an excessive number of characters. Rather than require that the PMI client and server support arbitrarily long commands, a command may be split into multiple messages by adding a
at the end and a
at the beginning of the continuation. The id value is arbitrary, but should be sufficient to identify a stream of commands. For example, simply using the process or thread id will usually be sufficient.
Many commands have a response form that includes an errmsg=string item. If there is no error (rc=0), this field may be omitted.
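A client-side check of the response conventions described above (the `-response` suffix, the `rc` key, and the optional `errmsg` key) might look like the following Python sketch. The exception class `PMIError` and the function `check_response` are hypothetical names, and the response is assumed to have already been parsed into a dictionary.

```python
class PMIError(Exception):
    """Raised when a response is unexpected or carries a non-zero rc."""


def check_response(request_cmd, response):
    """Validate a parsed response against the command that was sent."""
    expected = request_cmd + "-response"
    if response.get("cmd") != expected:
        raise PMIError(f"expected {expected}, got {response.get('cmd')}")
    rc = int(response["rc"])
    if rc != 0:
        # errmsg is optional, so fall back to a generic message.
        raise PMIError(response.get("errmsg", f"PMI error code {rc}"))
    return response


ok = check_response(
    "name-publish", {"cmd": "name-publish-response", "thrid": "0", "rc": "0"}
)

try:
    check_response(
        "name-lookup",
        {"cmd": "name-lookup-response", "rc": "3", "errmsg": "no such service"},
    )
    failure = None
except PMIError as exc:
    failure = str(exc)
```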
The following commands and their arguments are defined. These are the ones needed to implement the PMI version 2 API. The first command is the one sent by the client to the PMI server; the second (always a -response form) is sent back to the client. The key rc is used consistently for a return code. A zero return code is success; non-zero codes indicate failure and are documented below. Note that some commands have an instance-specific number of values (such as info keys or arguments to a command that is to be executed).
After the standard init with version string,
cmd=fullinit;pmijobid=string;pmirank=r;threaded=bool;authtype=name;authinfo=string;
cmd=fullinit-response;pmi-version=n;pmi-subversion=m;rank=r;size=s;appnum=a;
    spawner-jobid=string;debugged=bool;
    pmiverbose=flag;rc=code;errmsg=string;
(balaji: what's the difference between debugged and pmiverbose?)
The client process sends a request to begin the initialization by using the fullinit command. This has three optional keys:
- pmijobid
- This is a string that allows the PMI server to identify processes that belong to the same parallel job. In the simple PMI implementation, if the processes are started by mpiexec, the environment variable PMI_JOBID is set with this string; the processes can check in using this string. This is only required if the server will require a way to identify the processes in a job; if no PMI_JOBID is set, then the pmijobid key is not required. Another use for pmijobid is for processes that are started outside of the process management system but that still need the PMI services, such as the information on the rank and size of the job. This may occur if, for example, a parallel debugger starts the processes.
- pmirank
- The rank in MPI_COMM_WORLD of this process. This is provided only if the process already knows this value; that may happen if, for example, the processes are started by some outside system, or if the environment variable PMI_RANK is set.
- threaded
- If true, then PMI will require thread ids in messages in order to provide thread safety. If this is false, the wire protocol need not provide thread id values. This provides a modest optimization.
- authtype
- An authentication type; this is a string name that the client and server agree upon. See the example below for details. This is optional; if the server does not need authentication (for example, the connection is provided through a pre-existing file descriptor), this field is not required.
- authinfo
- Information to be used to establish authentication.
If the job was spawned, the key spawner-jobid is given with the job id of the spawner. If the process was not spawned (e.g., created with mpiexec), then this key is not provided in the fullinit-response. This jobid can be used in job-connect.
The authtype and authinfo strings are used to allow the PMI client and server to negotiate an authorization method. In relatively secure environments, particularly ones with a shared secret, this can use a challenge-response handshake (which will take place before the fullinit-response command is returned). In this case, the sequence looks something like this:
(authinfo in this case is not needed and thus not sent).
The server takes this and returns the following:
where "n" is a random integer. Then the client forms a string by concatenating the shared secret with n and then creating the sha-256 hash of that value. It sends to the server the command
where string is the sha-256 hash. At this point, the server can decide if it is willing to accept the client. If so, it returns the fullinit-response command; otherwise it closes the connection (and may wish to log the failed attempt).
(goodell: we may want to use SHA-256 or something similar that doesn't yet have a known attack against it. MD5 and SHA-1 have known problems.)
(gropp:I've switched MD5 to SHA-256)
For additional security, the random integer itself would be encrypted, and some function of the integer would be used by the client (this is what Kerberos does).
This is a simple example. Note that in this case, the connection itself is not secured or encrypted. Other strategies may require additional exchanges; this is easily accomplished by sending cmd=auth-response between the client and server until one or the other indicates completion of the handshake by sending the auth-response-complete command.
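The hash computation in the challenge-response handshake described above can be sketched in Python as follows. Several details are assumptions, since the text does not fix them: the random integer is rendered as decimal text, the secret is placed before it, the string is encoded as UTF-8, and the digest is sent as lowercase hexadecimal. The function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Server side: pick the random integer n sent to the client."""
    return secrets.randbelow(2**31)


def challenge_digest(shared_secret, n):
    """Client side: SHA-256 of the shared secret concatenated with n.

    Assumption: n is appended as decimal text and the result is hex encoded.
    """
    data = (shared_secret + str(n)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def server_accepts(shared_secret, n, client_digest):
    """Server side: recompute the digest and compare in constant time."""
    expected = challenge_digest(shared_secret, n)
    return hmac.compare_digest(expected, client_digest)


n = make_challenge()
digest = challenge_digest("example-shared-secret", n)
accepted = server_accepts("example-shared-secret", n, digest)
rejected = server_accepts("some-other-secret", n, digest)
```

Using `hmac.compare_digest` for the comparison avoids leaking how many leading characters matched; it is a choice made for this sketch, not something the protocol text requires.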
There is no response for this command.
(goodell: these job commands refer to a job id with different key names in different commands (jobid/name/id), could we standardize on "jobid"?)
(gropp: I have attempted to standardize these as jobid.)
This is a complex command because there are many fields and the command may take a long time to execute. A response will be sent when the operation completes; in a multi-threaded environment, the PMI client implementation must allow other threads to make PMI requests while waiting for the spawn request to complete. The command spawn and each spawn-cmd must be sent consecutively (i.e., no other thread may access the PMI wire until the entire command is sent). Once the last spawn-cmd is sent, however, other threads should be allowed to make PMI calls, and the PMI implementation must be prepared to accept a spawn-response command at any time.
cmd=spawn;thrid=string;ncmds=count;preputcount=n;ppkey0=name;ppval0=string;...;
cmd=spawn-cmd;thrid=string;maxprocs=n;argc=narg;argv0=name;argv1=name;...;
    infokeycount=n;infokey0=key;infoval0=string;...;
... one for each command (this is spawn multiple)
cmd=spawn-response;thrid=string;rc=code;errmsg=string;
    jobid=string;nerrs=count;err0=e0;err1=e1;...;
Jobid in the spawn-response command is the JobId that may be used in PMI_Job_Connect. This information may be needed by other processes in order to use PMI_KVS_Get.
The thrid for the spawn and spawn-cmd commands must be the same (for the same spawn command).
The info keys include the ones defined in the MPI-2 specification (Section 5.3.4) and, in addition, the following:
A timeout limit, in milliseconds (integer)
A string containing credentials needed to start a job (this may include user, charge group, and pass phrases). Note that it is essential that this be encrypted if any sensitive information, such as passwords or pass phrases, are included. The PMI client and server should agree on the method for securing this information.
The number of anticipated threads per MPI Process (integer). This may be used by a resource manager in allocating processor resources.
Note the use of infokeyi=key and infovali=value instead of key=value. This avoids possible conflicts both with predefined names in the commands and with key names that contain special characters.
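To make the indexed argv and info-key layout concrete, here is a small Python sketch that builds the key/value list for one spawn-cmd. The function name `spawn_cmd_pairs` is hypothetical, the output is an ordered list of pairs rather than wire bytes, and the `wdir` info key is used only as a familiar example from the MPI specification.

```python
def spawn_cmd_pairs(thrid, maxprocs, argv, info):
    """Return the ordered (key, value) pairs for one spawn-cmd command."""
    pairs = [
        ("cmd", "spawn-cmd"),
        ("thrid", thrid),
        ("maxprocs", str(maxprocs)),
        ("argc", str(len(argv))),
    ]
    # Arguments are numbered from zero: argv0, argv1, ...
    pairs += [(f"argv{i}", arg) for i, arg in enumerate(argv)]
    pairs.append(("infokeycount", str(len(info))))
    # Each info entry becomes two items, so the user's key name travels as a
    # value and cannot collide with a predefined command key.
    for i, (key, value) in enumerate(info.items()):
        pairs.append((f"infokey{i}", key))
        pairs.append((f"infoval{i}", value))
    return pairs


pairs = spawn_cmd_pairs(
    thrid="t1",
    maxprocs=4,
    argv=["./a.out", "-n"],
    info={"wdir": "/tmp/run;1"},
)
```

Because the info key name is carried as a value, it is subject only to the value rules (including semicolon escaping when the command is serialized), not to the stricter rules for key names.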
Question: Note that this supports the preput operations used in PMI version 1. We may want to change this to make that data arrive in the fullinit command when the job is spawned.
Question: Do we still need this, or is it unnecessary because the init step could return the job id (though it does not yet)?
The kvscopy value is yes if the PMI client needs to help the PMI servers share KVS information. This is necessary when the PMI servers for the jobs are different and have no way to connect. In this case, the commands kvs-getall and kvs-putall are also used (note that these do not have PMI Client API equivalents).
cmd=kvs-getall;thrid=string;jobid=string;
cmd=kvs-getall-paircount-response;thrid=string;npair=n;
cmd=kvs-getall-pair-response;thrid=string;key=val;val=val;
cmd=kvs-getall-response;thrid=string;rc=code;errmsg=string;
After the kvs-getall command, the server returns the number of KVS pairs with the kvs-getall-paircount-response command. Following that is one kvs-getall-pair-response for each key in the KVS space. The kvs-getall-response is sent after all key/value pairs are returned.
cmd=kvs-putall;thrid=string;jobid=string;npair=n;
cmd=kvs-putall-pair;thrid=string;key=val;val=val;
cmd=kvs-putall-response;thrid=string;rc=code;errmsg=string;
The client sends one kvs-putall-pair command for each of the keyval pairs. This approach helps ensure that individual messages are of reasonable length, since the number of pairs will often be at least as large as the number of processes in MPI_COMM_WORLD.
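The message sequence for kvs-putall can be sketched as a Python generator that yields one command at a time, so that no single message grows with the size of the KVS. The function name `kvs_putall_commands` is hypothetical, and each command is shown as a list of key/value pairs rather than as wire bytes.

```python
def kvs_putall_commands(thrid, jobid, kvs):
    """Yield the kvs-putall header, then one kvs-putall-pair per entry."""
    yield [
        ("cmd", "kvs-putall"),
        ("thrid", thrid),
        ("jobid", jobid),
        ("npair", str(len(kvs))),
    ]
    for key, val in kvs.items():
        yield [
            ("cmd", "kvs-putall-pair"),
            ("thrid", thrid),
            ("key", key),
            ("val", val),
        ]


# Hypothetical KVS contents, used only to exercise the generator.
commands = list(
    kvs_putall_commands("0", "job-1", {"rank0-addr": "a", "rank1-addr": "b"})
)
```

The client would then wait for a single kvs-putall-response after the last pair has been sent.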
This command does not need a thrid because it can only be used when all processes are guaranteed to be able to issue a PMI Fence operation. Effectively, that can only happen before MPI_Init returns, so there can only be one thread processing these operations. However, a thrid is included to eliminate differences in handling commands.
Note that because of the requirements on PMI_KVS_Fence, PMI_KVS_Put cannot be used after MPI_Init, except possibly in collective functions on MPI_COMM_WORLD.
The jobid field is optional; if it is not given, then the job id for this process is assumed. The value of the jobid value must be one of the connected jobids.
The srcid field is a hint indicating which process might have PUT the corresponding key value pair. If the hint is incorrect, the server should still return the correct value (e.g., if process 1 performed a put of key=foo and value=bar, then later a get is performed with srcid=0 and key=foo, the server should respond with flag=TRUE and value=bar). If a negative value is passed in srcid, or if the field is omitted altogether, then the hint is ignored. This field is optional.
The thrid isn't necessary in practice but is included to simplify command handling.
Note that the PMI API defines some key names; others may be added. If the key is unknown or there is no associated value for that key, the value of flag is false. If the wait key is set with the value true, then the server will not respond until the value becomes available (in that case, flag will always be set to found). If the job exits before the value becomes available, the server will treat that as any other unexpected termination of the job.
cmd=name-publish;thrid=string;name=servicename;port=portname;
    infokeycount=n;infokey0=name;infoval0=string;infokey1=name;...;
cmd=name-publish-response;thrid=string;rc=code;errmsg=string;
cmd=name-lookup;thrid=string;name=servicename;
    infokeycount=n;infokey0=name;infoval0=string;...;
cmd=name-lookup-response;thrid=string;value=string;flag=found;rc=code;errmsg=string;
cmd=name-unpublish;thrid=string;name=servicename;
    infokeycount=n;infokey0=name;infoval0=string;infokey1=name;...;
cmd=name-unpublish-response;thrid=string;rc=code;errmsg=string;
Together with synchronous communication between the MPI process and the PMI server, there can optionally be a separate connection for out-of-band messaging.
The PMI server expects at least one connect and a follow-on message which says "cmd=init pmi_version=..." from the client. This is the "regular" connection.
The client can optionally open a second connection and a follow-on message which says "cmd=init_oob signal=SIGSTOP pmi_version=...". This will be the "out-of-band" connection. The signal can be one of the signals supported on that platform or NONE. If the signal requested is NONE, the server will not signal the MPI process when a message is sent. Note that if the application uses the same signal for its own processing, then it can be a problem. But that part is out-of-scope for this proposal; the MPI process can figure out what signal to use based on some coordination with the application (e.g., environment variable).
All out-of-band communication happens on the out-of-band socket. Each OoB message is initiated by the server and sent to the MPI process and is of the form "cmd=checkpoint ..." or "cmd=abort ...".
The MPI process can either request a signal during initialization (in which case the PMI server has to send the message and follow it up with a signal) or request no signal (in which case the MPI process has to continuously monitor this socket using either SIGIO or a separate thread blocking on the socket).
If PMI were used solely for commands, any simple character set, such as ASCII, would be fine. However, some commands may return error messages and others, such as the name publishing routines, may need user-specified strings. To avoid problems with internationalization, PMI v2 uses UTF-8, which provides backward compatibility with ASCII. In particular, PMIv1, which used ASCII, used a UTF-8 subset. One feature of UTF-8 is that bytes that represent ASCII characters are unique: every byte of any other character has the high bit set. This means that a character, such as the semicolon that PMIv2 uses as the terminator, can be found without worrying about whether the same byte is part of a longer UTF-8 multi-byte character; that can never happen. In particular, the PMIv2 code can simply copy message strings that use multibyte UTF-8 without needing to process them. More information on using UTF-8 may be found at http://www.cl.cam.ac.uk/~mgk25/unicode.html.
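The byte-level property described above can be checked with a short Python sketch: the terminator is located by searching the raw bytes, without decoding the message first. The function name `find_terminator` and the sample error string are made up for the example, and the sketch deliberately ignores the doubled-semicolon escape so that it shows only the UTF-8 point.

```python
SEMICOLON = 0x3B  # ASCII code of ';'


def find_terminator(buffer, start=0):
    """Return the index of the next semicolon byte, or -1 if there is none."""
    return buffer.find(b";", start)


# A value containing two-byte and three-byte UTF-8 characters.
message = "errmsg=réseau indisponible 日本語;".encode("utf-8")
end = find_terminator(message)

# Every byte belonging to a multi-byte character has its high bit set,
# so none of them can be mistaken for the terminator.
multibyte_bytes = [b for b in message if b >= 0x80]
collisions = [b for b in multibyte_bytes if b == SEMICOLON]

value = message[len(b"errmsg="):end].decode("utf-8")
```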
PMI Wire Protocol Error Codes
This is a non-exhaustive list of error codes from the PMI server. These cannot be MPI error codes because the PMI server is independent of any particular instance of MPI (even of MPICH). Error codes are non-negative.
Also 0, means no error
Communication failure with PMI server
An easy way to provide security for the PMI traffic is to use OpenSSL. Because this changes the way in which the socket is connected, we've added a PMI_SOCKTYPE environment variable. Note that a PMI implementation is free to require SSL or some other secure communication mechanism.
The length-message format used here is easy on the reader but can be awkward for the writer, particularly for commands that may have a large number of key/value pairs (such as spawn with many command-line arguments and info values). For commands that exceed the maximum length, there is a special operation to concatenate lines. PMI implementations may choose to bound the total size of a command (e.g., to be limited by the maximum command line in the typical shell, which is often around 64k).
Here are some thoughts for how to implement processing of the PMI wire protocol in the multithreaded case, particularly on the client side (where multiple threads may make blocking PMI calls).
Sending (from the client) is easy. We may assume that the server is working as fast as possible to process requests, so any PMI call can enter a PMI-write critical section, perform the write, then exit the critical section. If the write blocks, that's ok, as the server will soon unblock it, and no other PMI call that needs to write would be able to make progress.
Receiving is more difficult.
Define a routine, PMIR_Progress( int mythrid ), that reads from the PMI fd and processes each message. If the message is for the specified mythrid, then exit after that message is read. Otherwise, read the message and enqueue it by thrid; signal (using a condition variable) the relevant thread. Note that because all of the PMI calls are blocking, at most one message per thread will be pending. Thus, copying out and enqueuing the message for each thread is a small burden and simplifies the implementation.
If a routine enters PMIR_Progress and some routine is already using PMIR_Progress, then enter a condition wait.
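The receive-side scheme described above can be sketched in Python with a condition variable. This is an illustration under stated assumptions rather than a reference implementation: the class name `ProgressEngine` is hypothetical, `read_message` stands in for a blocking read from the PMI fd that returns a `(thrid, message)` pair, and a `queue.Queue` plays the role of the wire in the demo.

```python
import collections
import queue
import threading


class ProgressEngine:
    """One thread at a time reads the wire; the others wait on a condition."""

    def __init__(self, read_message):
        self._read_message = read_message  # blocking; returns (thrid, message)
        self._cond = threading.Condition()
        self._queued = collections.defaultdict(collections.deque)
        self._reader_active = False

    def pmir_progress(self, mythrid):
        with self._cond:
            while True:
                if self._queued[mythrid]:
                    # Another thread already read our message for us.
                    return self._queued[mythrid].popleft()
                if not self._reader_active:
                    self._reader_active = True  # we become the reader
                    break
                self._cond.wait()  # someone else is reading: condition wait
        try:
            while True:
                thrid, message = self._read_message()  # read outside the lock
                if thrid == mythrid:
                    return message
                with self._cond:
                    # Not ours: enqueue by thrid and signal the waiting threads.
                    self._queued[thrid].append(message)
                    self._cond.notify_all()
        finally:
            with self._cond:
                # Hand the reader role to whichever waiter wakes up next.
                self._reader_active = False
                self._cond.notify_all()


# Demo: two threads block in PMI calls; replies arrive in the "wrong" order.
wire = queue.Queue()
engine = ProgressEngine(wire.get)
results = {}


def blocking_call(thrid):
    results[thrid] = engine.pmir_progress(thrid)


threads = [threading.Thread(target=blocking_call, args=(t,)) for t in ("A", "B")]
for t in threads:
    t.start()
wire.put(("B", "reply-for-B"))
wire.put(("A", "reply-for-A"))
for t in threads:
    t.join(timeout=5)
```

The sketch uses `notify_all` with a re-check loop instead of signalling one specific thread; that is simpler to get right, at the cost of a few spurious wakeups.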