Difference between revisions of "PMI-2"

m (Reverted edits by Raffenet (talk) to last revision by Gropp)
* Correct - Must avoid race conditions in the design
* Portable - Must not assume a particular environment such as POSIX
== Basic Concepts ==
The basic idea is that all interaction with external process and resource managers, as well as the exchange of any information required to contact other processes in the same parallel job, takes place through the process management interface or PMI.
There are four separate sets of functionality:
# Creating, connecting with, and exiting parallel jobs
# Accessing information about the parallel job or the node on which a process is running
# Exchanging information used to connect processes together
# Exchanging information related to the MPI Name publishing interface
While these can be combined within a single, full-featured process manager, in many cases, each set of services may be provided by a different actor. For example, creating processes may be managed by a system such as PBS or LoadLeveler. The Name publishing service may be accomplished by reading and writing files in a shared directory. Information about the parallel job and the node may be provided by mpiexec, and the connection information may be handled with a scalable, distributed tuple-space system.
There are three groupings of processes that are important in understanding the process manager interface.
;Process
:An MPI process; this is usually an OS process (but need not be; an example would be threads in a language that keeps named globals thread-private by default).
;Job
:This is a collection of processes managed together by a process manager that understands parallel applications. A job contains all of the processes in a single MPI_COMM_WORLD and no more. That is, two processes are in the same job if and only if they are in the same MPI_COMM_WORLD.
;Connected Jobs
:This is a collection of jobs that have established a connection through the use of PMI_Job_Spawn or PMI_Job_Connect. If any process in a job establishes a connection with any process in another job, then all processes in both jobs are connected. That is, connections are established between jobs, not processes. This is necessary to implement the MPI notion of connected processes.
In addition, it is desirable to allow the PMI client interface to be implemented with a dynamically loadable library.  This allows an executable to load a version of PMI that is compatible with whatever process management system will be running the application, without requiring the process management systems to implement the same communication (or ''wire'') protocol.  The consequence of this is that the <tt>pmi.h</tt> header file is standardized across all PMI client implementations (in PMI v1, each PMI client implementation could, like MPI, define its own header file).
=== Character Sets ===
The PMI interface represents most data as printable characters rather than as raw binary. This simplifies support for systems with heterogeneous data representations and also simplifies the "wire" protocol. The character set for PMI v2 is [http://en.wikipedia.org/wiki/UTF-8 UTF-8]; this is a variable-length representation that contains ASCII as a subset and for which the null byte is always a string terminator. All character data in the PMI v2 interface is in the UTF-8 character set. The rationale for using UTF-8 over ASCII is to avoid problems with internationalization in the case where commands return user-defined error strings.
==Wire Protocol==
===Basic Concepts===
Commands are exchanged between the PMI client and server in a simple
<tt>key=value</tt> format.  There are a number of predefined keys that
are used consistently in the commands; additional keys may be defined
as necessary to provide new services (though the expectation is that
few if any further changes will be required to this specification).
The PMI server indicates whether an error was detected by setting the
value of the key <tt>rc</tt> to a non-negative integer (defined
below).  If the value is positive (indicating an error), the command
may optionally include the key <tt>errmsg</tt>, which provides an
error message string that the client may choose to print.  This allows
the client side of the code to give the user more detailed information
about a problem.
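As a minimal sketch of how a client might act on these two keys, assume the response has already been parsed into a dictionary of strings. The names <tt>PMIError</tt> and <tt>check_response</tt> are illustrative and not part of PMI, and treating a missing <tt>rc</tt> as success is an assumption of this sketch.

```python
class PMIError(Exception):
    """Raised when a response carries a positive rc (hypothetical helper type)."""

    def __init__(self, rc, errmsg=None):
        super().__init__(errmsg or f"PMI server returned rc={rc}")
        self.rc = rc
        self.errmsg = errmsg


def check_response(fields):
    """Return the fields unchanged if rc is 0; raise PMIError if rc is positive."""
    rc = int(fields.get("rc", "0"))  # assumption: a missing rc means success
    if rc < 0:
        raise ValueError("rc must be a non-negative integer")
    if rc > 0:
        # errmsg is optional, so fall back to a generic message when absent.
        raise PMIError(rc, fields.get("errmsg"))
    return fields
```

A caller can then print <tt>str(exc)</tt> to show the server-supplied message when one was provided.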
===Nonblocking operations===
In version 1 of the PMI wire protocol, PUT/GET operations were blocking and were implemented as request-response transactions over the wire.  The server could assume that a second request would not arrive over the wire, from the same client, before the first request was satisfied.  This is not appropriate for expensive operations such as spawn (or even GET on some implementations), since it may take a great deal of time (particularly on a batch scheduled system) for the new processes to be launched.  This requires that spawn be a split operation.
To allow for this, in PMI-2, PUT/GET operations are nonblocking and are complete only after a FENCE call.  The PMI wire protocol must therefore associate a request id with each request and return that id in the response.  The id need only be unique within the process, since the response is delivered to that particular process.  The id is carried in the field "<tt>reqid</tt>" (short for "request id"), which identifies the operation associated with the request; responses to the request include this field with the same value.  The client must use unique values for <tt>reqid</tt> to distinguish concurrent operations.
Note that FENCE is locally blocking, and collective, but need not collectively synchronize.
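One way a client could keep track of concurrent operations is a table of outstanding requests keyed by <tt>reqid</tt>. The sketch below is only an illustration: the class name, the counter-based id scheme, and the dictionary representation of commands are assumptions, not part of the specification.

```python
import itertools


class PendingRequests:
    """Track outstanding nonblocking requests by reqid (illustrative helper)."""

    def __init__(self):
        self._next_id = itertools.count(1)  # ids only need to be unique in this process
        self._pending = {}                  # reqid -> request fields

    def issue(self, cmd, **fields):
        """Assign a fresh reqid and remember the request until its response arrives."""
        reqid = str(next(self._next_id))
        request = {"cmd": cmd, "reqid": reqid, **fields}
        self._pending[reqid] = request
        return request

    def complete(self, response):
        """Match a response to its request using the echoed reqid."""
        return self._pending.pop(response["reqid"])

    def outstanding(self):
        """Number of requests still waiting for a response."""
        return len(self._pending)
```

Because the lookup uses only the echoed id, responses may arrive in any order relative to the requests.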
== PMI Wire Protocol Version 2 ==
=== Specification Conventions ===
* Boolean values are always one of the strings "<code>true</code>" or "<code>false</code>".
* As mentioned previously, return codes (<code>rc</code>) are non-negative integers that may be represented by an <tt>int32_t</tt> integer in C (using a signed integer gives more flexibility in the calling code).
=== The protocol ===
The client establishes a connection to the server.  The client may choose to, and/or the server may require that the client, use a secure connection, such as OpenSSL.  This part of the wire protocol is not defined.  It is recommended that clients and servers either use OpenSSL or use plain sockets, making use of the authentication extensions defined below on the <tt>fullinit</tt> command.
In order to initialize the connection between the client and the server, the client needs to know how to establish the connection.  This information may be specified by the following environment variables:
;PMI_FD: Use this FD (encoded as a character)
;PMI_PORT: Use this host and port, encoded as <tt>hostname:port-number</tt>, e.g., <tt>myhost.edu:12674</tt>
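A client might interpret these two variables roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, <tt>PMI_FD</tt> is assumed to hold the descriptor as a decimal string, and giving <tt>PMI_FD</tt> precedence over <tt>PMI_PORT</tt> is a choice made here, not something the text requires.

```python
def connection_from_env(env):
    """Return ("fd", n) or ("port", host, port) from a mapping such as os.environ."""
    if "PMI_FD" in env:
        # Assumption: the descriptor is written as a decimal string.
        return ("fd", int(env["PMI_FD"]))
    if "PMI_PORT" in env:
        # Split on the last colon so only the final field is taken as the port.
        host, _, port = env["PMI_PORT"].rpartition(":")
        if not host or not port.isdigit():
            raise ValueError("PMI_PORT must look like hostname:port-number")
        return ("port", host, int(port))
    return None  # neither variable is set; the caller decides what to do
```

Passing the environment as an argument (for example <tt>connection_from_env(os.environ)</tt>) keeps the function easy to exercise without modifying the real environment.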
Once the socket connection is established, the client sends commands to the server in the following format:
where <tt>length</tt> is exactly 6 ASCII digits (blank padded on left if necessary) giving the length and the
command is exactly that many bytes (''not'' characters, since PMI-2 uses UTF-8 and there may be some multi-byte characters).  The <code>commandline</code> is of the form
  cmd=name;key1=value1;key2=value2 ...;
If a semicolon needs to be part of a key or value, it needs to be
escaped by being doubled. 
No other characters (not even new lines)
are special.  The semicolon was chosen because it is often used to
terminate commands, and hence is rarely used in the sort of values
that are likely to be communicated with PMI.
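The framing and escaping rules above can be sketched as follows. This is only an illustration: it assumes that the 6-character length field is immediately followed by the command bytes with no separator, and the helper names are hypothetical.

```python
def escape(text):
    """Double every semicolon so it is not taken as a terminator."""
    return text.replace(";", ";;")


def build_commandline(cmd, pairs):
    """Build 'cmd=name;key1=value1;...;' from a command name and (key, value) pairs."""
    parts = ["cmd=" + escape(cmd) + ";"]
    for key, value in pairs:
        parts.append(escape(key) + "=" + escape(value) + ";")
    return "".join(parts)


def frame(commandline):
    """Prefix the UTF-8 encoded command with its byte length as 6 blank-padded digits."""
    body = commandline.encode("utf-8")
    if len(body) > 999999:
        raise ValueError("command too long for a 6-digit length field")
    # The length counts bytes, not characters, so it is taken after encoding.
    return f"{len(body):6d}".encode("ascii") + body
```

Computing the length after encoding matters whenever a value contains multi-byte characters, since the byte count then exceeds the character count.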
The first command the client can send to the server is the init command
Command names are more limited and must be letters, digits, and hyphens only.  Key names are limited to letters, digits, hyphens, and underscores.  To be precise, the regular expression for keys is
Note that values are not limited to those characters, and may contain spaces, equal signs, newlines, and even nulls ''(but not semicolons)''.
Implementations must be prepared to accept values that contain the <tt>=</tt> character.  See notes on the character set below.
Note that the semicolon is the command ''terminator'', not ''separator''.  This slightly simplifies processing of commands.
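A receiver can split a command line with a single left-to-right scan. The sketch below follows the doubling rule for semicolons and splits each field on its first <tt>=</tt> only, so values may contain further equal signs. The function name is hypothetical, and collecting fields into a dictionary (which keeps only the last occurrence of a repeated key) is a simplification.

```python
def parse_commandline(line):
    """Split 'cmd=name;key=value;...;' into (name, {key: value})."""
    fields = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == ";":
            if i + 1 < len(line) and line[i + 1] == ";":
                current.append(";")          # doubled semicolon: a literal ';'
                i += 2
                continue
            fields.append("".join(current))  # single semicolon: field terminator
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        raise ValueError("command is missing its terminating semicolon")
    pairs = {}
    for field in fields:
        key, sep, value = field.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError("field without '=': %r" % field)
        pairs[key] = value
    if "cmd" not in pairs:
        raise ValueError("command has no cmd key")
    name = pairs.pop("cmd")
    return name, pairs
```

Because the semicolon terminates every field, including the last one, the scan never needs to special-case the end of the command.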
There are many predefined commands.  For most commands, there is a
response; in that case, the name of the response is the command name with
<tt>-response</tt> appended.  Responses normally include a return code,
given by the key <tt>rc</tt>.  A typical sequence is to send
and receive
For any commands that occur after MPI initialization, there is an
additional <tt>thrid</tt> field on both the command and the
response.  This is used to support thread safety, and was described in
more detail above.  This field should be the first field after the command.
In some cases, a value or a command may require an excessive number of characters.  Rather than require that the
PMI client and server support arbitrarily long commands, a command may be split into multiple messages by adding a
at the end and a
at the beginning of the continuation.  The <tt>id</tt> value is arbitrary, but should be sufficient to identify a stream of commands.  For example, simply using the process or thread id will usually be sufficient.
Many commands have a response form that includes an <tt>errmsg=string</tt> item.  If there is no error (<tt>rc=0</tt>), this field may be omitted.
The following commands and their arguments are defined.  These are the
ones needed to implement the PMI version 2 API.  The first command is
the one sent by the client to the PMI server; the second (always a
<tt>-response</tt> form) is sent back to the client.  The key
<tt>rc</tt> is used consistently for a return code.  A zero return
code is success; a non-zero code is a failure, and the codes are [[#PMI Wireprotocol Error Codes|documented below]]. 
Note that some commands have an instance-specific number
of values (such as info keys or arguments to a command that is to be
After the standard init with version string,
{{color|blue|('''balaji''': what's the difference between debugged and pmiverbose?)}}
The client process sends a request to begin the initialization by using the
<tt>fullinit</tt> command.  This has three optional keys:
This is a string that allows the PMI server to identify processes
that belong to the same parallel job.  In the simple PMI
implementation, if the processes are started by <tt>mpiexec</tt>,
the environment variable <tt>PMI_JOBID</tt> is set with this
string; the processes can check in using this string.  This is only
required if the server will require a way to identify the processes in
a job; if no <tt>PMI_JOBID</tt> is set, then the
<tt>pmijobid</tt> key is not required. 
Another use for <tt>pmijobid</tt> is for processes that are
started outside of the process management system but that still need
the PMI services, such as the information on the rank and size of the
job.  This may occur if, for example, a parallel debugger starts the processes.
The rank in <tt>MPI_COMM_WORLD</tt> of this process. This is
provided only if the process already knows this value; that may happen
if, for example, the processes are started by some outside system, or
if the environment variable <tt>PMI_RANK</tt> is set.
If true, then PMI will require thread ids in messages in order to provide thread safety.  If this is false, the wire protocol need not provide thread id values.  This provides a modest optimization.
An authentication type; this is a string name that the client and
server agree upon.  See the example below for details.  This is
optional; if the server does not need authentication (for example, the
connection is provided through a pre-existing file descriptor), this
field is not required.
Information to be used to establish authentication.
If the job was spawned, the key <tt>spawner-jobid</tt> is given
with the job id of the spawner.  If the process was not spawned (e.g.,
created with <tt>mpiexec</tt>), then this key is not provided in
the <tt>fullinit-response</tt>. This <tt>jobid</tt> can be used in <tt>job-connect</tt>.
The <tt>authtype</tt> and <tt>authinfo</tt> strings are used
to allow the PMI client and server to negotiate an authentication
method.  In relatively secure environments, particularly ones with a
shared secret, this can use a challenge-response handshake (which will
take place before the <tt>fullinit-response</tt> command is
returned). In this case, the sequence looks something like this:
Client sends:
(authinfo in this case is not needed and thus not sent).
The server takes this and returns the following:
where "n" is a random integer.  Then the client concatenates
the shared secret with n and computes the SHA-256 hash
of the result.  It sends to the server the command
where string is the SHA-256 hash.  At this point, the server can decide if
it is willing to accept the client.  If so, it returns the
<tt>fullinit-response</tt> command; otherwise it closes the
connection (and may wish to log the failed attempt).
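The hash step of this handshake might look like the following sketch. Several details are assumptions made for illustration and are not fixed by the text: n is appended to the secret in decimal, the result is UTF-8 encoded before hashing, and the digest is exchanged as lowercase hexadecimal. The function names are hypothetical.

```python
import hashlib
import hmac


def auth_digest(shared_secret, n):
    """Hash the shared secret concatenated with the server's random integer n."""
    # Assumptions: n is appended in decimal and the string is UTF-8 encoded.
    material = (shared_secret + str(n)).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def server_accepts(shared_secret, n, client_digest):
    """Recompute the digest on the server side and compare it in constant time."""
    expected = auth_digest(shared_secret, n)
    return hmac.compare_digest(expected, client_digest)
```

Using <tt>hmac.compare_digest</tt> rather than <tt>==</tt> avoids leaking, through timing, how many leading characters of a guessed digest were correct.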
{{color|red|('''goodell''': we may want to use SHA-256 or something similar that doesn't yet have a known attack against it.  MD5 and SHA-1 have known problems.)}}
{{color|red|('''gropp''':I've switched MD5 to SHA-256)}}
For additional security, the random integer itself would be encrypted,
and some function of the integer would be used by the client (this is
what Kerberos does).
This is a simple example.  Note that in this case, the connection
itself is not secured or encrypted.  Other strategies may require
additional exchanges; this is easily accomplished by sending
<tt>cmd=auth-response</tt> between the client and server until one
or the other indicates completion of the handshake by sending the
<tt>auth-response-complete</tt> command.
There is no response for this command.
{{color|red|('''goodell''': these job commands refer to a job id with different key names in different commands (jobid/name/id), could we standardize on "jobid"?)}}
{{color|red|('''gropp''': I have attempted to standardize these as jobid.)}}
This is a complex command because there are many fields and
the command may take a long time to execute. 
A response will be sent when the operation completes; in a
multi-threaded environment, the PMI client implementation must allow
other threads to make PMI requests while waiting for the spawn request
to complete.
The command <tt>spawn</tt> and each <tt>spawn-cmd</tt> must be
sent consecutively (i.e., no other thread may access the PMI wire
until the entire command is sent).  Once the
<tt>spawn-cmd</tt> is sent, however, other threads should be
allowed to make PMI calls, and the PMI implementation must be
prepared to accept a <tt>spawn-response</tt> command at
any time.
  ... one for each command (this is spawn multiple)
Jobid in the <tt>spawn-response</tt> command is the JobId
that may be used in <tt>PMI_Job_Connect</tt>.  This information
may be needed by other processes in order to use <tt>PMI_KVS_Get</tt>.
The <tt>thrid</tt> for the <tt>spawn</tt> and
<tt>spawn-cmd</tt> commands must be the same (for the same spawn command).
The info keys include the ones defined in the MPI-2 specification
(Section 5.3.4) and in addition include these
A timeout limit, in milliseconds (integer)
A string containing credentials needed to start a job (this may
include user, charge group, and pass phrases).  Note that it is
essential that this be encrypted if any sensitive information,
such as passwords or pass phrases, is included.  The PMI
client and server should agree on the method for securing this
The number of anticipated threads per MPI Process (integer).  This
may be used by a resource manager in allocating processor resources.
Note the use of <tt>infokey</tt>''i''<tt>=key</tt> and
<tt>infovalue</tt>''i''<tt>=value</tt> instead of
<tt>key=value</tt>.  This avoids both possible conflict with
predefined names in the commands and with key names that contain
special characters.
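The numbered-pair encoding can be sketched as below. The function name is hypothetical, and starting the index ''i'' at 0 is an assumption of this sketch; the text above does not state the base.

```python
def encode_info(info):
    """Flatten info entries into numbered infokey/infovalue pairs."""
    pairs = []
    # Assumption: the index i starts at 0 and increases by one per entry.
    for i, (key, value) in enumerate(info.items()):
        pairs.append((f"infokey{i}", key))
        pairs.append((f"infovalue{i}", value))
    return pairs
```

With this encoding, an info key that happens to be named <tt>cmd</tt> or <tt>rc</tt> travels as a value and therefore cannot be mistaken for a predefined key of the command itself.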
'''Question:''' Note that this supports the <tt>preput</tt> operations used in PMI
version 1.  We may want to change this to make that data arrive in the
<tt>fullinit</tt> command when the job is spawned.
'''Question''': Do we still need this, or is it unnecessary because the init
step could return the job id (though it does not yet)?
The <tt>kvscopy</tt> value is yes if the PMI client needs to help
the PMI servers share KVS information.  This is necessary when the PMI
servers for the jobs are different and have no way to connect.  In
this case, the commands <tt>kvs-getall</tt> and
<tt>kvs-putall</tt> are also used (note that these do not have PMI
Client API equivalents).
After the <tt>kvs-getall</tt> command, the server returns the
number of KVS pairs with the
<tt>kvs-getall-paircount-response</tt> command.  Following that
is one <tt>kvs-getall-pair-response</tt> for each key in the KVS
space.  The <tt>kvs-getall-response</tt> is sent after
all key/value pairs are returned.
The client sends one <tt>kvs-putall-pair</tt> command for each of
the keyval pairs.  This approach helps ensure that individual messages
are of reasonable length, since the number of pairs will often be at
least as large as the number of processes in
This command does not need a <tt>thrid</tt> because it can only
be used when all processes are guaranteed to be able to issue a PMI
Fence operation.  Effectively, that can only happen before
<tt>MPI_Init</tt> returns, so there can only be one thread
processing these operations.  However, a <tt>thrid</tt> is
included to eliminate differences in handling commands.
Note that because of the requirements on <tt>PMI_KVS_Fence</tt>, <tt>PMI_KVS_Put</tt> cannot be used after <tt>MPI_Init</tt>, except possibly in collective functions on <tt>MPI_COMM_WORLD</tt>.
The <tt>jobid</tt> field is optional; if it is not given, then the
job id for this process is assumed.  The value of
<tt>jobid</tt> must be one of the connected jobids.
The <tt>srcid</tt> field is a hint indicating which process might have PUT the corresponding key/value pair.  If the hint is incorrect, the server should still return the correct value (e.g., if process 1 performed a put of <tt>key=foo</tt> and <tt>value=bar</tt>, and a get is later performed with <tt>srcid=0</tt> and <tt>key=foo</tt>, the server should respond with <tt>flag=true</tt> and <tt>value=bar</tt>).  If a negative value is passed in <tt>srcid</tt>, or if the field is omitted altogether, then the hint is ignored.  This field is optional.
The <tt>thrid</tt> isn't necessary in practice but is included to
simplify command handling.
Note that the PMI API defines some key names; others may be added.  If
the key is unknown or there is no associated value for that key, the
value of <tt>flag</tt> is <tt>false</tt>. If the
<tt>wait</tt> key is set with the value <tt>true</tt>, then
the server will not respond until the value becomes available (in that
case, <tt>flag</tt> will always be <tt>true</tt>).  If the job
exits before the value becomes available, the server will treat that
as any other unexpected termination of the job.
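The lookup semantics described above can be illustrated with a toy in-memory key-value space. This is not a server implementation: the class is hypothetical, the <tt>wait</tt> behavior is not modeled, and including <tt>rc</tt> in the returned fields is an assumption. The point is that <tt>srcid</tt> may speed up a real lookup but never changes the answer.

```python
class KVS:
    """In-memory key-value space for one job (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (value, rank of the process that put it)

    def put(self, rank, key, value):
        """Record a value together with the rank that supplied it."""
        self._data[key] = (value, rank)

    def get(self, key, srcid=-1):
        """Return response fields; srcid is only a hint and is not used for matching."""
        entry = self._data.get(key)
        if entry is None:
            return {"rc": "0", "flag": "false"}
        return {"rc": "0", "flag": "true", "value": entry[0]}
```

A get with a wrong hint, a negative hint, or no hint at all returns the same fields.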
=== Out-of-Band Messaging ===
Together with synchronous communication between the MPI process and the PMI server, there can optionally be a separate connection for out-of-band messaging.
The PMI server expects at least one connection from the client, followed by a message of the form "cmd=init pmi_version=...". This is the "regular" connection.
The client can optionally open a second connection and send a follow-on message of the form "cmd=init_oob signal=SIGSTOP pmi_version=...". This will be the "out-of-band" connection. The signal can be one of the signals supported on that platform or NONE. If the signal requested is NONE, the server will not signal the MPI process when a message is sent. Note that problems can arise if the application uses the same signal for its own processing. That issue is out of scope for this proposal; the MPI process can determine which signal to use through some coordination with the application (e.g., an environment variable).
All out-of-band communication happens on the out-of-band socket. Each OoB message is initiated by the server and sent to the MPI process and is of the form "cmd=checkpoint ..." or "cmd=abort ...".
The MPI process can either request a signal during initialization (in which case the PMI server has to send the message and follow it up with a signal) or request no signal (in which case the MPI process has to continuously monitor this socket, using either SIGIO or a separate thread blocking on the socket).
===Character Set===
If PMI were used solely for commands, any simple character set, such as ASCII, would be fine.  However, some commands may return error messages and others, such as the name publishing routines, may need user-specified strings.  To avoid problems with internationalization, PMI v2 uses [http://en.wikipedia.org/wiki/UTF-8 UTF-8], which provides backward compatibility with ASCII.  In particular, PMIv1, which used ASCII, used a UTF-8 subset.  One feature of UTF-8 is that bytes that represent ASCII characters are unique: every byte of a multi-byte character has the high bit set.  This means that a character such as the semicolon, which PMIv2 uses as the terminator, can be found without worrying about whether the same byte is part of a longer UTF-8 multi-byte character; that can never happen.  In particular, the PMIv2 code can simply copy message strings that use multi-byte UTF-8 without needing to process them.  More information on using UTF-8 may be found at http://www.cl.cam.ac.uk/~mgk25/unicode.html.
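This property means that a plain byte scan is enough to locate terminators, with no UTF-8 decoding. A minimal sketch (the function name is hypothetical, and the doubling of escaped semicolons is not handled here):

```python
def find_semicolons(data):
    """Return the byte offsets of every 0x3B (';') byte in a UTF-8 buffer."""
    # No decoding is needed: bytes of multi-byte characters are all >= 0x80,
    # so a 0x3B byte can only be a real semicolon.
    return [i for i, byte in enumerate(data) if byte == 0x3B]
```

For example, scanning the UTF-8 encoding of an <tt>errmsg</tt> value written in a non-Latin script reports only the offsets of the actual semicolons in the message.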
===PMI Wireprotocol Error Codes===
This is a non-exhaustive list of error codes from the PMI server.
These cannot be MPI error codes because the PMI server is independent
of any particular instance of MPI (even of MPICH).  Error codes are
Also 0, means no error
Communication failure with PMI server
Unrecognized command
== Discussion Items ==
An easy way to provide security for the PMI traffic is to use
OpenSSL.  Because this changes the way in which the socket is connected, we've added a <tt>PMI_SOCKTYPE</tt> environment variable.  Note that a PMI implementation is free to require SSL or some other secure communication mechanism.
The length-message format used here is easy on the reader but can be
awkward for the writer, particularly for commands that may have a
large number of key/value pairs (such as spawn with many command-line
arguments and info values).  For commands that exceed the maximum length, there is a special operation to concatenate lines.  PMI implementations may choose to bound the total size of a command (e.g., to be limited by the maximum command line in the typical shell, which is often around 64k).
Here are some thoughts for how to implement processing of the PMI wire protocol in the multithreaded case, particularly on the client side (where multiple threads may make blocking PMI calls).
Sending (from the client) is easy.  We may assume that the server is working as fast as possible to process requests, so any PMI call can enter a PMI-write critical section, perform the write, then exit the critical section.  If the write blocks, that's ok, as the server will soon unblock it, and no other PMI call that needs to write would be able to make progress.
Receiving is more difficult.
Define a routine, <code>PMIR_Progress( int mythrid )</code>, that reads from the PMI fd and processes each message.  If the message is for the specified <code>mythrid</code>, then exit after that message is read.  Otherwise, read the message and enqueue it by <code>thrid</code>; signal (using a condition variable) the relevant thread.  Note that because all of the PMI calls are blocking, at most one message per thread will be pending.  Thus, copying out and enqueuing the message for each thread is a small burden and simplifies the implementation.
If a routine enters <code>PMIR_Progress</code> and some routine is already using <code>PMIR_Progress</code>, then enter a condition wait.
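The receive strategy described above might be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the class and method names are hypothetical, and <tt>read_message</tt> stands in for whatever blocking routine reads one message from the PMI fd and returns its <tt>thrid</tt> together with the message.

```python
import threading


class ProgressEngine:
    """Let one thread at a time read the PMI fd and hand messages to their owners."""

    def __init__(self, read_message):
        self._read_message = read_message  # blocking callable -> (thrid, message)
        self._cond = threading.Condition()
        self._mailbox = {}                 # thrid -> message (at most one per thread)
        self._reader_active = False

    def progress(self, mythrid):
        """Block until the message addressed to mythrid is available and return it."""
        with self._cond:
            while True:
                if mythrid in self._mailbox:
                    return self._mailbox.pop(mythrid)
                if not self._reader_active:
                    self._reader_active = True  # this thread becomes the reader
                    break
                self._cond.wait()               # another thread is reading
        try:
            while True:
                thrid, message = self._read_message()
                if thrid == mythrid:
                    return message
                with self._cond:
                    self._mailbox[thrid] = message  # enqueue for its owner
                    self._cond.notify_all()
        finally:
            with self._cond:
                self._reader_active = False
                self._cond.notify_all()             # let a waiter take over reading
```

The read itself happens outside the lock, so waiting threads can be woken and collect their messages while the reader is blocked on the fd. A single condition variable with <tt>notify_all</tt> is used here for brevity; a per-thread condition variable, as the text suggests, would avoid waking threads whose message has not yet arrived.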
Keyword: PMI

Latest revision as of 21:01, 16 June 2014

For the formal specification of PMI-2.

See PMI v2 API for some discussions about possible designs and some issues.

Design Requirements

(WDG - I find it valuable to list objectives and requirements first. Here's an initial list. They can and should be expanded, and the consequences of each understood)

  • Scalable - Semantics of operations must permit scalable implementation
  • Efficient - Must provide MPI implementation with the information that it needs without requiring potentially expensive steps.
  • Complete - Must support all of MPI, including dynamic processes
  • Robust - Must handle failures and aborts, including any resources acquired by the MPI application.
  • Correct - Must avoid race conditions in the design
  • Portable - Must not assume a particular environment such as POSIX