The MPICH ADI allows devices to provide customized support for the MPI RMA operations. The CH3 Channel provides a default implementation that relies only on the CH3 operations, along with provisions for channel extensions. In addition, the CH3 RMA implementation contains features to minimize the number of messages used for synchronization (see the EuroMPI paper).
A redesign and extension of the CH3 RMA implementation is described below. The redesign aims to optimize both short (latency-bound) and long (bandwidth-bound) RMA operations.
Additional details of the MPI-3 RMA development process are available at the MPI-3 RMA Implementation Timeline.
The RMA design has many competing requirements. The obvious top-level requirements are:
- Full MPI-3 support, including new creation routines, optional ordering, flush, and new read-modify-write operations
- High performance for latency-bound operations (e.g., single word put, accumulate, or get)
- High performance for bandwidth-bound operations (e.g., multiple puts of tens of KB of data)
- Exploit available hardware support, including shared memory and networks supporting RDMA
- Scalable algorithms and data; in particular, MPI window data must be scalable (and must also support the new MPI-3 window creation routines)
- MPI-2 and MPI-3 synchronization options.
- Extendable to different hardware systems
High performance for latency-bound operations requires both a short code path for these operations and a minimal number of network transactions. In turn, this implies specialization for such things as contiguous datatypes and a combined lock/operation/unlock for passive-target synchronization. It also means following the principle that decisions are made once (for example, once a transfer is determined to be contiguous, that should not be tested again) and minimizing the number of data copies.
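The "decide once" principle can be sketched in plain C. This is a minimal illustration only, not MPICH code: the `rma_op` descriptor and the functions `rma_op_init`, `rma_op_issue`, and `rma_op_is_contig` are hypothetical names, and a local `memcpy` stands in for the actual data transfer. The contiguity test runs once, when the descriptor is built, and its result is stored as a function pointer so the issue path contains no further branching on layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation descriptor for illustration; not an MPICH type. */
typedef struct rma_op rma_op;
typedef void (*rma_copy_fn)(const rma_op *op);

struct rma_op {
    const char *src;     /* origin buffer */
    char *dst;           /* stand-in for the target buffer */
    size_t elem_size;    /* bytes per element */
    size_t count;        /* number of elements */
    size_t src_stride;   /* bytes between consecutive origin elements */
    rma_copy_fn copy;    /* transfer routine, selected once in rma_op_init */
};

/* Fast path: one bulk copy, no per-element work. */
static void copy_contig(const rma_op *op)
{
    memcpy(op->dst, op->src, op->elem_size * op->count);
}

/* General path: gather strided elements into a packed destination. */
static void copy_strided(const rma_op *op)
{
    for (size_t i = 0; i < op->count; i++)
        memcpy(op->dst + i * op->elem_size,
               op->src + i * op->src_stride, op->elem_size);
}

/* The contiguity decision is made here and nowhere else. */
static void rma_op_init(rma_op *op, const void *src, void *dst,
                        size_t elem_size, size_t count, size_t src_stride)
{
    assert(op != NULL && src != NULL && dst != NULL);
    op->src = src;
    op->dst = dst;
    op->elem_size = elem_size;
    op->count = count;
    op->src_stride = src_stride;
    op->copy = (count <= 1 || src_stride == elem_size) ? copy_contig
                                                       : copy_strided;
}

/* Reports which path was selected at initialization time. */
static int rma_op_is_contig(const rma_op *op)
{
    return op->copy == copy_contig;
}

/* Issue path: a single indirect call, with no re-test of the layout. */
static void rma_op_issue(const rma_op *op)
{
    op->copy(op);
}
```

With a stride equal to the element size, `rma_op_init` selects the single-`memcpy` path; any other stride selects the gather loop. In both cases `rma_op_issue` is the same one-line call.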
High performance for both bandwidth-bound operations and large numbers of short operations requires that these operations be initiated as early as possible. MPI-3 (through the new request interface) requires that the user be able to wait on these operations individually. In addition, datatypes should be cached for non-contiguous transfers.
Support for fast operation within an SMP will require that the SMP path, as in the nemesis channel, be given the highest priority.
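The requirement that individual operations can be waited on can be illustrated with a toy model in plain C. This is a sketch under stated assumptions, not the MPICH design: `rma_win`, `rma_req`, `rput`, `progress`, `rwait`, and `win_flush` are hypothetical names, the "network" is simulated by a `memcpy` that completes operations in issue order, and the table size `MAX_OPS` is arbitrary. In MPI-3 itself, the corresponding user-level calls are the request-based operations such as `MPI_Rput` and `MPI_Rget`, completed with `MPI_Wait`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPS 16   /* arbitrary capacity for this sketch */

/* Hypothetical per-operation request; not an MPICH type. */
typedef struct {
    const void *src;
    void *dst;
    size_t len;
    int complete;    /* set once the simulated transfer has finished */
} rma_req;

/* Hypothetical window state: operations are recorded in issue order. */
typedef struct {
    rma_req ops[MAX_OPS];
    int n_ops;       /* operations issued so far */
    int next;        /* next operation the simulated network completes */
} rma_win;

static void win_init(rma_win *w)
{
    w->n_ops = 0;
    w->next = 0;
}

/* Record the operation as in flight at call time and return a handle
 * the caller can wait on. Returns NULL if the table is full. */
static rma_req *rput(rma_win *w, const void *src, void *dst, size_t len)
{
    if (w->n_ops == MAX_OPS)
        return NULL;
    rma_req *r = &w->ops[w->n_ops++];
    r->src = src;
    r->dst = dst;
    r->len = len;
    r->complete = 0;
    return r;
}

/* One step of simulated progress: complete the oldest in-flight
 * operation. Returns 1 if an operation completed, 0 if none remain. */
static int progress(rma_win *w)
{
    if (w->next == w->n_ops)
        return 0;
    rma_req *r = &w->ops[w->next++];
    memcpy(r->dst, r->src, r->len);
    r->complete = 1;
    return 1;
}

/* Wait on one operation only; later operations may remain in flight. */
static void rwait(rma_win *w, rma_req *r)
{
    while (!r->complete && progress(w)) {
        /* keep driving progress until r is done */
    }
}

/* Complete everything issued so far, analogous in spirit to a flush. */
static void win_flush(rma_win *w)
{
    while (progress(w)) {
        /* drain all in-flight operations */
    }
}
```

In this model, waiting on the second of three issued puts completes the first two and leaves the third in flight until `win_flush` (or a wait on its own handle) is called.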
Evaluation of the current MPICH RMA Support
The implementation of MPI-2 RMA in MPICH is the responsibility of the device; in the ch3 device, the current support uses the channel communication functions, along with the receive handler functions, to implement RMA. This provides a two-sided implementation that does not offer a good route to accessing lower-level interconnect features. In addition, the implementation was designed to support the general case, with only a few optimizations added much later to support a few special cases (particularly for short accumulates). Also, the approach uses lazy synchronization, which provides better performance for latency-bound operations and for short groups of RMA operations but does not provide good support for bandwidth-bound operations or for communication/computation overlap. Good points include general correctness and fairly detailed internal instrumentation.
Consider designing for multicore SMP nodes. This means that some data is stored in shared memory; candidates include MPI_Win data, cached datatypes, and, for the new shared-memory windows, the lock state. We should also consider matching the processing of RMA operations to the number of communication channels; thus RMA operations might not be processed by the same cores, or the same number of cores, that issue them (e.g., on BG/Q, should this be handled by the 17th core?).
To compete with PGAS languages, the code paths must be very short, with optimizations for the latency-bound cases.