DAME: A Data Manipulation Engine for MPI
DAME is an efficient representation of the data movements defined by an MPI datatype. The representation may be optimized; that is, it may be different from the user's definition of the datatype; for example, a struct of different datatypes may be replaced by a contiguous block. Think of it as a simple program that can be executed efficiently by the MPI implementation.
In the most simple implementation, an interpreter, using code that is optimized for the platform, executes the DAME description. An additional optimization may generate native object code at runtime that is executed directly.
Issues and Implications:
1. An MPI datatype may specify a large amount of data, larger than can be sent in a single message. Thus the datatype implementation must handle "partial messages". However, it is not necessary to allow arbitrary buffer sizes (as in very small) nor is it necessary to require that the buffer be exactly filled (need not break an int across two buffers). An important special case is that where everything fits within a single buffer. In this case, there should be no extra overhead.
Question: If there was a single, known-sized maximum buffer length, the DAME "program" could be arranged at datatype commit time for that size, simplifying the code for handling longer buffers. Is this desirable?
Another consequence of this is that it is important to have a compact way to describe the state of a partial move, and to be able to continue from the last position.
2. Despite claims to the contrary, memcpy is *not* always faster than a simple loop - anyone who claims otherwise should replace all "i=j" with "memcpy(&i,&j,sizeof(int))" in their code. Thus, it is important to avoid assuming that all block moves are long.
An implication of this is that it may make sense to have different "instructions" for short and not-short block moves within the simple instruction set. Similarly, when alignment is easily computed and preserved by the move, that may need to be a separate instruction.
3. Large datatypes and efficiency. It is essential to be able to handle data larger than 2/4 GBytes. However, arithmetic with 8 byte integers may be slower (and in the case of indexed datatypes, double the memory pressure in the datatype description). A careful handling of sizes and counts is needed to ensure both efficiency in the common (less than 2GByte) case and correctness when the data is larger in size.
One possible mechanism is to have short (4 byte) and long (8 byte) versions of the data move instructions, permitting the optimization of the move code.
4. It is common for MPI datatypes to describe large collections of relatively small blocks. For good performance, there must be nearly zero overhead in dealing with the blocks. This means that the code for executing the data moves must be very streamlined, with little runtime overhead. Tests, as much as possible, must be hoisted out of the execution of the moves.
5. Error checking. The current code does a lot of checking at runtime, everytime the dataloop is used. In many cases, these may be checked at Type_commit time, or used only in debugging modes. This is related to point 4.
Define a data move language that matches both the MPI datatypes and the capabilities of the hardware. A DAME is a representation of the moves defined by an MPI datatype in this language. Optimizations can be applied to this representation to produce equivalent moves but with higher performance (the current dataloop_optimize performs simple optimizations). For most MPI datatypes (all except complex ones involving STRUCT), the "program" can be represented as an stack, with each stack element preloaded with the move "instruction". Storing a partial move requires only remember the position in the stack and the progress within that instruction.
For the move code itself, a simple interpreter that implements a loop over items, and where the loop has been written (possibly by an autotuning tool) to provide good performance, should usually be sufficient to get good performance. For more general cases, runtime generation of native code is possible. For important special cases, we could optimize for those cases.
Finally, given this "DAME program", it should be possible to take advantage of hardware scatter/gather support.
DAME design principles
The DAME design follows these principles:
1. The DAME data manipulation should be able to take advantage of the types and alignments of the lowest-level datatype elements. For example, if the derived datatype only consists of integers, we know at type commit time that all data movement operations can be done as assignments of integers if so desired.
2. All datatype manipulation operations (packing/unpacking) should be done in as many tight loops as possible. For example, for simple datatypes like vector-of-vector-of-integers, the resultant code should be essentially a three-level for loop at the most (with the lowest level for loop doing integer-wise assignments instead of memcpy for small contiguous regions). Many of the for loop structures for simple datatypes can be generated at compile time. For complex datatypes, a slower path might be OK.
To be added.