New RMA Design

[[Category:Design Documents]]

== New data structure for RMA operations ==

(1) Overview:

We use a new 3D data structure to store posted RMA operations. Three kinds of data structures are involved (a minimal C sketch of all three follows the figure below):

(a) RMA op: contains all of the origin/target information needed for this RMA op, plus a new state field that is set to either “pending” (the op has been posted by the user but not yet issued by the runtime) or “issued” (the op has been issued by the runtime but not yet completed). When the user posts a new op, the runtime creates a new op structure and enqueues it onto the corresponding op list; when the op completes, the runtime dequeues the op structure from that list and frees it.

(b) Target: contains a pointer to an op list that stores all RMA ops destined for the same target, plus the PER_TARGET state for that target (see bullet 6(1)). When the origin first communicates with a target, the runtime creates a new target structure and enqueues it onto the corresponding target list; when the origin finishes communicating with that target, or when all internal resources for targets are exhausted (see bullet 4), the runtime dequeues the target structure from that list and frees it.

(c) Slot: contains a pointer to a target list. Targets are distributed among slots in round-robin fashion. At window creation time, the MPI runtime allocates a fixed-size slot array on the window (the size of the slot array can be changed by the user).

[[File:Op-list-slots.jpg]]
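
As a concrete reference, here is a minimal sketch in C of the three structures described above, plus the enqueue step performed when an op is posted. All type and field names are illustrative assumptions, not the actual MPICH identifiers; the real structures carry additional fields (datatypes, target displacements, request handles, and so on).

<syntaxhighlight lang="c">
/* Illustrative sketch only; names are hypothetical, not MPICH identifiers. */

typedef enum {
    OP_STATE_PENDING,   /* posted by the user, not yet issued by the runtime */
    OP_STATE_ISSUED     /* issued by the runtime, not yet completed */
} op_state_t;

/* (a) RMA op: one posted operation, linked into a per-target op list. */
typedef struct rma_op {
    op_state_t state;            /* "pending" or "issued" */
    void *origin_addr;           /* origin buffer */
    int origin_count;            /* ... plus datatypes, target offset, etc. */
    struct rma_op *next;
} rma_op_t;

/* (b) Target: op list for one target plus its PER_TARGET state,
 * linked into a per-slot target list. */
typedef struct rma_target {
    int target_rank;
    int per_target_state;        /* PER_TARGET state (see bullet 6(1)) */
    rma_op_t *op_list_head;
    rma_op_t *op_list_tail;
    struct rma_target *next;
} rma_target_t;

/* (c) Slot: head of one target list. */
typedef struct {
    rma_target_t *target_list_head;
} rma_slot_t;

/* Window: fixed-size slot array, allocated at window creation time;
 * the array size is user-tunable. */
typedef struct rma_win {
    int num_slots;
    rma_slot_t *slots;
} rma_win_t;

/* Posting an op (sketch): mark it pending and enqueue it at the tail of
 * the target's op list; completion later dequeues and frees it. */
static void enqueue_op(rma_target_t *t, rma_op_t *op)
{
    op->state = OP_STATE_PENDING;
    op->next = NULL;
    if (t->op_list_tail)
        t->op_list_tail->next = op;
    else
        t->op_list_head = op;
    t->op_list_tail = op;
}
</syntaxhighlight>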

(2) Performance issue:

Note that for every RMA op routine, the runtime must search the corresponding target list to find the correct op list, which introduces some overhead when posting operations. With a fixed-size slot array, however, the lookup overhead grows only linearly with the number of targets the origin is actively communicating with. In other words, the lookup becomes significant only when the application is unscalable and talks to many targets at once, and optimizing for that case is not a goal of this design. For a scalable application, the lookup overhead when posting operations is negligible. A sketch of the lookup path follows.
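
Here is a sketch of that lookup path, reusing the hypothetical structures from the sketch above. Because targets are distributed round-robin, each slot's target list holds roughly (number of active targets) / (number of slots) entries, which is why the walk stays short for a scalable application.

<syntaxhighlight lang="c">
/* Sketch of the per-operation lookup (hypothetical helper, based on the
 * illustrative structures above). The target's slot is chosen round-robin
 * by rank; the slot's target list is then searched linearly. */
static rma_target_t *find_target(rma_win_t *win, int target_rank)
{
    rma_slot_t *slot = &win->slots[target_rank % win->num_slots];
    rma_target_t *t;

    for (t = slot->target_list_head; t != NULL; t = t->next) {
        if (t->target_rank == target_rank)
            return t;   /* found: this target's op list receives the op */
    }
    return NULL;        /* not found: the caller creates and enqueues a new
                           target structure on this slot's list */
}
</syntaxhighlight>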

(3) Potential benefits:

(a) A separate op list is kept for each target.

(b) Scalable: internal resources are consumed only for the targets the origin is actively communicating with.

== Operations pool ==

== Garbage collection for operations ==

== Targets pool ==

== Garbage collection for targets ==

== Algorithms for each synchronization ==

(1) States

(2) Fence

(3) Post-Start-Complete-Wait

(4) Lock-Unlock

(5) Lock_all-Unlock_all

(6) Flush

== Multithreading issues ==

== Shared memory issues ==
