New RMA Design

From Mpich
Revision as of 19:26, 15 July 2014


Goals

The goals of the RMA infrastructure are as follows:

  1. No usage of O(P) data structures, where 'P' is the number of processes. Data structure sizes should be constant and user-configurable. The general rule is that the user should be able to use more memory to achieve better performance.
  2. Reduce redundant messages where possible (e.g., extra acknowledgment packets).
  3. Reduce process synchronization where possible (e.g., FENCE need not always be synchronizing).
  4. Ensure code is thread-safe (e.g., avoid thread-private data structures).
  5. Ensure integration of RMA code into the overall MPICH progress engine.


Data structures

1. Overview:

We use a new 3D data structure, which we call Operation Table, to store posted RMA operations.

Op-list-slots.jpg

There are three kinds of data structures involved:

(1) Operation Element: contains all origin/target information needed for this RMA operation, plus a new field, op_state, which indicates whether the operation has been posted by the user but not yet issued by the runtime, or has been issued by the runtime but is not yet completed. When the user posts a new operation, the runtime creates a new operation structure and enqueues it on the corresponding operation list; when the operation completes, the runtime dequeues the structure from the list and frees it.

 struct OP_ELEM {
   // all origin/target information for this operation
   struct OP_ELEM *prev, *next;
   int op_state;  // PENDING/ISSUED
 };

(2) Target Element: contains a pointer to an operation list that stores all RMA operations to the same target, plus the per-target state for this target. When the origin first communicates with a target, the runtime creates a new target structure and enqueues it on the corresponding target list; when the origin finishes communicating with that target, or all internal target resources are used up, the runtime dequeues the structure from the target list and frees it.

 struct TARGET_ELEM {
   struct OP_ELEM *ops_list;
   struct TARGET_ELEM *prev, *next;
   int rank;
   int ops_count;
   int target_state;  // LOCK_CALLED/LOCK_ISSUED/LOCK_GRANTED
   int target_type;   // SHARED/EXCLUSIVE
   int target_mode;   // MODE_NO_CHECK
 };

(3) Slot: contains a pointer to a target list. Targets are distributed among slots in round-robin fashion. At window creation time, the MPI runtime allocates a fixed-size slot array on the window (the size of the slot array can be changed by the user).

 struct SLOT {
   struct TARGET_ELEM *targets_list;
 };

Note that a variable on the window records how many Slots have a non-NULL target list. This helps optimize ISSUE_OPS and CLEANUP_OPS (see Algorithms for RMA synchronizations) by avoiding a traversal of all Slots.
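The round-robin distribution and per-slot lookup can be sketched as follows. This is a hypothetical model in C; the slot count, struct fields, and function names are illustrative, not the actual MPICH code:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 256  /* fixed, user-configurable; independent of P */

struct target_elem {
    int rank;
    struct target_elem *next;
};

struct slot {
    struct target_elem *targets_list;
};

/* Round-robin rule: a target's slot is its rank modulo the slot count. */
static int slot_index(int target_rank)
{
    return target_rank % NUM_SLOTS;
}

/* Look up the Target Element for a rank; the search is linear only in the
   number of active targets that happen to share this slot. */
static struct target_elem *find_target(struct slot *slots, int target_rank)
{
    struct target_elem *t = slots[slot_index(target_rank)].targets_list;
    while (t != NULL && t->rank != target_rank)
        t = t->next;
    return t;
}
```

Because the slot array size is constant, memory usage stays O(1) in P; only the chains within a slot grow with the number of concurrently active targets.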

2. Performance issue

Note that for every RMA operation routine, the runtime needs to search the corresponding target list to find the correct target, which may introduce some overhead when posting operations. However, with a fixed-size slot array, the lookup overhead is linear in the number of targets the origin is actively communicating with. In other words, it causes significant overhead only when the application is unscalable and talking to many targets at once, and it is not the runtime's responsibility to optimize for that case. For a scalable application, the lookup overhead when posting operations is trivial.

3. Potential benefits

(1) Operations for different targets are no longer mixed in one list, but are stored in a separate operation list per target. This brings many benefits. For example, for a single-target epoch, we can easily determine whether all operations to a target are completed, and we can find the last operation to a target without counting the total number of operations. For a multiple-target epoch, we can optimize the garbage collection function by stopping the search in the current target's operation list when we meet an incomplete operation (the following operations in that list are likely also incomplete) and jumping to another target's operation list.

(2) Scalability: the size of the slot array is fixed and set by the user. It is no longer O(P) and scales when handling a large number of processes on the window.

(3) The garbage collection function can easily distinguish between operations that are not yet issued and operations that are issued but not yet completed, by checking the state flag.

Operation Elements Pool and Target Elements Pool

We use a two-level design for the Operation Elements Pool to manage Operation Element resources. One Operation Elements Pool, which we call the "local pool", is allocated and attached per window and is used only by that window; another Operation Elements Pool is a "global pool" that can be used by all windows. The sizes of both pools can be set by the user. The local pool prevents one window from using up all resources in the global pool and leaving other windows with no resources at all; the global pool covers the case where the application has many windows, in which case the per-window local pool cannot be large.

When we try to create a new Operation Element, we first check whether the local pool has a free element; if not, we check the global pool; if that is also empty, we call RESTORE_OP_ELEM_RESOURCES (see Algorithms for RMA synchronizations).

The Target Elements Pool works roughly the same way as the Operation Elements Pool. The only difference may be that the Target Elements Pool is much smaller than the Operation Elements Pool.
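The local-then-global lookup order described above can be sketched as follows. This is a hypothetical free-list model with illustrative names, not the actual MPICH implementation:

```c
#include <assert.h>
#include <stddef.h>

struct op_elem {
    struct op_elem *next;
};

/* A pool is modeled as a simple singly linked free list. */
struct op_pool {
    struct op_elem *free_list;
};

static struct op_elem *pool_take(struct op_pool *pool)
{
    struct op_elem *e = pool->free_list;
    if (e != NULL)
        pool->free_list = e->next;
    return e;
}

/* Two-level allocation: try the window's local pool first, then the global
   pool shared by all windows.  Returns NULL when both are exhausted; the
   real code would then block in RESTORE_OP_ELEM_RESOURCES until some
   element is freed. */
static struct op_elem *alloc_op_elem(struct op_pool *local_pool,
                                     struct op_pool *global_pool)
{
    struct op_elem *e = pool_take(local_pool);
    if (e == NULL)
        e = pool_take(global_pool);
    return e;
}
```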

Algorithms for RMA synchronizations

1. States for different epochs

There are 12 states for the different RMA epochs: UNSET, FENCE_ISSUED, FENCE_GRANTED, PSCW_ISSUED, PSCW_GRANTED, LOCK_ALL_CALLED, LOCK_ALL_ISSUED, LOCK_ALL_GRANTED, PER_TARGET, LOCK_CALLED, LOCK_ISSUED, LOCK_GRANTED. The last three are per-target states (stored in the Target Element); the remaining states are global states (stored on the window).

All states are mutually exclusive with each other.
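Since the states are mutually exclusive and each scope holds exactly one state at a time, they can be modeled as two plain enums, one per scope. This is a sketch; the actual MPICH encoding may differ:

```c
#include <assert.h>

/* Global states, stored on the window: exactly one is active at a time. */
enum win_state {
    UNSET,
    FENCE_ISSUED, FENCE_GRANTED,
    PSCW_ISSUED, PSCW_GRANTED,
    LOCK_ALL_CALLED, LOCK_ALL_ISSUED, LOCK_ALL_GRANTED,
    PER_TARGET
};

/* Per-target states, stored in the Target Element; these are meaningful
   only while the window's global state is PER_TARGET. */
enum target_state {
    LOCK_CALLED, LOCK_ISSUED, LOCK_GRANTED
};
```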

2. Important basic routines

(1) ISSUE_OPS: a nonblocking call. It tries to issue as many pending operations as possible for a given target. It returns the number of operations it issued.

(2) CLEANUP_OPS: a nonblocking call. It tries to find and clean up as many completed operations as possible for a given target. It returns the number of operations it cleaned up.

(3) MAKE_RMA_PROGRESS: a nonblocking call. It tries to issue pending operations and clean up issued operations for a given target until no further progress can be made.

 MAKE_RMA_PROGRESS(int target, MPI_Win win) {
   int x, y;
   do {
     x = ISSUE_OPS(target, win);
     y = CLEANUP_OPS(target, win);
   } while (x + y != 0);
 }

Note that MAKE_RMA_PROGRESS is called whenever we want to make progress on RMA operations. Ideally it should be called from the progress engine.
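The issue/clean-up fixed point that MAKE_RMA_PROGRESS runs can be modeled with counters standing in for the real operation lists. This is a hypothetical sketch, not the actual implementation:

```c
#include <assert.h>

/* Model: `pending` operations become `issued`; issued operations complete
   and are cleaned up.  The loop repeats until neither routine makes
   progress, mirroring MAKE_RMA_PROGRESS. */

struct target_model {
    int pending;  /* posted by the user but not yet issued */
    int issued;   /* issued but not yet completed */
};

static int issue_ops(struct target_model *t)
{
    int n = t->pending;  /* assume all pending ops can be issued */
    t->pending = 0;
    t->issued += n;
    return n;            /* number of operations issued */
}

static int cleanup_ops(struct target_model *t)
{
    int n = t->issued;   /* assume all issued ops have completed */
    t->issued = 0;
    return n;            /* number of operations cleaned up */
}

static void make_rma_progress(struct target_model *t)
{
    int x, y;
    do {
        x = issue_ops(t);
        y = cleanup_ops(t);
    } while (x + y != 0); /* stop when no further progress is possible */
}
```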

(4) PROGRESS_WAIT: a blocking call. It keeps poking the progress engine until a completion signal is caught in the progress engine. Note that in the multithreaded case, a thread blocked in this function may yield the CPU to other threads.

(5) RESTORE_TARGET_ELEM_RESOURCES: a blocking call. It makes progress and pokes the progress engine until Target Element resources are available.

 RESTORE_TARGET_ELEM_RESOURCES() {
   if (global_state == FENCE_GRANTED || global_state == PSCW_GRANTED) {
     pick one Target Element K;
     do MAKE_RMA_PROGRESS and PROGRESS_WAIT until K’s operation list is empty;
     send (or piggyback) RMA_DONE packet to K (increment ack_counter);   // don't wait for RMA_ACK to arrive
     free Target Element K;
   }
   else if (global_state == PER_TARGET) {
     if (no Target Element with state LOCK_GRANTED) {
       if (no Target Element with state LOCK_ISSUED) {
         pick one Target Element K with LOCK_CALLED;
         send lock request message;
         set state to LOCK_ISSUED;
       }
       else
         pick one Target Element K with state LOCK_ISSUED;
       PROGRESS_WAIT for lock granted message;
       set state to LOCK_GRANTED;
     }
     else {
     pick one Target Element K with state LOCK_GRANTED;
     }
     do MAKE_RMA_PROGRESS and PROGRESS_WAIT until K’s operation list is empty;
     send (or piggyback) RMA_DONE packet to K;
     do PROGRESS_WAIT until RMA_ACK packet arrives;
     free Target Element K;
   }
 }

Note that for FENCE_GRANTED and PSCW_GRANTED we send RMA_DONE packets but do not need to wait for RMA_ACK packets in RESTORE_TARGET_ELEM_RESOURCES, because we can simply track the total number of RMA_DONE packets and wait for all of the ACKs in the closing synchronization call. We cannot do the same for PER_TARGET, because that would need a potentially O(P) array to track the number of RMA_DONE packets sent to each target.
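The single-counter scheme for the FENCE/PSCW case can be sketched as follows. Names are hypothetical; the real code piggybacks packets and integrates with the progress engine:

```c
#include <assert.h>

/* In FENCE/PSCW epochs a single ack_counter on the window suffices:
   increment it when an RMA_DONE packet is sent, decrement it when the
   matching RMA_ACK arrives, and let the closing synchronization wait for
   zero.  A per-target count would need O(P) storage, which is why the
   PER_TARGET path instead waits for each RMA_ACK immediately. */

struct win_model {
    int ack_counter;  /* outstanding RMA_DONE packets without an ACK */
};

static void send_rma_done(struct win_model *w) { w->ack_counter++; }
static void recv_rma_ack(struct win_model *w)  { w->ack_counter--; }

/* The closing synchronization may finish only when all ACKs arrived. */
static int all_acks_received(const struct win_model *w)
{
    return w->ack_counter == 0;
}
```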

(6) RESTORE_OP_ELEM_RESOURCES: a blocking call. It makes progress and pokes the progress engine until Operation Element resources are available.

 RESTORE_OP_ELEM_RESOURCES {
   do MAKE_RMA_PROGRESS and PROGRESS_WAIT until there are available Operation Element resources;
 }

3. Algorithm for FENCE

Fence-states.jpg Legends.jpg

Note that there is a request pointer on the window, called fence_req, which stores the request of the IBARRIER.

 FENCE() {
   if (win_ptr->fence_req != NULL) {
     if (MPI_MODE_NOPRECEDE) {
      decrement ref count of fence_req by 1;  // let the progress engine delete this request
       set win_ptr->fence_req to NULL;
     }
     else {
       do PROGRESS_WAIT until previous IBARRIER is completed;  // fence_req on window is deleted now
       set global state to FENCE_GRANTED;
     }
   }
   if (MPI_MODE_NOPRECEDE) { // very first FENCE
     if (SHM is allocated) perform a local BARRIER;
     do a global IBARRIER, store the request to fence_req on window;
     set global state to FENCE_ISSUED;
   }
   else { // some operations need to be completed
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE packets to all active targets (increment ack_counter);
     do PROGRESS_WAIT until Operation Table is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_ACK packets from all active targets (decrement ack_counter, wait until it reaches 0);
     do a global IBARRIER;
     set global state to FENCE_ISSUED; // I am done with all my outgoing messages
     do PROGRESS_WAIT until IBARRIER is completed;  // fence_req on window is freed now
     set global state to FENCE_GRANTED; // I know that everyone else is done
   }
   if (MPI_MODE_NOSUCCEED) { // very last FENCE
     if (global state is FENCE_ISSUED) {  // fence_req on window is not freed
       do PROGRESS_WAIT until IBARRIER is completed;  // fence_req on window is freed now
       set global state to UNSET;
     }
   }
 }
 RMA_OP() {
   if (global state is FENCE_ISSUED) {
     if (request of IBARRIER is completed) {
       set global state to FENCE_GRANTED;	
     }
     else {
       // queue up this operation
       if (Target Element resources available && Operation Element resources available) {
         create Target Element/Operation Element, queue it up in Operation Table;
         exit;
       }
       else {
         if (run out of Target Element resources) {
           do PROGRESS_WAIT until IBARRIER is completed OR there are Target Element resources available; // other window may free resources
           if (Target Element resources available) {
             create Target Element, queue it up in Operation Table;
             if (Operation Element resources available) {
               create Operation Element, queue it up in Operation Table;
               exit;
             }
           }
           else { // request of IBARRIER is completed
             set global state to FENCE_GRANTED;
           }
         }
         if (run out of Operation Element resources) {
           do PROGRESS_WAIT until IBARRIER is completed OR there are Operation Element resources available; // other window may free resources
           if (Operation Element resources available) {
             create Operation Element, queue it up in Operation Table;
             exit;
           }
           else { // request of IBARRIER is completed
             set global state to FENCE_GRANTED;
           }
         }
       }
     }
   }
   if (global state is FENCE_GRANTED) {
     if (# of queued operations to this target <= piggyback_threshold) {
       // do piggyback, queue up this operation
       if (run out of Target Element resources) {
         do RESTORE_TARGET_ELEM_RESOURCES;
         create Target Element, queue it up in Operation Table;
       }
       if (run out of Operation Element resources)
         do RESTORE_OP_ELEM_RESOURCES;
       create Operation Element, queue it up in Operation Table;
     }
     else {
       // don’t do piggyback
       issue all queued Operation Elements to this target;
     }
   }
 }

Note that if the very last FENCE is not called with MPI_MODE_NOSUCCEED, the global state stays FENCE_GRANTED, and other RMA synchronizations will directly transition the global state to the correct state. If the very last FENCE is called with MPI_MODE_NOPRECEDE, the global state stays FENCE_ISSUED, and other RMA synchronizations will likewise directly transition the global state to the correct state; however, fence_req on the window is not yet freed, and will be freed either when we reach another MPI_WIN_FENCE or when we reach MPI_WIN_FREE.

4. Algorithm for Post-Start-Complete-Wait

Pscw-states.jpg Legends.jpg

The algorithm for Post-Start-Complete-Wait is roughly the same as the algorithm for FENCE, except for the following points:

(1) Everywhere the FENCE algorithm does an IBARRIER, we instead do ISENDs (on the target) and IRECVs (on the origin) among the group of processes.

(2) In WIN_POST, if SHM is allocated, the targets wait until all local ISENDs are completed; similarly, in WIN_START, if SHM is allocated, the origins wait until all local IRECVs are completed.

(3) Everywhere the FENCE algorithm waits for the IBARRIER request to complete, we instead wait for an array of requests (from the ISENDs/IRECVs) to complete. We allocate the request array on the window. This is unscalable if the origin has many active targets concurrently; FENCE is a better choice for such applications.

5. Algorithm for Lock-Unlock

Lock-states.jpg Legends.jpg

 LOCK() {
   set global state to PER_TARGET;
   if (target_rank == my_rank) {
     do PROGRESS_WAIT until I got the lock;
     exit;
   }
   if (SHM is allocated && target is a local process) {
     send lock request message;
     do PROGRESS_WAIT until receiving lock granted message;
     exit;
   }
   // create Target Element for this target
   if (run out of Target Element resources)
     do RESTORE_TARGET_ELEM_RESOURCES;
   create Target Element, queue it up in Operation Table;
   if (MPI_MODE_NOCHECK)
     set per-target state to LOCK_GRANTED;
   else
     set per-target state to LOCK_CALLED;
 }
 RMA_OP() {
   if (global state is PER_TARGET) {
     if (Target Element for this target does not exist) {
       // create Target Element for this target
       if (run out of Target Element resources)
         do RESTORE_TARGET_ELEM_RESOURCES;
       create Target Element, queue it up in Operation Table;
       set per-target state to LOCK_GRANTED;
     }
     // now Target Element for this target exists in Operation Table
     if (per-target state is LOCK_CALLED) {
       if (ops_list is empty && basic_datatype && size <= single_op_opt_threshold) {
         // queue this operation up for single operation optimization
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
       else {
         send lock request message;
         set per-target state to LOCK_ISSUED;
         // queue this operation up
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
     }
     else if (per-target state is LOCK_ISSUED) {
       // queue this operation up
       if (run out of Operation Element resources)
         do RESTORE_OP_ELEM_RESOURCES;
       create Operation Element, queue it up in Operation Table;
       exit;
     }
     else {  // per-target state is LOCK_GRANTED
       if (# of queued operations to this target <= piggyback_threshold) {
         // do piggyback, queue up this operation
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
       else {
         // don’t do piggyback
         issue all queued Operation Elements to this target;
         exit;
       }
     }
   }
 }
 UNLOCK() {
   if (Target Element for this target does not exist) {
      // create Target Element for this target
      if (run out of Target Element resources)
        do RESTORE_TARGET_ELEM_RESOURCES;
      create Target Element, queue it up in Operation Table;
      set per-target state to LOCK_GRANTED;
   }
   // now Target Element for this target exists in Operation Table
   if (per-target state is LOCK_CALLED) {
     // single operation optimization
     issue LOCK-OP-UNLOCK to this target;
     do PROGRESS_WAIT until RMA_ACK is received;
     free this Target Element;
     set global state to UNSET;
     exit;
   }
   if (per-target state is LOCK_ISSUED) {
     do PROGRESS_WAIT until lock granted message is received;
     set per-target state to LOCK_GRANTED;
   }
   if (per-target state is LOCK_GRANTED) {
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE+UNLOCK packet to target;
     do PROGRESS_WAIT until operation list for this target is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_ACK packet from target;
     free this Target Element;
     set global state to UNSET;
   }
 }

Here the Target Element may be initialized in two situations: either in MPI_WIN_LOCK, or in an operation routine when we cannot find the corresponding Target Element in the Operation Table. In the first situation, the Target Element is created with state LOCK_CALLED; in the second, it is created with state LOCK_GRANTED.

Note that when the Target Element is initialized in an operation routine, we have lost the lock type (SHARED/EXCLUSIVE) and lock mode (MODE_NO_CHECK) for this target. Losing the lock type is fine, because it is not needed in the LOCK_GRANTED state; losing the lock mode is less ideal. For now we just send the UNLOCK message anyway; if the UNLOCK packet handler cannot find the corresponding origin in the lock queue (which means this LOCK was with MODE_NO_CHECK), it simply ignores the UNLOCK message.

Note that we do not free the Target Element merely because we have issued and completed all operations to a target; we free it only when we have received the RMA_ACK packet from that target.

6. Algorithm for Lock_all-Unlock_all

Lock-all-states.jpg Legends.jpg

 LOCK_ALL() {
   set global state to LOCK_ALL_CALLED;
 }
 RMA_OPS() {
   if (global state is LOCK_ALL_CALLED) {
     
   }
   if (global state is LOCK_ALL_ISSUED) {
   }
   if (global state is LOCK_ALL_GRANTED) {
   }
 }

7. Algorithm for Flush

RMA + threads

The situations that need to be considered are:

1. PUT/GET/ACC + PUT/GET/ACC

2. PUT/GET/ACC + FENCE without MODE_NOSUCCEED

3. PUT/GET/ACC + FLUSH

4. FLUSH + FLUSH

Note that PUT/GET/ACC + FENCE with MODE_NOSUCCEED, PUT/GET/ACC + COMPLETE, and PUT/GET/ACC + UNLOCK/UNLOCK_ALL are not correct programs.

RMA + shared memory

See comments "Notes for memory barriers in RMA synchronizations" in src/mpid/ch3/src/ch3u_rma_sync.c.