New RMA Design


Goals

The goals of the RMA infrastructure are as follows:

  1. No usage of O(P) data structures, where 'P' is the number of processes. Data structure sizes should be constant and user-configurable. The general rule is that the user should be able to trade more memory for better performance.
  2. Reduce redundant messages where possible (e.g., extra acknowledgment packets).
  3. Reduce process synchronization where possible (e.g., FENCE need not always be synchronizing).
  4. Ensure code is thread-safe (e.g., avoid thread-private data structures).
  5. Ensure integration of RMA code into the overall MPICH progress engine.


Data structures

1. Overview:

We use a new 3D data structure, Operation Table, to store posted RMA operations.

[Figure: Op-list-slots.jpg (Operation Table: slots, per-target lists, and per-target operation lists)]

There are three kinds of data structures involved:

(1) Operation Element: contains all the origin/target information needed for this RMA operation, plus a request handle that is set if the operation has been issued but did not complete immediately. When the user posts a new operation, the runtime creates a new operation element and enqueues it onto the corresponding operation list; when the operation completes, the runtime dequeues the element from the list and frees it.

 struct OP_ELEM {
   all OP information;
   MPID_Request *req;
   struct OP_ELEM * prev, * next;
 }

(2) Target Element: contains a pointer to an operation list that stores all RMA operations to the same target. It also contains the epoch state of the target in cases where a per-target epoch is used (details below). When the origin first communicates with a target, MPICH creates a new target element and enqueues it onto the corresponding target list. When the origin finishes communicating with that target, or when all internal resources for targets are used up, MPICH dequeues and frees the target element.

 struct TARGET_ELEM {
   struct OP_ELEM * ops_list;
   struct TARGET_ELEM * prev, * next;
   int rank;
   int ops_count;
   int target_state; // LOCK_CALLED/LOCK_ISSUED/LOCK_GRANTED 
   int target_type; // SHARED/EXCLUSIVE
   int target_mode; // MODE_NO_CHECK
 }

(3) Slot: contains a pointer to a target list. Targets are currently distributed among slots in a round-robin fashion. During window creation, MPICH allocates a fixed-size slot array on the window (the size of the slot array is controllable by the user).

 struct SLOT {
   struct TARGET_ELEM * targets_list;
 }

Note that there is a variable on the window indicating how many slots have a non-NULL target list. It helps optimize ISSUE_OPS and CLEANUP_OPS (see the synchronization algorithms below) by avoiding a traversal of all slots.

2. Performance issues

For every RMA operation routine, MPICH needs to search the corresponding target list to find the correct target, which may introduce some overhead when posting operations. With a fixed-size slot array, the lookup overhead is linear in the number of targets the origin is actively communicating with. For applications that only communicate with a small number of processes, the operation posting time is usually constant; but if the application communicates with a large number of processes, or with more than one process that maps to the same slot, the linear search of the slot's target list can cause overhead.
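
For illustration, a lookup under this scheme might look like the following C sketch; the structure and function names (rma_slot, rma_target, find_target) are hypothetical rather than the actual MPICH identifiers, and the slot index is assumed to be the target rank modulo the slot-array size (the round-robin distribution described above).

 #include <stddef.h>
 
 /* Hypothetical sketch of the per-slot target lookup; the names are
  * illustrative and do not match the actual MPICH identifiers. */
 struct rma_target {
     int rank;                    /* rank of the target this element refers to */
     struct rma_target *next;     /* next target that maps to the same slot */
     /* ... operation list, per-target state, ... */
 };
 
 struct rma_slot {
     struct rma_target *targets_list;   /* head of this slot's target list */
 };
 
 /* Round-robin mapping: the target rank selects a slot, then the slot's
  * target list is searched linearly. */
 static struct rma_target *find_target(struct rma_slot *slots,
                                       int num_slots, int target_rank)
 {
     struct rma_target *t = slots[target_rank % num_slots].targets_list;
     while (t != NULL && t->rank != target_rank)
         t = t->next;
     return t;   /* NULL if the origin has not communicated with this target yet */
 }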

3. Potential benefits

  1. Operations to different targets are not mixed in a single list, but are stored in separate lists. This has many benefits. For example, for single-target epochs we can tell whether all operations to a target have completed (e.g., during a FLUSH operation), and we can find the last operation to a target easily without counting the total number of operations. For multiple-target epochs, we can optimize the garbage collection function by stopping the search in the current target's operation list as soon as we meet an incomplete operation (since the following operations in that list are likely also incomplete) and jumping to another target's operation list.
  2. Scalability: the size of the slot array is fixed and set by the user, so the structure remains scalable when handling a large number of processes on the window.
  3. The garbage collection function can easily distinguish between operations that are not issued and operations that are issued but not completed yet.


Operation and Target Element Pools

We use element pools for the operation elements and the target elements. The pools are created at MPI initialization time and are shared among all windows. The size of each pool is fixed and can be configured by the user at runtime.

When we need a new element, we first check if one is available in the corresponding pool. If not, we will call the RESTORE_OP_ELEM_RESOURCES or RESTORE_TARGET_ELEM_RESOURCES functions (see Basic routines) to free up existing elements.
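
As a rough illustration, such a pool could be a fixed-size set of elements threaded onto a free list, as in the sketch below; elem_pool, pool_get, and pool_put are hypothetical names, not the actual MPICH routines.

 #include <stddef.h>
 
 /* Hypothetical fixed-size element pool with a free list; the names are
  * illustrative and do not match the actual MPICH identifiers. */
 struct pool_elem {
     struct pool_elem *next;        /* next free element */
     /* ... payload: an Operation Element or a Target Element ... */
 };
 
 struct elem_pool {
     struct pool_elem *free_list;   /* elements currently available */
 };
 
 /* Take one element, or return NULL if the pool is exhausted; the caller
  * is then expected to call the corresponding RESTORE_*_RESOURCES routine
  * and retry. */
 static struct pool_elem *pool_get(struct elem_pool *pool)
 {
     struct pool_elem *e = pool->free_list;
     if (e != NULL)
         pool->free_list = e->next;
     return e;
 }
 
 /* Return an element to the pool once its operation or target completes. */
 static void pool_put(struct elem_pool *pool, struct pool_elem *e)
 {
     e->next = pool->free_list;
     pool->free_list = e;
 }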


Epoch States

RMA epoch states are of two forms: window states and per-target states. Window states are states that affect all targets on the window, while per-target states only affect the individual target they are set for.

There are nine window states:

  1. MPIDI_EPOCH_UNSET
  2. MPIDI_EPOCH_FENCE_ISSUED
  3. MPIDI_EPOCH_FENCE_GRANTED
  4. MPIDI_EPOCH_PSCW_ISSUED
  5. MPIDI_EPOCH_PSCW_GRANTED
  6. MPIDI_EPOCH_LOCK_ALL_CALLED
  7. MPIDI_EPOCH_LOCK_ALL_ISSUED
  8. MPIDI_EPOCH_LOCK_ALL_GRANTED
  9. MPIDI_EPOCH_PER_TARGET

There are three per-target states:

  1. MPIDI_EPOCH_LOCK_CALLED
  2. MPIDI_EPOCH_LOCK_ISSUED
  3. MPIDI_EPOCH_LOCK_GRANTED
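
For illustration only, the two state sets could be carried as plain enumerations, one on the window and one on each Target Element; the declarations below simply restate the lists above and are not taken from the MPICH source.

 /* Sketch only: window-wide epoch state (one per window). */
 enum window_epoch_state {
     MPIDI_EPOCH_UNSET,
     MPIDI_EPOCH_FENCE_ISSUED,
     MPIDI_EPOCH_FENCE_GRANTED,
     MPIDI_EPOCH_PSCW_ISSUED,
     MPIDI_EPOCH_PSCW_GRANTED,
     MPIDI_EPOCH_LOCK_ALL_CALLED,
     MPIDI_EPOCH_LOCK_ALL_ISSUED,
     MPIDI_EPOCH_LOCK_ALL_GRANTED,
     MPIDI_EPOCH_PER_TARGET
 };
 
 /* Sketch only: per-target epoch state (one per Target Element). */
 enum target_epoch_state {
     MPIDI_EPOCH_LOCK_CALLED,
     MPIDI_EPOCH_LOCK_ISSUED,
     MPIDI_EPOCH_LOCK_GRANTED
 };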


Basic routines

1. ISSUE_OPS_TARGET / ISSUE_OPS_WIN / ISSUE_OPS_GLOBAL: Nonblocking calls. They try to issue as many pending operations as possible for one target / one window / all windows, respectively. They return the number of operations issued.

Note that for ISSUE_OPS_WIN, the function will issue operations in a round-robin fashion to prevent overloading one target with lots of messages at once.

2. CLEANUP_OPS_TARGET / CLEANUP_OPS_WIN / CLEANUP_OPS_GLOBAL: Nonblocking calls. They try to find and clean up as many completed operations as possible for one target / one window / all windows, respectively. They return the number of operations cleaned up.

3. MAKE_RMA_PROGRESS_TARGET / MAKE_RMA_PROGRESS_WIN / MAKE_RMA_PROGRESS_GLOBAL: Nonblocking calls. They try to issue pending operations and clean up issued operations for one target / one window / all windows, respectively, as far as possible.

 MAKE_RMA_PROGRESS_GLOBAL () {
   do {
     x = ISSUE_OPS_GLOBAL();
     y = CLEANUP_OPS_GLOBAL();
   } while (x+y != 0);
 }

MAKE_RMA_PROGRESS_GLOBAL is called from the progress engine; MAKE_RMA_PROGRESS_WIN is called in the multiple-target cases (FENCE/PSCW/LOCK_ALL/FLUSH_ALL); MAKE_RMA_PROGRESS_TARGET is called in the single-target cases (LOCK/FLUSH).

4. PROGRESS_WAIT: a blocking call. It keeps poking the progress engine until a completion signal is caught. Note that in the multithreaded case, a thread blocked in this function may yield the CPU to other threads.
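
Conceptually, PROGRESS_WAIT behaves like the small sketch below; completion_signal_caught() and poke_progress_engine() are hypothetical stand-ins for the real progress-engine hooks.

 /* Conceptual sketch only; the two helpers are hypothetical stand-ins,
  * not actual MPICH APIs. */
 int  completion_signal_caught(void);   /* has the awaited completion arrived? */
 void poke_progress_engine(void);       /* one pass through the progress engine */
 
 static void progress_wait(void)
 {
     while (!completion_signal_caught()) {
         /* In the multithreaded case, this is where a blocked thread
          * may yield the CPU to other threads. */
         poke_progress_engine();
     }
 }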

5. RESTORE_TARGET_ELEM_RESOURCES: a blocking call. It tries to make progress and poke the progress engine until Target Element resources become available.

 RESTORE_TARGET_ELEM_RESOURCES() {
   if (window_state == FENCE_GRANTED || window_state == PSCW_GRANTED || window_state == LOCK_ALL_GRANTED || window_state == LOCK_ALL_CALLED) {
     pick one Target Element K;
     do MAKE_RMA_PROGRESS and PROGRESS_WAIT until K’s operation list is empty;
     send (or piggyback) RMA_DONE packet to K;   // don't wait for RMA_ACK to arrive
     win_ptr->ack_count++;
     free Target Element K;
   }
  
   else if (window_state == PER_TARGET) {
     pick one Target Element K in priority order LOCK_GRANTED, LOCK_ISSUED, LOCK_CALLED;
     do MAKE_RMA_PROGRESS and PROGRESS_WAIT until K’s operation list is empty;
     send (or piggyback) RMA_DONE packet to K;   // don't wait for RMA_ACK to arrive
     win_ptr->ack_count++;
     free Target Element K;
   }
 }

Note that for the multiple-target cases (FENCE/PSCW/LOCK_ALL) we send RMA_DONE packets but do not need to wait for the RMA_ACK packets inside RESTORE_TARGET_ELEM_RESOURCES, because we can simply count the RMA_DONE packets sent and wait for all of the acknowledgments in the closing synchronization call. We cannot do the same thing for the single-target case (LOCK), because that would require an O(P) array to track the number of RMA_DONE packets sent to each target.

6. RESTORE_OP_ELEM_RESOURCES: a blocking call. It tries to make progress and poke the progress engine until Operation Element resources become available. It carries a risk of deadlock due to the piggyback optimization (see Performance optimizations).

Suppose every process in the epoch queues up operations in order to piggyback later during the synchronization. At some point all Operation Element resources are used up, and everyone is blocked waiting for resources to become available. Since no one makes progress on issuing operations, resources will never become available again, which leads to a deadlock. To avoid the deadlock, we need to disable the piggyback optimization so that operations are actually issued. We define a global variable piggyback_flag to indicate whether piggyback is currently enabled. The following pseudo-code shows when to disable the piggyback optimization.

 x = GET_OP_ELEM();
 if (x == NULL) {
   // run out of resources
 
   // first try to make progress on all windows
   MAKE_RMA_PROGRESS_GLOBAL();
 
   x = GET_OP_ELEM();
   while (x == NULL) {
     // if still run out of resources, disable piggyback, make progress
     piggyback_flag = 0;
     MAKE_RMA_PROGRESS_GLOBAL();
     PROGRESS_WAIT();
     x = GET_OP_ELEM();
     piggyback_flag = 1;
   }
 }

Algorithm for FENCE

[Figures: Fence-states.jpg (FENCE state diagram); Legends.jpg (legend)]

Note that there is a request pointer on the window, called fence_req, which stores the request of the IBARRIER. For details on the memory barrier semantics, see RMA + shared memory.

 FENCE() {
   // Memory barrier for shared memory operations
   if (SHM is allocated) do memory barrier;
   
   // If a previous request exists, clean it up
   if (win_ptr->fence_req != NULL) {  // A fence has been previously issued
     if (MPI_MODE_NOPRECEDE) {
       decrement ref count of fence_req by 1;  // let the progress engine delete this request
       set win_ptr->fence_req to NULL;
     }
     else {
       do PROGRESS_WAIT until previous IBARRIER has completed;  // fence_req on window is deleted now
       free win_ptr->fence_req and set it to NULL;
       set window state to FENCE_GRANTED;
     }
   }
   
   // Synchronize as needed
   if (MPI_MODE_NOPRECEDE) { // No operations to complete
     if (SHM is allocated) perform a local BARRIER;  // So that we can issue local operations immediately
     do a global IBARRIER, store the request to fence_req on window;
     set global state to FENCE_ISSUED;
   }
   else { // some operations need to be completed
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE packets to all active targets and increment ack_counter;
     do PROGRESS_WAIT until Operation Table is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_DONE_ACK packets from all active targets (decrement ack_counter, wait until it reaches 0);
      do a global IBARRIER;
     set global state to FENCE_ISSUED; // I am done with all my outgoing messages
     do PROGRESS_WAIT until IBARRIER is completed;  // fence_req on window is freed now
     set global state to FENCE_GRANTED; // I know that everyone else is done
   }
   if (MPI_MODE_NOSUCCEED) { // No operations will start after this
     set global state to UNSET;
   }
   
   // Memory barrier for shared memory operations
   if (SHM is allocated) do memory barrier;
 }
 RMA_OP() {
   // Issue local operations immediately
   if (SHM is allocated && target is on the same node) {
       issue operation directly
       return;
   }
   
   // Non-local operation
   do {
       if (window state is FENCE_GRANTED) {
           if (#ops <= PIGGYBACK_THRESHOLD)
               queue up operation and break;  // might need to wait for target and op elements to be available
           else
               issue operation and break;  // if issue didn't complete, might need to queue up request
       }
       else if (window state is FENCE_ISSUED) {
            do progress wait until IBARRIER has completed || (we have a target element && an op element);
           if (IBARRIER completed) {
               set window state to FENCE_GRANTED
               continue;
           }
           else {
               queue up operation and break;  // might need to wait for target and op elements to be available
           }
       }
    } while (1);
 }

Note that if the very last FENCE is not called with MPI_MODE_NOSUCCEED, the global state stays FENCE_GRANTED; other RMA synchronizations will then transition the global state directly to the correct state. If the very last FENCE is called with MPI_MODE_NOPRECEDE, the global state stays FENCE_ISSUED; other RMA synchronizations will likewise transition the global state directly to the correct state, but fence_req on the window has not been freed yet, and it will be freed either when we encounter another MPI_WIN_FENCE or when we reach MPI_WIN_FREE.


Algorithm for POST-START-COMPLETE-WAIT

[Figures: Pscw-states.jpg (PSCW state diagram); Legends.jpg (legend)]

The algorithm for PSCW is roughly the same as the algorithm for FENCE, except for the following points:

(1) Every place we did an IBARRIER in the FENCE algorithm, we now do ISENDs (on the target) and IRECVs (on the origin) among the group of processes calling PSCW.

(2) In WIN_POST, if SHM is allocated, the targets wait until all local ISENDs have completed; similarly, in WIN_START, if SHM is allocated, the origins wait until all local IRECVs have completed.

(3) Every place we waited for the IBARRIER request to complete in the FENCE algorithm, we now wait for an array of requests (from ISENDs/IRECVs) to complete instead. The request array is allocated on the window. The size of this array is as large as the process group size used in PSCW, and can be large if the application uses PSCW with a large number of processes. However, such applications should be using FENCE instead of PSCW.
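
For example, the wait in WIN_COMPLETE or WIN_WAIT might conceptually look like the sketch below; request_is_complete() and poke_progress_engine() are hypothetical helpers, and the request array and group size are assumed to live on the window as described above.

 /* Conceptual sketch only; helper names are hypothetical, not MPICH APIs. */
 struct MPID_Request;
 int  request_is_complete(struct MPID_Request *req);
 void poke_progress_engine(void);
 
 /* Wait until every ISEND/IRECV request issued for the PSCW group has
  * completed, poking the progress engine in between. */
 static void wait_for_pscw_requests(struct MPID_Request **reqs, int group_size)
 {
     int i;
     for (i = 0; i < group_size; i++) {
         while (!request_is_complete(reqs[i]))
             poke_progress_engine();
     }
 }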


Algorithm for LOCK-UNLOCK

Here the Target Element may be initialized in two situations: either in WIN_LOCK, or in an operation routine when we cannot find the corresponding Target Element in the Operation Table. In the first situation the Target Element is created with state LOCK_CALLED; in the second, it is created with state LOCK_GRANTED.

When the Target Element is initialized in an operation routine, we lose the lock type (SHARED/EXCLUSIVE) and lock mode (MODE_NO_CHECK) for this target. Losing the lock type is acceptable, because we do not need it in the LOCK_GRANTED state; losing the lock mode is more of a problem. For now we send the UNLOCK message anyway: if the UNLOCK packet handler cannot find the corresponding origin in the lock queue (which means this LOCK was taken with MODE_NO_CHECK), it simply ignores the UNLOCK message.

Note that we do not free the Target Element merely because we have issued and completed all operations to this target; we free the Target Element only after we have sent the RMA_DONE packet and received the RMA_ACK packet from that target.

[Figures: Lock-states.jpg (LOCK state diagram); Legends.jpg (legend)]

 WIN_LOCK() {
   set global state to PER_TARGET;
   if (target_rank == my_rank) {
     do PROGRESS_WAIT until I got the lock;
     exit;
   }
   if (SHM is allocated && target is a local process) {
     send lock request message;
     do PROGRESS_WAIT until receiving lock granted message;
     exit;
   }
   // create Target Element for this target
   if (run out of Target Element resources)
     do RESTORE_TARGET_ELEM_RESOURCES;
   create Target Element, queue it up in Operation Table;
   if (MPI_MODE_NOCHECK)
     set per-target state to LOCK_GRANTED;
   else
     set per-target state to LOCK_CALLED;
 }
 RMA_OP() {
   if (global state is PER_TARGET) {
     if (Target Element for this target does not exist) {
       // create Target Element for this target
       if (run out of Target Element resources)
         do RESTORE_TARGET_ELEM_RESOURCES;
       create Target Element, queue it up in Operation Table;
       set per-target state to LOCK_GRANTED;
     }
     // now Target Element for this target exists in Operation Table
     if (per-target state is LOCK_CALLED) {
       if (ops_list is empty && basic_datatype && size <= single_op_opt_threshold) {
         // queue this operation up for single operation optimization
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
       else {
         send lock request message;
         set per-target state to LOCK_ISSUED;
         // queue this operation up
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
     }
     else if (per-target state is LOCK_ISSUED) {
       // queue this operation up
       if (run out of Operation Element resources)
         do RESTORE_OP_ELEM_RESOURCES;
       create Operation Element, queue it up in Operation Table;
       exit;
     }
     else {  // per-target state is LOCK_GRANTED
       if (# of queued operations to this target <= piggyback_threshold) {
         // do piggyback, queue up this operation
         if (run out of Operation Element resources)
           do RESTORE_OP_ELEM_RESOURCES;
         create Operation Element, queue it up in Operation Table;
         exit;
       }
       else {
         // don’t do piggyback
         issue all queued Operation Elements to this target;
         exit;
       }
     }
   }
 }
 WIN_UNLOCK() {
   if (Target Element for this target does not exist) {
      // create Target Element for this target
      if (run out of Target Element resources)
        do RESTORE_TARGET_ELEM_RESOURCES;
      create Target Element, queue it up in Operation Table;
      set per-target state to LOCK_GRANTED;
   }
   // now Target Element for this target exists in Operation Table
   if (per-target state is LOCK_CALLED) {
     // single operation optimization
     issue LOCK-OP-UNLOCK to this target;
     do PROGRESS_WAIT until RMA_ACK is received;
     free this Target Element;
     set global state to UNSET;
     exit;
   }
   if (per-target state is LOCK_ISSUED) {
     do PROGRESS_WAIT until lock granted message is received;
     set per-target state to LOCK_GRANTED;
   }
   if (per-target state is LOCK_GRANTED) {
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE+UNLOCK packet to target;
     do PROGRESS_WAIT until operation list for this target is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_ACK packet from target;
     free this Target Element;
     set global state to UNSET;
   }
 }


Algorithm for LOCK_ALL-UNLOCK_ALL

One difference between the algorithms for LOCK_ALL-UNLOCK_ALL and LOCK-UNLOCK is that we don't create any Target Elements in the WIN_LOCK_ALL call (we create them in the operation calls), whereas we do create a Target Element in the WIN_LOCK call. The reason is that in WIN_LOCK_ALL it is very likely that the process does not talk to every other process during the epoch (it may even talk with only a very small group of processes), so we want to avoid creating unnecessary Target Elements. In WIN_LOCK, by contrast, it is very likely that the process will talk with this target during the epoch, so we create the Target Element right away in WIN_LOCK.

We use two different protocols for LOCK_ALL-UNLOCK_ALL:

(1) Per-target protocol: when the number of targets I am concurrently talking with is small and Target Element resources are still available, we send a lock request whenever we create a Target Element for a target. We then wait for the lock-granted message from that target and start issuing operations. The global state stays LOCK_ALL_CALLED under this protocol.

(2) Global protocol: when the number of targets I am concurrently talking with becomes large enough to use up the Target Element resources, we send lock requests to every process on the window (the global state changes to LOCK_ALL_ISSUED) and then wait for all of them to be granted. Once that finishes, the global state is set to LOCK_ALL_GRANTED.

The reason for using two protocols is that using only the per-target protocol is not always correct. When we create a new Target Element, we don't know whether the lock for that target has already been issued or granted. It may be the very first time we create a Target Element for that target (the lock has not been issued or granted), or we may have created a Target Element for that target before and freed it because we ran out of resources (the lock is already granted). Therefore we need the global protocol to wait until the locks on all processes are granted; after that, whenever we create a new Target Element, we always set its per-target state to LOCK_GRANTED. Using only the global protocol is correct, but it performs poorly when the number of active targets is small: if I am only talking with a very small number of targets, I still always have to wait for all locks to be granted at the beginning, which causes unnecessary communication overhead. Therefore we use the two protocols in different situations: when the number of active targets is small, we use the per-target protocol; when the number of active targets becomes large and Target Element resources are used up, we switch to the global protocol.

Note that when WIN_LOCK_ALL is called with MODE_NOCHECK, we go directly into the global protocol (but do not need to wait for lock-granted messages); the global state is set to LOCK_ALL_GRANTED at the beginning.
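
A rough sketch of the switch between the two protocols is shown below. It assumes hypothetical helpers (alloc_target_elem, issue_lock_all_requests, wait_all_locks_granted, restore_target_elem_resources) and only illustrates the decision point when Target Element allocation fails under the per-target protocol.

 #include <stddef.h>
 
 /* Conceptual sketch of the protocol switch; all names are hypothetical. */
 enum { LOCK_ALL_CALLED, LOCK_ALL_ISSUED, LOCK_ALL_GRANTED };
 
 struct target_elem;
 struct target_elem *alloc_target_elem(int rank);   /* NULL if the pool is exhausted */
 void issue_lock_all_requests(void);      /* send a lock request to every process */
 void wait_all_locks_granted(void);       /* PROGRESS_WAIT until all locks are granted */
 void restore_target_elem_resources(void);   /* free an existing Target Element */
 
 static struct target_elem *get_target(int *window_state, int target_rank)
 {
     struct target_elem *t = alloc_target_elem(target_rank);
     if (t == NULL && *window_state == LOCK_ALL_CALLED) {
         /* Target Element resources are exhausted under the per-target
          * protocol: switch to the global protocol. */
         *window_state = LOCK_ALL_ISSUED;
         issue_lock_all_requests();
         wait_all_locks_granted();
         *window_state = LOCK_ALL_GRANTED;
         /* Free an existing element and retry; from now on, new Target
          * Elements start in the per-target LOCK_GRANTED state. */
         restore_target_elem_resources();
         t = alloc_target_elem(target_rank);
     }
     return t;
 }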

[Figures: Lock-all-states.jpg (LOCK_ALL state diagram); Legends.jpg (legend)]

(Simplified version of pseudo-code)

 LOCK_ALL() {
   if (MPI_MODE_NOCHECK)
     set global state to LOCK_ALL_GRANTED;
   else
     set global state to LOCK_ALL_CALLED;
 }
 RMA_OPS() {
   if (global state is LOCK_ALL_CALLED) {
     if (no Target Element for this target) {
       // create a Target Element for this target, set per-target state to LOCK_CALLED
     }
   }
 
   if (global state is LOCK_ALL_GRANTED) {
     if (no Target Element for this target) {
       // create a Target Element for this target, set per-target state to LOCK_GRANTED
     }
   }
 
   if (per-target state is LOCK_CALLED) {
     // if single op optimization is satisfied, queue it up, exit
     // if single op optimization is not satisfied, queue it up, send lock request, set per-target state to LOCK_ISSUED, exit
   }
 
   if (per-target state is LOCK_ISSUED) {
     // queue it up, exit
   }
 
   if (per-target state is LOCK_GRANTED) {
     // if piggyback_threshold is not reached, queue it up, exit
     // if piggyback_threshold is reached, call MAKE_RMA_PROGRESS
   }
 }
 UNLOCK_ALL () {
   while (Operation Table is not empty) {
     call MAKE_RMA_PROGRESS;  // issue as fast as possible (e.g. don't wait on one target(slot) to finish before starting the next target (slot))
     call PROGRESS_WAIT;
   }
   set global state to UNSET;
 }


Algorithm for FLUSH

WIN_FLUSH is roughly the same as WIN_UNLOCK, except that it does not issue (or piggyback) the UNLOCK message, only the RMA_DONE message. It also does not change the global state.


Performance optimizations

1. Piggyback

  1. Single short operation optimization: if there is only one operation between WIN_LOCK and WIN_UNLOCK, its datatype is a basic datatype, and its data volume is smaller than single_op_opt_threshold, we piggyback the LOCK and UNLOCK messages on this operation. Specifically, we do not issue a lock request packet or an unlock packet; instead, we send a single packet in WIN_UNLOCK that contains the lock request, the operation data, and the unlock flag.
  2. Piggybacked RMA_DONE (UNLOCK) packet: if only a small number of operations is posted, we piggyback the RMA_DONE packet on the last operation. Specifically, when the number of queued operations is smaller than piggyback_threshold, we send the lock request packet but continue to queue operations up until we reach the ending synchronization. If there is a GET operation in the operation list, we move it to the tail; in that case, the completion of that GET indicates the completion of all RMA operations on the target. If there are no GET operations in the operation list, we piggyback the RMA_DONE packet on the last operation and wait for the RMA_ACK packet to arrive. (A sketch of how such piggybacked flags might be carried in a packet header follows this list.)
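
As an illustration of the piggybacking described above, the sketch below shows how LOCK, UNLOCK, and RMA_DONE indications could be carried as flags in an operation packet header; the layout and names are hypothetical and do not reflect the actual MPICH packet format.

 #include <stdint.h>
 
 /* Hypothetical packet-header sketch; not the actual MPICH packet layout. */
 #define PKT_FLAG_LOCK      0x1u   /* packet also carries a lock request */
 #define PKT_FLAG_UNLOCK    0x2u   /* packet also releases the lock */
 #define PKT_FLAG_RMA_DONE  0x4u   /* packet is the last RMA operation (ack requested) */
 
 struct rma_op_pkt {
     uint32_t flags;        /* combination of the PKT_FLAG_* bits above */
     int      lock_type;    /* SHARED/EXCLUSIVE; valid only if PKT_FLAG_LOCK is set */
     /* ... origin/target addresses, datatype, count, operation data ... */
 };
 
 /* Single short operation between WIN_LOCK and WIN_UNLOCK: one packet
  * carries the lock request, the operation itself, and the unlock flag. */
 static void mark_single_op_packet(struct rma_op_pkt *pkt, int lock_type)
 {
     pkt->flags = PKT_FLAG_LOCK | PKT_FLAG_UNLOCK | PKT_FLAG_RMA_DONE;
     pkt->lock_type = lock_type;
 }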


Correctness issues

RMA + threads

PROGRESS_WAIT can only happen in the main function calls (RMA operations and RMA synchronizations), so we need to make sure that RMA works correctly with multithreading. The interleaving situations that need to be considered are as follows:

1. PUT/GET/ACC + PUT/GET/ACC

2. PUT/GET/ACC + WIN_FENCE (no MODE_NOSUCCEED and no MODE_NOPRECEDE)

3. PUT/GET/ACC + WIN_FLUSH

4. WIN_FLUSH + WIN_FLUSH

Apart from these, any other interleaving corresponds to an incorrect program, so the runtime does not need to handle it.

Drop the FLUSH state to enable multithreading in WIN_FLUSH: in the old RMA implementation, only one thread could enter WIN_FLUSH at a time. Other threads were blocked at the entrance of WIN_FLUSH until that thread finished its WIN_FLUSH. This is inefficient because it causes long idle waiting times while the thread waits for the completion of all operations. In the new design, we enable multithreading in WIN_FLUSH so that while one thread is waiting for the completion of all operations, it can yield to other threads' WIN_FLUSH calls.

To achieve this, we need to drop the FLUSH state used in the old design. The FLUSH state indicated whether the current FLUSH epoch had completed, but such a state is not thread-safe once it can be set by multiple threads. Instead, we piggyback the FLUSH_ACK on the last packet. Specifically, each thread creates a response request and attaches it to its last packet; that response request is completed and freed only when the FLUSH_ACK packet arrives.


RMA + shared memory

When SHM is allocated for the RMA window, we need to add memory barriers at the proper places in the RMA synchronization routines to guarantee the ordering of read/write operations, so that any operations after the synchronization calls will see the correct data.

There are four kinds of operations involved in the following explanation:

(1) Local loads/stores: any operations happening outside an RMA epoch that access the process's own window memory.

(2) SHM operations: any operations happening inside an RMA epoch. They may access any process's window memory, and include both direct loads/stores and RMA operations that are internally implemented as direct loads/stores in the MPI implementation.

(3) PROC_SYNC: synchronizations among processes by sending/receiving messages.

(4) MEM_SYNC: a full memory barrier. It ensures the ordering of read/write operations on each process.
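
MEM_SYNC corresponds to a full memory barrier. For example, with C11 atomics it could be expressed as in the sketch below; this is illustrative only and is not the barrier primitive MPICH actually uses.

 #include <stdatomic.h>
 
 /* Full memory barrier: no load or store may be reordered across it. */
 static inline void mem_sync(void)
 {
     atomic_thread_fence(memory_order_seq_cst);
 }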


The following explains when memory barriers need to be issued in the different RMA synchronizations.

1. FENCE synchronization

              RANK 0                           RANK 1
     
      (local loads/stores)            (local loads/stores)
     
          WIN_FENCE {                      WIN_FENCE {
              MEM_SYNC                       MEM_SYNC
              PROC_SYNC -------------------- PROC_SYNC
              MEM_SYNC                       MEM_SYNC
          }                              }
   
       (SHM operations)                (SHM operations)
   
          WIN_FENCE {                    WIN_FENCE {
              MEM_SYNC                       MEM_SYNC
              PROC_SYNC -------------------- PROC_SYNC
              MEM_SYNC                       MEM_SYNC
          }                              }
   
     (local loads/stores)             (local loads/stores)

We need MEM_SYNC before and after PROC_SYNC for both starting WIN_FENCE and ending WIN_FENCE, to ensure the ordering between local loads/stores and PROC_SYNC in starting WIN_FENCE (and vice versa in ending WIN_FENCE), and the ordering between PROC_SYNC and SHM operations in starting WIN_FENCE (and vice versa for ending WIN_FENCE).

In starting WIN_FENCE, the MEM_SYNC before PROC_SYNC essentially exposes previous local loads/stores to other processes; after PROC_SYNC, each process knows that everyone else already exposed their local loads/stores; the MEM_SYNC after PROC_SYNC ensures that my following SHM operations will happen after PROC_SYNC and will see the latest data on other processes.

In ending WIN_FENCE, the MEM_SYNC before PROC_SYNC essentially exposes previous SHM operations to other processes; after PROC_SYNC, each process knows everyone else already exposed their SHM operations; the MEM_SYNC after PROC_SYNC ensures that my following local loads/stores will happen after PROC_SYNC and will see the latest data in my memory region.

2. POST-START-COMPLETE-WAIT synchronization

             RANK 0                        RANK 1
  
                                    (local loads/stores)
          
          WIN_START {                  WIN_POST {
                                           MEM_SYNC
              PROC_SYNC ------------------ PROC_SYNC
              MEM_SYNC
          }                             }
          
        (SHM operations)
  
          WIN_COMPLETE {              WIN_WAIT/TEST {
              MEM_SYNC
              PROC_SYNC ----------------- PROC_SYNC
                                          MEM_SYNC
          }                             }
    
                                    (local loads/stores)

We need MEM_SYNC before PROC_SYNC for WIN_POST and WIN_COMPLETE, and MEM_SYNC after PROC_SYNC in WIN_START and WIN_WAIT/TEST, to ensure the ordering between local loads/stores and PROC_SYNC in WIN_POST (and vice versa in WIN_WAIT/TEST), and the ordering between PROC_SYNC and SHM operations in WIN_START (and vice versa in WIN_COMPLETE).

In WIN_POST, the MEM_SYNC before PROC_SYNC essentially exposes previous local loads/stores to the group of origin processes; after PROC_SYNC, the origin processes know that all target processes have already exposed their local loads/stores; in WIN_START, the MEM_SYNC after PROC_SYNC ensures that the following SHM operations will happen after PROC_SYNC and will see the latest data on the target processes.

In WIN_COMPLETE, the MEM_SYNC before PROC_SYNC essentially exposes previous SHM operations to the group of target processes; after PROC_SYNC, the target processes know that all origin processes have already exposed their SHM operations; in WIN_WAIT/TEST, the MEM_SYNC after PROC_SYNC ensures that the following local loads/stores will happen after PROC_SYNC and will see the latest data in my memory region.

3. Passive target synchronization

Note that for passive target synchronization, when a process wants to access its own window memory, it must do those accesses within a passive epoch (turning the local loads/stores into SHM operations); otherwise those local accesses may be reordered and lead to wrong results (clarification of this point is needed in the MPI specification). See the following for details.

             RANK 0                          RANK 1
   
                                       WIN_LOCK(target=1) {
                                           PROC_SYNC (lock granted)
                                           MEM_SYNC
                                       }
  
                                       (SHM operations)
  
                                       WIN_UNLOCK(target=1) {
                                           MEM_SYNC
                                           PROC_SYNC (lock released)
                                       }
  
        PROC_SYNC -------------------- PROC_SYNC
  
        WIN_LOCK (target=1) {
            PROC_SYNC (lock granted)
            MEM_SYNC
       }
 
        (SHM operations)
 
        WIN_UNLOCK (target=1) {
            MEM_SYNC
            PROC_SYNC (lock released)
        }
 
        PROC_SYNC -------------------- PROC_SYNC
 
                                       WIN_LOCK(target=1) {
                                           PROC_SYNC (lock granted)
                                           MEM_SYNC
                                       }
  
                                       (SHM operations)
  
                                       WIN_UNLOCK(target=1) {
                                           MEM_SYNC
                                           PROC_SYNC (lock released)
                                       }
  

We need MEM_SYNC after PROC_SYNC in WIN_LOCK, and MEM_SYNC before PROC_SYNC in WIN_UNLOCK, to ensure the ordering between SHM operations and PROC_SYNC and vice versa.

In WIN_LOCK, the MEM_SYNC after PROC_SYNC guarantees two things: (a) that the following SHM operations will happen after the lock is granted; and (b) that the following SHM operations will happen after any PROC_SYNC with the target that occurred before WIN_LOCK was called, which means those SHM operations will see the latest data on the target process.

In WIN_UNLOCK, the MEM_SYNC before PROC_SYNC also guarantees two things: (a) that the SHM operations will happen before the lock is released; and (b) that the SHM operations will happen before any PROC_SYNC with the target that occurs after WIN_UNLOCK returns, which means subsequent SHM operations on that target will see the latest data.

WIN_LOCK_ALL/UNLOCK_ALL behave the same as WIN_LOCK/UNLOCK.

             RANK 0                          RANK 1
 
        WIN_LOCK_ALL
 
        (SHM operations)
 
        WIN_FLUSH(target=1) {
            MEM_SYNC
        }
 
        PROC_SYNC ----------------------- PROC_SYNC
 
                                          WIN_LOCK(target=1) {
                                              PROC_SYNC (lock granted)
                                              MEM_SYNC
                                          }
 
                                          (SHM operations)
 
                                          WIN_UNLOCK(target=1) {
                                              MEM_SYNC
                                              PROC_SYNC (lock released)
                                          }
 
        WIN_UNLOCK_ALL

We need MEM_SYNC in WIN_FLUSH to ensure the ordering between SHM operations and PROC_SYNC.

The MEM_SYNC in WIN_FLUSH guarantees that all SHM operations before this WIN_FLUSH will happen before any PROC_SYNC with the target after this WIN_FLUSH, which means that SHM operations on the target process issued after the PROC_SYNC with the origin will see the latest data.