[[Category:Design Documents]]
== Goals ==

The goals of the RMA infrastructure are as follows:

# No usage of O(P) data structures, where 'P' is the number of processes. Data structure sizes should be constant and user-configurable. The general rule is that the user should be able to use more memory to achieve better performance.
# Reduce redundant messages where possible (e.g., extra acknowledgment packets).
# Reduce process synchronization where possible (e.g., FENCE need not always be synchronizing).
# Ensure code is thread-safe (e.g., avoid thread-private data structures).
# Ensure integration of RMA code into the overall MPICH progress engine.
# Use hardware support for RMA.
# Resource management to prevent using up all resources.
 
== Data structures ==

'''1. Overview'''

We use a new 3D data structure, the '''Operation Table''', to store posted RMA operations.

[[File:Op-list-slots.jpg]]
There are the following kinds of data structures involved:

'''(1) Operation Element'''[[File:Circle.jpg]]: contains all origin/target information needed for this RMA operation, plus a request handle that is set when the operation is issued (and has not completed immediately).  When the user posts a new operation, the runtime creates and enqueues a new operation structure on the corresponding operation list; when the operation completes, the runtime dequeues the operation structure from the list and frees it.

If req is a NULL pointer, this operation is a PENDING operation; if req is a non-NULL pointer, this operation is an ISSUED operation.
  
Note that the operation list is currently a doubly linked list: the prev pointer of the head points to the tail, and the next pointer of the tail is NULL. We may need to make the op list a singly linked list and store the tail pointer in the target element.
  
 
   struct OP_ELEM {
     all OP information;
     MPID_Request *req;
     struct OP_ELEM *next, *prev;
   }
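
To make the list convention concrete, here is a minimal C sketch of an O(1) append under that convention; the field and function names are ours, not the actual MPICH identifiers:

<source lang="c">
#include <stddef.h>

/* Illustrative stand-in for the MPICH operation element. */
struct OP_ELEM {
    void *op_info;              /* stands in for "all OP information" */
    void *req;                  /* NULL => PENDING, non-NULL => ISSUED */
    struct OP_ELEM *prev, *next;
};

/* Append op in O(1): the head's prev pointer caches the tail, and the
 * tail's next pointer stays NULL, matching the convention above. */
static void op_list_append(struct OP_ELEM **head, struct OP_ELEM *op)
{
    op->next = NULL;
    if (*head == NULL) {
        op->prev = op;          /* single element: it is its own tail */
        *head = op;
    } else {
        struct OP_ELEM *tail = (*head)->prev;
        tail->next = op;
        op->prev = tail;
        (*head)->prev = op;     /* update the cached tail */
    }
}
</source>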
  
(From Prof. Gropp's comments) We may need to consider a lightweight request design for RMA. The current request design has high overhead: for example, we should not always initialize/erase all fields when creating/freeing a request, but only the fields we actually need.

'''(2) Target Element'''[[File:Rhombus.jpg]]: contains a pointer to an operation list that stores all RMA operations to the same target.  It also contains the epoch state of the target in cases where a per-target epoch is used (details below).  When the origin first communicates with a target, MPICH creates and enqueues a new target element on the corresponding target list.  When the origin finishes communication with that target, or all internal resources for targets are used up, MPICH dequeues and frees the target element.
  
 
   struct TARGET_ELEM {
     struct OP_ELEM *op_list;
     struct OP_ELEM *next_op_to_issue;
     struct TARGET_ELEM *next;
     int rank;
     int ack_counter;  // for passive target
     int op_count;
     int sync_flag;    // UNLOCK/FLUSH/FLUSH_LOCAL, for piggybacking
     int lock_state;   // LOCK_CALLED/LOCK_ISSUED/LOCK_GRANTED
     int lock_type;    // SHARED/EXCLUSIVE
     int lock_mode;    // MODE_NO_CHECK
   }
  
'''(3) Slot'''[[File:Slot.jpg]]: contains a pointer to a target list.  Distribution of targets among slots is currently round-robin.  During window creation, MPICH allocates a fixed-size slot array on the window (the size of the slot array is controllable by the user).

   struct SLOT {
     struct TARGET_ELEM *target_list;
   }

'''(4) Window'''[[File:win.jpg]]: contains at least the following information:

   struct WIN {
     int win_state;           // window state
     int mode;                // for MODE_NOCHECK in the LOCK_ALL case
     int ack_counter;         // for active target
     int num_nonempty_slots;  // how many slots have a non-NULL target list, to avoid traversing all slots
   }
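
As an illustration of how the slot array is used, the following hedged C sketch looks up the target element for a given rank. It assumes "round-robin" means distributing target ranks over the slots by modulo, and that the window additionally stores the slot array (slots) and its size (num_slots); the names are ours, not the actual MPICH fields:

<source lang="c">
#include <stddef.h>

/* Hypothetical lookup building on the TARGET_ELEM/SLOT/WIN sketches
 * above; the real MPICH code may structure this differently. */
static struct TARGET_ELEM *find_target(struct WIN *win, int target_rank)
{
    /* round-robin distribution: rank modulo number of slots */
    struct SLOT *slot = &win->slots[target_rank % win->num_slots];

    for (struct TARGET_ELEM *t = slot->target_list; t != NULL; t = t->next)
        if (t->rank == target_rank)
            return t;
    return NULL;   /* no target element for this rank at the moment */
}
</source>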
 
  
 
'''2. Performance issues'''

# The garbage collection function can easily distinguish between operations that are not issued and operations that are issued but not completed yet.
  
== Operation and Target Element Pools ==

We use two-level element pools for the operation elements and the target elements: '''window (local)''' and '''global'''.  The '''local''' pool resources can only be used by operations within the window, while the '''global''' pool resources are shared by all windows.  The sizes of both the '''local''' and '''global''' portions of each pool are fixed and can be configured by the user at runtime.

When a new element is needed, we first check if one is available in the corresponding '''local''' pool.  If the '''local''' pool is empty, we try to find one in the '''global''' pool.  If even the '''global''' pool is empty, we call the '''CLEANUP_WIN_AGGRESSIVE''' function (see [https://wiki.mpich.org/mpich/index.php/New_RMA_Design#Basic_routines Basic routines]) to free up existing elements.  When we are done using an element, we first return it to the '''local''' pool.  If the '''local''' pool is full, we return it to the '''global''' pool.

This model ensures that a window can never be starved because other windows are using up all the resources.

TODO: We should investigate allocating the op/target pools in shared memory and explore whether we can make progress on behalf of other processes.
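
A minimal sketch of the allocation/free paths described above, assuming a simple free-list pool; the pool type, the local_op_pool field, the OP_ELEM_RESOURCE constant, and the helpers are ours (locking is omitted for brevity):

<source lang="c">
#include <stddef.h>

/* Minimal free-list pool; the real MPICH pools may differ. */
struct op_pool {
    struct OP_ELEM *head;   /* singly linked free list */
    int count, capacity;    /* capacity fixed at init, user-configurable */
};

static struct op_pool global_op_pool;   /* shared by all windows */

static struct OP_ELEM *pool_pop(struct op_pool *p)
{
    struct OP_ELEM *e = p->head;
    if (e != NULL) { p->head = e->next; p->count--; }
    return e;
}

/* Returns 0 when the pool is already full (element not taken). */
static int pool_push(struct op_pool *p, struct OP_ELEM *e)
{
    if (p->count == p->capacity) return 0;
    e->next = p->head; p->head = e; p->count++;
    return 1;
}

/* Allocation: local pool first, then global, then aggressive cleanup. */
static struct OP_ELEM *alloc_op_elem(struct WIN *win)
{
    struct OP_ELEM *op = pool_pop(&win->local_op_pool);
    if (op == NULL) op = pool_pop(&global_op_pool);
    if (op == NULL) {
        CLEANUP_WIN_AGGRESSIVE(OP_ELEM_RESOURCE);  /* blocks until an element is freed */
        op = pool_pop(&win->local_op_pool);
    }
    return op;
}

/* Free: local pool first, overflow to the global pool. */
static void free_op_elem(struct WIN *win, struct OP_ELEM *op)
{
    if (!pool_push(&win->local_op_pool, op))
        pool_push(&global_op_pool, op);
}
</source>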
  
 
== Epoch States ==

There are nine '''window states''':

# MPIDI_RMA_NONE
# MPIDI_RMA_FENCE_ISSUED
# MPIDI_RMA_FENCE_GRANTED
# MPIDI_RMA_PSCW_ISSUED
# MPIDI_RMA_PSCW_GRANTED
# MPIDI_RMA_LOCK_ALL_CALLED
# MPIDI_RMA_LOCK_ALL_ISSUED
# MPIDI_RMA_LOCK_ALL_GRANTED
# MPIDI_RMA_PER_TARGET

There are three '''per-target states''':

# MPIDI_RMA_LOCK_CALLED
# MPIDI_RMA_LOCK_ISSUED
# MPIDI_RMA_LOCK_GRANTED

Note that the '''per-target state''' is used only when the '''window state''' is either '''MPIDI_RMA_LOCK_ALL_CALLED''' or '''MPIDI_RMA_PER_TARGET'''.
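
For reference, the two state sets can be written as C enumerations. This is only a restatement of the lists above; the actual MPICH definitions may differ in placement and values:

<source lang="c">
enum MPIDI_RMA_win_state {
    MPIDI_RMA_NONE,
    MPIDI_RMA_FENCE_ISSUED,
    MPIDI_RMA_FENCE_GRANTED,
    MPIDI_RMA_PSCW_ISSUED,
    MPIDI_RMA_PSCW_GRANTED,
    MPIDI_RMA_LOCK_ALL_CALLED,
    MPIDI_RMA_LOCK_ALL_ISSUED,
    MPIDI_RMA_LOCK_ALL_GRANTED,
    MPIDI_RMA_PER_TARGET
};

/* Meaningful only when the window state is MPIDI_RMA_LOCK_ALL_CALLED
 * or MPIDI_RMA_PER_TARGET. */
enum MPIDI_RMA_lock_state {
    MPIDI_RMA_LOCK_CALLED,
    MPIDI_RMA_LOCK_ISSUED,
    MPIDI_RMA_LOCK_GRANTED
};
</source>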
  
 
== Basic routines ==

1. '''ISSUE_OPS_TARGET / ISSUE_OPS_WIN / ISSUE_OPS_GLOBAL''': Nonblocking calls. They try to issue as many pending operations as possible for one target / one window / all windows, respectively. They return the number of operations issued.

Note that ISSUE_OPS_WIN issues operations in a round-robin fashion, to prevent overloading one target with many messages at once.

2. '''CLEANUP_TARGET / CLEANUP_WIN / CLEANUP_GLOBAL''': Nonblocking calls. They try to find completed operations and targets and clean up as many of them as possible for one target / one window / all windows, respectively. They return the number of operations plus targets cleaned up.  For active target, these functions also clean up empty target elements.  Note that for passive target, empty target elements are '''not''' cleaned up by these functions; they need to be cleaned up by the corresponding packet handler.  This is because in passive target the user might also issue a per-target FLUSH operation, in which case we need to know how many flush acknowledgments we are waiting for from a particular target.

3. '''MAKE_RMA_PROGRESS_TARGET / MAKE_RMA_PROGRESS_WIN / MAKE_RMA_PROGRESS_GLOBAL''': Nonblocking calls. They try to issue pending operations and clean up issued operations for one target / one window / all windows, respectively.

   MAKE_RMA_PROGRESS_GLOBAL() {
     do {
       x = ISSUE_OPS_GLOBAL();
       y = CLEANUP_GLOBAL();
     } while (x + y != 0);
   }

MAKE_RMA_PROGRESS_GLOBAL is called from the progress engine; MAKE_RMA_PROGRESS_WIN is called in multiple-target cases (FENCE/PSCW/LOCK_ALL/FLUSH_ALL); MAKE_RMA_PROGRESS_TARGET is called in single-target cases (LOCK/FLUSH).
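
To illustrate the round-robin issuing in ISSUE_OPS_WIN, here is a hedged C sketch that issues at most one operation per target per sweep rather than draining one target's list at a time; issue_one_op() and the slots/num_slots fields are assumptions, not the actual MPICH code:

<source lang="c">
/* Sketch only: each pass over the slots issues at most one op per
 * target, so no single target is flooded before others get a chance. */
static int ISSUE_OPS_WIN(struct WIN *win)
{
    int issued = 0, progress = 1;

    while (progress) {
        progress = 0;
        for (int s = 0; s < win->num_slots; s++) {
            for (struct TARGET_ELEM *t = win->slots[s].target_list;
                 t != NULL; t = t->next) {
                if (t->next_op_to_issue == NULL)
                    continue;                 /* nothing pending here */
                if (issue_one_op(win, t) != 0)
                    continue;                 /* could not issue right now */
                t->next_op_to_issue = t->next_op_to_issue->next;
                issued++;
                progress = 1;
            }
        }
    }
    return issued;   /* caller alternates with CLEANUP until no work remains */
}
</source>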
  
4. '''PROGRESS_WAIT''': Blocking call. It keeps poking the progress engine until a completion signal is caught in the progress engine. Note that in the multithreaded case, a thread blocked in this function may yield the CPU to other threads.

5. '''CLEANUP_WIN_AGGRESSIVE''': Blocking call. It tries to make progress and poke the progress engine until resources become available.

   CLEANUP_WIN_AGGRESSIVE(resource_type) {
     if (window_state == FENCE_ISSUED) {
       do PROGRESS_WAIT until the resource is available or window_state == FENCE_GRANTED;
       if (resource is available)
         return;
     }
    
     if (window_state == PSCW_ISSUED) {
       do PROGRESS_WAIT until the resource is available or window_state == PSCW_GRANTED;
       if (resource is available)
         return;
     }
    
     if (window_state == LOCK_ALL_ISSUED) {
       do PROGRESS_WAIT until the resource is available or window_state == LOCK_ALL_GRANTED;
       if (resource is available)
         return;
     }
    
     if (window_state == FENCE_GRANTED || window_state == PSCW_GRANTED) {
       pick one Target Element K;  // might differ based on resource type
       K->sync_flag = FLUSH_LOCAL;
       call PROGRESS_WAIT until the resource is available;
       return;
     }
    
     if (window_state == PER_TARGET || window_state == LOCK_ALL_CALLED || window_state == LOCK_ALL_GRANTED) {
       pick one Target Element K in priority order LOCK_GRANTED, LOCK_ISSUED, LOCK_CALLED;  // might differ based on resource type
       if (resource_type == op element)
         K->sync_flag = FLUSH_LOCAL;
       else  // target element is only freed in the packet handler for passive target
         K->sync_flag = FLUSH;
       call PROGRESS_WAIT until the resource is available;
       return;
     }
   }

Note that for multiple-target cases (FENCE/PSCW) we only do a FLUSH_LOCAL, because we can track the total number of RMA_DONE packets in the window ('''win_ptr->ack_counter''') and wait for all of them in the closing synchronization call.  However, we cannot do the same thing for passive-target epochs, because we need to know how many RMA_DONE_ACK packets we are expecting from a particular target if the user calls a FLUSH operation.
== Algorithm for FENCE ==

[[File:Fence-states.jpg]] [[File:Legends.jpg]]

Note that there is a request pointer on the window, called '''fence_req''', which stores the request of the IBARRIER.  For details on the memory barrier semantics, see [https://wiki.mpich.org/mpich/index.php/New_RMA_Design#RMA_.2B_shared_memory RMA + shared memory].
  
 
   FENCE() {
     ...
    
     // Synchronize as needed
     if (MPI_MODE_NOPRECEDE) { // No operations to complete
       if (SHM is allocated) perform a local BARRIER;  // So that we can issue local operations immediately
       do a global IBARRIER, store the request to fence_req on window;
       set window state to FENCE_ISSUED;
     }
     else { // some operations need to be completed
       ...
       do PROGRESS_WAIT until I have received RMA_DONE_ACK packets from all active targets (decrement ack_counter, wait until it reaches 0);
       do a global IBARRIER;
       set window state to FENCE_ISSUED; // I am done with all my outgoing messages
       do PROGRESS_WAIT until the IBARRIER is completed;  // fence_req on window is freed now
       set window state to FENCE_GRANTED; // I know that everyone else is done
     }
     if (MPI_MODE_NOSUCCEED) { // No operations will start after this
       set window state to UNSET;
     }
   }
  
 
   RMA_OP() {
     // Issue local operations immediately
     if (SHM is allocated && target is on the same node) {
        issue operation directly;
        return;
     }
    
     // Non-local operation
     do {
        if (window state is FENCE_GRANTED) {
           if (#ops <= PIGGYBACK_THRESHOLD)
              queue up operation and break;  // might need to wait for target and op elements to be available
           else
              issue operation and break;  // if issue didn't complete, might need to queue up request
        }
        else if (window state is FENCE_ISSUED) {
           do PROGRESS_WAIT until IBARRIER has completed || (we have a target element && an op element);
           if (IBARRIER completed) {
              set window state to FENCE_GRANTED;
              continue;
           }
           else {
              queue up operation and break;  // might need to wait for target and op elements to be available
           }
        }
     } while (1);
   }
  
Note that if the very last FENCE is not called with MPI_MODE_NOSUCCEED, the window state remains FENCE_GRANTED; other RMA synchronizations will transition the window state directly to the correct state.  If the very last FENCE is called with MPI_MODE_NOPRECEDE, the window state remains FENCE_ISSUED; other RMA synchronizations will likewise transition the window state directly to the correct state.  However, in that case fence_req on the window has not been freed yet; it is freed either when we encounter another MPI_WIN_FENCE or when we reach MPI_WIN_FREE.
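
From the user's perspective, the assert arguments map directly onto these state transitions. The following standard-MPI example (application code, not implementation code) shows a typical epoch where the first fence takes the MPI_MODE_NOPRECEDE path and the last takes the MPI_MODE_NOSUCCEED path:

<source lang="c">
#include <mpi.h>

void fence_epoch(MPI_Win win, int target, double *buf)
{
    /* nothing precedes this fence: the window state goes to
     * FENCE_ISSUED without waiting for remote completion */
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

    MPI_Put(buf, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    /* operations precede and none follow: complete them, after which
     * the window state is set back to UNSET */
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
}
</source>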
 
  
 
== Algorithm for POST-START-COMPLETE-WAIT ==

The algorithm for PSCW is roughly the same as the algorithm for FENCE, except for the following points:

(1) Every place we did an IBARRIER in the FENCE algorithm, we now do ISENDs (on the target) and IRECVs (on the origin) among the group of processes calling PSCW (see the sketch below).

(2) In WIN_POST, if SHM is allocated, the targets wait until all local ISENDs have completed; similarly, in WIN_START, if SHM is allocated, the origins wait until all local IRECVs have completed.

(3) Every place we waited for the IBARRIER request to complete in the FENCE algorithm, we now wait for an array of requests (from the ISENDs/IRECVs) to complete instead.  The request array is allocated on the window.  The size of this array equals the process group size used in PSCW, which can be large if the application uses PSCW with a large number of processes.  However, such applications should be using FENCE instead of PSCW.

(From Prof. Gropp's comments) For medium-scale applications, it is not good to always do direct send/recv for synchronization. We should consider algorithms (such as tree-based algorithms) to optimize such cases.

Note that all of these states apply only to origin processes; for target processes, the state is always MPIDI_RMA_NONE.
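
A hedged sketch of point (1), replacing the IBARRIER with point-to-point synchronization among the PSCW group; the tag, helper names, and request-array handling are illustrative, not the actual MPICH code:

<source lang="c">
#include <mpi.h>

#define PSCW_SYNC_TAG 999   /* illustrative tag */

/* WIN_POST on a target: notify every origin in the post group. */
static void pscw_post_sync(MPI_Comm comm, const int *origins, int n,
                           MPI_Request *reqs)
{
    for (int i = 0; i < n; i++)
        MPI_Isend(NULL, 0, MPI_BYTE, origins[i], PSCW_SYNC_TAG, comm, &reqs[i]);
}

/* WIN_START on an origin: post a receive from every target in the
 * start group. */
static void pscw_start_sync(MPI_Comm comm, const int *targets, int n,
                            MPI_Request *reqs)
{
    for (int i = 0; i < n; i++)
        MPI_Irecv(NULL, 0, MPI_BYTE, targets[i], PSCW_SYNC_TAG, comm, &reqs[i]);
}

/* Wherever FENCE waited on its single IBARRIER request, PSCW waits on
 * the whole request array instead. */
static void pscw_wait_sync(int n, MPI_Request *reqs)
{
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
}
</source>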
  
 
== Algorithm for LOCK-UNLOCK ==

In this algorithm, a target element might be created in two situations: (1) in the WIN_LOCK call, the target element might be created for the first time (it is created in the LOCK_CALLED state), or (2) in an RMA operation or WIN_UNLOCK call, the target element might need to be recreated because it was cleaned up when someone else needed the resource (it is created in the LOCK_GRANTED state).

When the target element is recreated in the second case above, we lose lock information such as the type of lock (SHARED/EXCLUSIVE) and the mode (e.g., MODE_NOCHECK) for this target. The lock type (SHARED/EXCLUSIVE) is not necessary, because we do not need it once we are in the LOCK_GRANTED state.  However, the loss of the lock mode (MODE_NOCHECK) can hurt performance.  This is because, when MODE_NOCHECK is set, we do not acquire the lock and directly set the epoch state to LOCK_GRANTED; since we lost this information in the recreation, we might end up sending an UNLOCK packet even when the target was not explicitly locked.  In this case, the target is expected to process the UNLOCK message and send back an acknowledgment to the origin.

Note that we free the target element only after we have sent the RMA_DONE message and received the RMA_DONE_ACK message from that target.
  
 
[[File:Lock-states.jpg]] [[File:Legends.jpg]]

   WIN_LOCK() {
     set window state to PER_TARGET;
     if (target_rank == my_rank) {
       do PROGRESS_WAIT until I got the lock;
       return;
     }
     if (SHM is allocated && target is a local process) {
       send lock request message;
       do PROGRESS_WAIT until receiving lock granted message;
       return;
     }
     create Target Element and queue it up;  // might need to restore target elements if none are available
     if (MPI_MODE_NOCHECK)
       set per-target state to LOCK_GRANTED;
     else
       set per-target state to LOCK_CALLED;
   }
   RMA_OP() {
     // Issue local operations immediately
     if (SHM is allocated && target is on the same node) {
        issue operation directly;
        return;
     }
    
     // Non-local operation
     do {
        if (target queue does not exist || target state is LOCK_GRANTED) {
           if (#ops <= PIGGYBACK_THRESHOLD)
              queue up operation and break;  // might need to wait for target and op elements to be available
           else
              issue operation and break;  // if issue didn't complete, might need to queue up request
        }
        else if (target state is LOCK_ISSUED) {
           do PROGRESS_WAIT until state becomes LOCK_GRANTED || (we have a target element && an op element);
           if (state is LOCK_GRANTED) {
              continue;
           }
           else {
              queue up operation and break;  // might need to wait for target and op elements to be available
           }
        }
        else if (target state is LOCK_CALLED) {
           if (op list is empty && basic datatype && size <= single_op_threshold)
              queue up operation and break;  // single operation optimization; might need to wait for target and op elements
           else {
              issue lock operation;
              set state to LOCK_ISSUED;
              continue;
           }
        }
     } while (1);
   }
  
 
   WIN_UNLOCK() {
     if (target element for this target does not exist)
        create Target Element and queue it up;  // might need to restore target elements if none are available
    
     if (per-target state is LOCK_CALLED) { // single operation optimization
       issue LOCK-OP-UNLOCK to this target;
       do PROGRESS_WAIT until RMA_DONE_ACK is received;
       free this Target Element;
       set window state to UNSET;
       return;
     }
     if (per-target state is LOCK_ISSUED) {
       do PROGRESS_WAIT until lock granted message is received;
       set per-target state to LOCK_GRANTED;
     }
     if (per-target state is LOCK_GRANTED) {
       do MAKE_RMA_PROGRESS to issue all operations in the Operation Table;
       issue (or piggyback) RMA_DONE+UNLOCK packet to target;
       do PROGRESS_WAIT until operation list for this target is empty;  // all operations are completed
       do PROGRESS_WAIT until I have received the RMA_DONE_ACK packet from the target;
       free this Target Element;
       set window state to UNSET;
     }
   }
== Algorithm for LOCK_ALL-UNLOCK_ALL ==

When the application issues a WIN_LOCK_ALL call, we set the window state to LOCK_ALL_CALLED. No additional messages are issued at this point.

We use two different protocols for LOCK_ALL-UNLOCK_ALL:

(1) '''Per-target protocol''': When the application issues an RMA op to a target, we lazily issue a lock message to that target and set that target's state to LOCK_ISSUED.  This protocol has the advantage that no unnecessary lock messages are issued unless the origin talks to a target process.  However, it only works while we have sufficient target element resources available.  Once we are out of target element resources, we cannot free an existing target element, because once it is freed, if we see another RMA operation to that target we cannot tell whether we already issued the lock message to that target or not.

(2) '''Window protocol''': When the application runs out of target element resources in the '''per-target protocol''', we fall back to the '''window protocol''': the window state is changed to LOCK_ALL_ISSUED, a lock operation is issued to all targets to which a lock operation has not yet been issued, and we wait until the lock acknowledgments from all of these targets arrive. After this the window state is set to LOCK_ALL_GRANTED, at which point we hold a lock on all targets.  In this state, target element resources can be freed as needed.  While the '''window protocol''' adds more lock messages and more synchronization among processes, it is more scalable with respect to resources.

Note that switching from the '''per-target''' protocol to the '''window''' protocol needs to be handled without requiring the allocation of additional operation or target elements. This can be done by issuing a constant number of lock operations at a time and keeping track of the associated requests on the function stack (in whichever function triggered the state change from LOCK_ALL_CALLED to LOCK_ALL_ISSUED). Once these requests complete, we issue the next batch, until all locks are issued and granted (see the sketch below).

Also note that when WIN_LOCK_ALL with MODE_NOCHECK is called, we go directly into the '''window protocol''' (but do not need to wait for lock granted messages). The window state is set to LOCK_ALL_GRANTED immediately.
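
A minimal sketch of this batched fallback, under the assumption of hypothetical helpers lock_already_issued() and issue_lock_req(); the batch size, helper names, and the use of MPI_Request for internal lock packets are ours, not the actual MPICH code:

<source lang="c">
#include <mpi.h>

#define LOCK_BATCH 16   /* constant number of in-flight lock requests */

/* Transition LOCK_ALL_CALLED -> LOCK_ALL_ISSUED -> LOCK_ALL_GRANTED
 * without allocating new operation or target elements: the request
 * array lives on the function stack. */
static void lock_all_fallback(struct WIN *win, int nprocs)
{
    MPI_Request reqs[LOCK_BATCH];

    win->win_state = MPIDI_RMA_LOCK_ALL_ISSUED;
    for (int base = 0; base < nprocs; base += LOCK_BATCH) {
        int n = 0;
        for (int r = base; r < nprocs && n < LOCK_BATCH; r++)
            if (!lock_already_issued(win, r))
                issue_lock_req(win, r, &reqs[n++]);
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);   /* wait for this batch */
    }
    win->win_state = MPIDI_RMA_LOCK_ALL_GRANTED;     /* all locks granted */
}
</source>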
  
 
[[File:Lock-all-states.jpg]] [[File:Legends.jpg]]

(Simplified version of the pseudocode)

   LOCK_ALL() {
     if (MPI_MODE_NOCHECK)
       set window state to LOCK_ALL_GRANTED;
     else
       set window state to LOCK_ALL_CALLED;
   }
  
 
   RMA_OPS() {
     if (window state is LOCK_ALL_CALLED) { // per-target protocol
        if (a target element is available) {
            // follow the LOCK/UNLOCK protocol
            MPI_WIN_LOCK(target);
            call RMA_OP on that target;
            return;
        }
        else {
            // fall back to the window protocol
            change window state to LOCK_ALL_ISSUED;
            issue a lock to all targets to which it was not issued;
            wait for all lock acks to come back;
            change window state to LOCK_ALL_GRANTED;
            if (#ops <= PIGGYBACK_THRESHOLD)
               queue up operation;  // might need to wait for target and op elements to be available
            else
               issue operation;  // if issue didn't complete, might need to queue up request
        }
     }
   }
  
 
   UNLOCK_ALL() {
     if (window state is LOCK_ALL_CALLED) { // per-target protocol
        call MAKE_RMA_PROGRESS to issue all operations in active Target Elements;  // for targets whose lock is not yet issued/granted, wait for the lock to be granted
        issue (or piggyback) RMA_DONE+UNLOCK to active targets;
        call PROGRESS_WAIT until all operations are completed;
        call PROGRESS_WAIT until RMA_DONE_ACK messages from all active targets have arrived;
        free all Target Elements;
        set window state to UNSET;
     }
    
     if (window state is LOCK_ALL_GRANTED) { // window protocol
       call MAKE_RMA_PROGRESS to issue all operations in the Operation Table;
       issue (or piggyback) RMA_DONE+UNLOCK to every process on the window;
       call PROGRESS_WAIT until all operations are completed;
       call PROGRESS_WAIT until RMA_DONE_ACK messages from all processes have arrived;
       free all Target Elements;
       set window state to UNSET;
     }
   }
Line 441: Line 423:
 
== Algorithm for FLUSH ==
 
== Algorithm for FLUSH ==
  
WIN_FLUSH is roughly the same with WIN_UNLOCK, except that it doesn't issue (or piggyback) UNLOCK message, but only issue (or piggyback) RMA_DONE message. Also it does not change any global state.
+
WIN_FLUSH is roughly the same with WIN_UNLOCK, except that it doesn't issue (or piggyback) UNLOCK message, but only issues (or piggyback) RMA_DONE message. Also it does not change any window state.
  
== Performance optimization techniques ==
+
When WIN_FLUSH is used with the window state PER_TARGET, if there is no target element, it does not send an RMA_DONE packet because all operations for this target are guaranteed to be completed when the target element was freed.
  
1. Single operation optimization: TODO
+
When WIN_FLUSH is used with window state LOCK_ALL_CALLED / LOCK_ALL_GRANTED, if there is no target element, it will send an RMA_DONE packet and wait for an RMA_DONE_ACK packet, because RMA operations are not guaranteed to have completed on the target.
  
== Performance optimizations ==

# '''Single short operation optimization''': if there is only one operation between WIN_LOCK and WIN_UNLOCK, with a basic datatype and a size smaller than single_op_opt_threshold, we piggyback both the LOCK and UNLOCK messages on this operation.  Specifically, we do not issue the lock request packet or the unlock packet; instead, we send a single packet at WIN_UNLOCK time that contains the lock request, the operation data, and the unlock flag.  We need to wait for the acknowledgment to come back before returning from WIN_UNLOCK.
# '''Piggyback RMA_DONE (UNLOCK) packet''': If there are GET operations in the operation list, we move them to the tail and piggyback the RMA_DONE (or UNLOCK) flag on that message.  In such cases, the data returned by the GET operation also piggybacks the RMA_DONE_ACK (or UNLOCK_ACK) flag.  If there are no GET operations in the operation list, we piggyback the RMA_DONE packet on the last operation and wait for a separate RMA_DONE_ACK packet to arrive.  Specifically, we always track the tail of the operation list before reaching the ending synchronization and update it as needed when we encounter a new RMA operation: when the tail is already a GET/GET_ACCUM/CAS/FOP operation, we do not need to update it anymore; when the tail is a PUT/ACC operation and we encounter a GET operation, we update the tail with that GET operation and need not update it further; if we instead encounter a non-GET operation, we also update the tail with that operation, because we need to guarantee the ordering of operations (see the sketch after this list).  (If the user application does not care about the RMA_DONE packet, it should be able to provide MPI info hints telling the runtime not to always wait for the RMA_DONE packet to come back.)
# '''Lazy deallocation of target elements''': In passive target communication, the target structure holds additional information such as the mode of the lock (e.g., MPI_MODE_NOCHECK).  This information allows us to decide whether to send an additional UNLOCK message or not.  However, if the target element is freed, this information is lost, and we might have to fall back to a conservative model of sending UNLOCK packets even when they are not needed.  The lazy deallocation optimization keeps these elements allocated even when all of the associated operations in the queue have been issued and freed.  The target element is freed only when we run out of resources or when we close the epoch.
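
A sketch of the tail-update rule from item 2; the piggyback_tail field and the is_get_class() test are illustrative names, not the actual MPICH code:

<source lang="c">
/* Keep a GET-class operation (GET/GET_ACCUMULATE/CAS/FOP) at the
 * piggyback tail whenever possible, so the DONE flag can ride on a
 * message whose response already carries data back to the origin. */
static void update_piggyback_tail(struct TARGET_ELEM *t, struct OP_ELEM *new_op)
{
    if (t->piggyback_tail != NULL && is_get_class(t->piggyback_tail)) {
        /* tail already returns data: its response can carry the
         * RMA_DONE_ACK flag, so no further updates are needed */
        return;
    }
    /* tail is PUT/ACC (or unset): adopt the new op as the tail, GET or
     * not, to preserve the issue order of the operations */
    t->piggyback_tail = new_op;
}
</source>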
  
Note that the piggyback optimization carries a risk of deadlock when every process queues up operations and uses up all operation resources. However, the local operation pool avoids this situation: when we use up all operation resources, we call MAKE_RMA_PROGRESS_TARGET to issue the queued operations, and then call PROGRESS_WAIT until they finish and we can get operation resources back from the local pool. The operation free function always puts the operation back into the local pool first; only when the local pool is full does it put the operation into the global pool.

A key shortcoming of all of these performance optimizations is that they hold up resources that would typically only be freed by user intervention.  That is, without these optimizations, if a process waited in the progress engine, it could reclaim resources within a finite amount of time (assuming other processes in the system are making MPI calls).  With these optimizations, this might not be true.  For example, if the system has operation elements queued up waiting for an UNLOCK packet (in order to do the '''single short operation optimization''' mentioned above), those operation elements will not be freed until the user issues the UNLOCK operation (or a flush operation).  To handle such situations, we need to distinguish the cases of resource exhaustion in the implementation and temporarily disable the appropriate optimizations to reclaim resources.
  
 
== RMA + threads ==

PROGRESS_WAIT can only happen in the main function calls associated with RMA operations and RMA synchronizations, and we need to make sure that those RMA calls work correctly with multithreading. The interleaving situations that need to be considered are as follows:

1. PUT/GET/ACC + PUT/GET/ACC

2. PUT/GET/ACC + WIN_FENCE (no MODE_NOSUCCEED and no MODE_NOPRECEDE)

3. PUT/GET/ACC + WIN_FLUSH

4. WIN_FLUSH + WIN_FLUSH

All other interleaving situations are invalid.

TODO: a more careful and detailed definition and clarification of thread interleaving is needed.

'''Enabling multithreading in WIN_FLUSH''': To achieve this, we piggyback the FLUSH_DONE (i.e., RMA_DONE) flag on the last packet. Specifically, each thread (say T1) creates a response request on the origin and passes its handle in the last packet. The completion of that response request is handed over to the progress engine, like other operations. The response request is completed and freed on the origin only when the FLUSH_DONE_ACK (i.e., RMA_DONE_ACK) packet arrives, which indicates the completion of WIN_FLUSH on thread T1. Note that the thread that creates the response request (T1) and the thread that completes it (say T2) are not necessarily the same. As long as the response request created by T1 gets completed, WIN_FLUSH on T1 can return.
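
A hedged sketch of that handshake; the request-creation and packet-field helpers below are assumptions for illustration, not actual MPICH functions:

<source lang="c">
/* WIN_FLUSH on thread T1: piggyback FLUSH_DONE (RMA_DONE) on the last
 * packet and block on a per-call response request instead of a shared
 * FLUSH state, so concurrent flushes from other threads stay safe. */
static void win_flush_mt(struct WIN *win, struct TARGET_ELEM *t)
{
    void *resp_req = create_response_request();   /* hypothetical */
    attach_done_flag(t, resp_req);   /* handle travels in the last packet */

    /* Any thread's progress may receive FLUSH_DONE_ACK (RMA_DONE_ACK)
     * and complete this request; T1 only waits on its own request. */
    while (!request_is_complete(resp_req))
        PROGRESS_WAIT(win);          /* may yield the CPU to other threads */

    free_response_request(resp_req);
}
</source>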
  
 
== RMA + shared memory ==

3. '''Passive target synchronization'''

Note that for passive target synchronization, when a process wants to access its own window memory, it must do those accesses within a passive epoch (making the local loads/stores become SHM operations), or use MPI_WIN_SYNC before and after the local loads/stores; otherwise those local accesses may be reordered and lead to wrong results (clarification is needed for this in the MPI specification). See the following for details.
  
 
The MEM_SYNC in WIN_FLUSH guarantees that all SHM operations before the WIN_FLUSH happen before any PROC_SYNC with the target after the WIN_FLUSH, which means that SHM operations on the target process after the PROC_SYNC with the origin will see the latest data.
The user can also use MPI_WIN_SYNC in passive target synchronization to guarantee the ordering of load/store operations.

              RANK 0                          RANK 1
    
                                          WIN_SYNC {
                                              MEM_SYNC
                                          }
    
                                        (local loads/stores)
    
                                          WIN_SYNC {
                                              MEM_SYNC
                                          }
    
        PROC_SYNC -------------------- PROC_SYNC
    
        WIN_LOCK (target=1) {
            PROC_SYNC (lock granted)
            MEM_SYNC
        }
    
        (SHM operations)
    
        WIN_UNLOCK (target=1) {
            MEM_SYNC
            PROC_SYNC (lock released)
        }
    
        PROC_SYNC -------------------- PROC_SYNC
    
                                        WIN_SYNC {
                                            MEM_SYNC
                                        }
    
                                        (local loads/stores)
    
                                        WIN_SYNC {
                                            MEM_SYNC
                                        }
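
In standard MPI, RANK 1's side of the diagram looks roughly as follows. Note that MPI_Win_sync must be called within a passive-target epoch, so rank 1 keeps a shared lock on its own window open around the load/store phases (rank 0's concurrent epoch is assumed to use a shared lock as well; the window and buffer names are illustrative):

<source lang="c">
#include <mpi.h>

void rank1_local_access(MPI_Win win, volatile int *winbuf, MPI_Comm comm)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);  /* epoch enclosing WIN_SYNC */

    MPI_Win_sync(win);      /* MEM_SYNC */
    winbuf[0] = 42;         /* local loads/stores */
    MPI_Win_sync(win);      /* MEM_SYNC */

    MPI_Barrier(comm);      /* PROC_SYNC: rank 0's locked SHM accesses run now */
    MPI_Barrier(comm);      /* PROC_SYNC */

    MPI_Win_sync(win);      /* MEM_SYNC */
    int v = winbuf[0];      /* local loads see rank 0's updates */
    (void) v;
    MPI_Win_sync(win);      /* MEM_SYNC */

    MPI_Win_unlock(1, win);
}
</source>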
== Reduce O(p) structure on window ==

This includes sharing window information within the node and a local cache for remote window information.

(From Prof. Gropp's comments)

# Detect common arguments (e.g., the same displacement unit or size used on all processes).  Possible enhancement: treat it as an integer map, and share the group compression code.  Ditto for other data.
# Based on CH3 channel needs, determine whether offsets are needed for all processes or just the local process.  In the case where no direct RMA is available, there is no need for all processes to hold the offsets.
# For data that must be available, consider caching and/or accessing it with one-sided operations (e.g., only hold the window base addresses for the windows that you are accessing; have a way to handle a cache "miss").
# Related to 3, store shared information in a single, shared data structure for all processes on the same SMP.  Consider extending this approach to other MPI objects.
== Datatype in RMA ==

(From Prof. Gropp's comments)

The current RMA code does not interface correctly to the dataloop/datatype code. The problem is that it apparently expects a "flattened" version of the dataloop, which can be terrible for performance. At the least, we should consider the following:

# Contiguous must be very fast.
# Strided should be fast (e.g., vector types); this probably means a special case for it.
# Datatypes used at the target may be cached and do not need to be sent every time.
## Caches can be shared on nodes.
## Caches need to flush data, so it may be necessary to refresh a cached item.
# Non-contiguous data should be pipelined if large enough, rather than packed first and then sent in a contiguous lump.
# This must be integrated with one-sided support in the interconnect hardware.
== Ordering ==

MPICH provides a netmod API (get_ordering) that allows a netmod to expose the ordering of its Active-Message-based (AM-based) operations ('''am_ordering'''). A netmod may issue some packets via multiple connections in parallel (such as RMA packets), in which case those packets can arrive unordered; such a netmod should return 0 from the get_ordering function. Otherwise, the netmod returns 1. The default value of '''am_ordering''' in MPICH is 0. Setting it to 1 may improve the performance of the runtime, because it does not need to maintain ordering among AM-based operations. An example implementation of the API in the MXM netmod is as follows:

<source lang="c">
  int MPID_nem_mxm_get_ordering(int *ordering)
  {
    (*ordering) = 1;
    return MPI_SUCCESS;
  }
</source>

When '''am_ordering''' is 1, the MPICH runtime issues only a single REQ_ACK at the end when it wants to detect remote completion of all operations; when '''am_ordering''' is 0, the runtime needs to issue a REQ_ACK with every issued operation to ensure that any future remote completion detection will be correct.

Latest revision as of 20:30, 25 September 2015



Goals

The goals of the RMA infrastructure are as follows:

  1. No usage of O(P) data structures, where 'P' is the number of processes. Data structure sizes should be constant and user-configurable; the general rule is that the user should be able to use more memory to achieve better performance.
  2. Reduce redundant messages where possible (e.g., extra acknowledgment packets).
  3. Reduce process synchronization where possible (e.g., FENCE need not always be synchronizing).
  4. Ensure code is thread-safe (e.g., avoid thread-private data structures).
  5. Ensure integration of RMA code into the overall MPICH progress engine.
  6. Use hardware support for RMA.
  7. Resource management, to avoid exhausting all available resources.

Data structures

1. Overview:

We use a new 3D data structure, Operation Table, to store posted RMA operations.

[Figure: Op-list-slots.jpg]

The following kinds of data structures are involved:

(1) Operation Element (drawn as a circle in the figure): contains all origin/target information needed for this RMA operation, plus a request handle that is set when the operation is issued (and has not completed immediately). When the user posts a new operation, the runtime creates a new operation structure and enqueues it on the corresponding operation list; when the operation completes, the runtime dequeues and frees it.

If req is a NULL pointer, the operation is PENDING; if req is non-NULL, the operation is ISSUED.

Note that the operation list is currently a doubly linked list: the prev pointer of the head points to the tail, and the next pointer of the tail is NULL. We may want to make the operation list singly linked and store the tail pointer in the target element instead.

 struct OP_ELEM {
   /* all origin/target information for this operation */
   MPID_Request *req;             /* NULL while PENDING, set once ISSUED */
   struct OP_ELEM *next, *prev;
 }
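
A trivial predicate capturing the PENDING/ISSUED distinction above (a sketch, not an actual MPICH helper):

 /* Sketch: an operation is ISSUED once its request handle is set;
    a NULL req means the operation is still PENDING. */
 static int op_is_issued(const struct OP_ELEM *op)
 {
   return op->req != NULL;
 }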

(From Prof. Gropp's comments) We may need to consider a lightweight request design for RMA. The current request implementation has high overhead. For example, we should not always initialize/erase all fields when creating/freeing a request, but only the fields we need.

(2) Target Element (drawn as a rhombus in the figure): contains a pointer to an operation list that stores all RMA operations to the same target. It also contains the epoch state of the target in cases where a per-target epoch is used (details below). When the origin first communicates with a target, MPICH creates a new target element and enqueues it on the corresponding target list. When the origin finishes communication with that target, or all internal resources for targets are used up, MPICH dequeues and frees the target element.

 struct TARGET_ELEM {
   struct OP_ELEM *op_list;
   struct OP_ELEM *next_op_to_issue;
   struct TARGET_ELEM *next;
   int rank;
   int ack_counter;   // for passive target
   int op_count;
   int sync_flag; // UNLOCK/FLUSH/FLUSH_LOCAL, for piggybacking
   int lock_state; // LOCK_CALLED/LOCK_ISSUED/LOCK_GRANTED 
   int lock_type; // SHARED/EXCLUSIVE
   int lock_mode; // MODE_NO_CHECK
 }

(3) Slot (drawn as a slot in the figure): contains a pointer to a target list. Targets are currently distributed among slots in a round-robin fashion. During window creation, MPICH allocates a fixed-size slot array on the window (the size of the slot array is controllable by the user).

 struct SLOT {
   struct TARGET_ELEM *targets_list;
 }

(4) Window: contains at least the following information:

 struct WIN {
   int win_state;            // window state
   int mode;                 // MODE_NOCHECK for the LOCK_ALL case
   int ack_counter;          // for active target
   int num_nonempty_slots;   // number of slots with a non-NULL target list, to avoid traversing all slots
 }


2. Performance issues

For every RMA operation routine, MPICH needs to search the corresponding target list to find the correct target, which may add overhead when posting operations. However, with a fixed slot-array size, the lookup cost is linear in the number of targets the origin is actively communicating with. For applications that communicate with only a small number of processes, the operation posting time is usually constant. But if the application communicates with a large number of processes, or with more than one process that maps to the same slot, the linear search of the target list within the slot can cause overhead. A minimal sketch of this lookup follows.
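
The sketch below illustrates this lookup; the names (slots, num_slots, create_and_enqueue_target) are hypothetical, not the actual MPICH identifiers:

 /* Hypothetical sketch: map a target rank to a slot, then walk that
    slot's target list. The slot index is O(1), but the search within
    one slot's target list is linear. */
 struct TARGET_ELEM *find_or_create_target(struct WIN *win, int rank)
 {
   struct SLOT *slot = &win->slots[rank % win->num_slots]; /* round-robin distribution */
   struct TARGET_ELEM *t;
   for (t = slot->targets_list; t != NULL; t = t->next)
     if (t->rank == rank)
       return t;
   return create_and_enqueue_target(win, slot, rank);      /* hypothetical allocator */
 }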

3. Potential benefits

  1. Operations to different targets are not mixed in a single list but are stored in separate lists. This has many benefits. For example, for single-target epochs, we can determine whether all operations to a target have completed (e.g., during a FLUSH operation), and we can find the last operation to a target easily without counting the total number of operations. For multi-target epochs, we can optimize the garbage collection function by stopping the search in the current target's operation list when we meet incomplete operations (since the following operations in that list are likely also incomplete) and jumping to another target's operation list.
  2. Scalability: the size of the slot array is fixed and set by the user, so it stays constant even with a large number of processes on the window.
  3. The garbage collection function can easily distinguish between operations that have not been issued and operations that are issued but not yet completed.

Operation and Target Element Pools

We use two-level element pools for the operation elements and the target elements: window (local) and global. The local pool resources can only be used by operations within the window, while the global pool resources are shared by all windows. The sizes of both the local and global portions of each pool are fixed and can be configured by the user at runtime.

When a new element is needed, we first check if one is available in the corresponding local pool. If the local pool is empty, we try to find one in the global pool. If even the global pool is empty, we will call the CLEANUP_WIN_AGGRESSIVE function (see Basic routines) to free up existing elements. When we are done using an element, we first return it to the local pool. If the local pool is full, we return it to the global pool.

This model ensures that a window can never be starved because other windows are using up all the resources. A sketch of the allocation/free protocol follows.
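
The sketch below shows this protocol; the helpers (pool_pop, pool_push, local_pool, global_pool) are hypothetical, and only CLEANUP_WIN_AGGRESSIVE is named in this document:

 /* Hypothetical sketch of the two-level element pool protocol. */
 static struct OP_ELEM *alloc_op_elem(struct WIN *win)
 {
   struct OP_ELEM *e = pool_pop(&win->local_pool);  /* 1. window-local pool */
   if (e == NULL)
     e = pool_pop(&global_pool);                    /* 2. shared global pool */
   while (e == NULL) {
     CLEANUP_WIN_AGGRESSIVE(win, OP_ELEMENT);       /* 3. reclaim by forcing completion */
     e = pool_pop(&win->local_pool);
     if (e == NULL)
       e = pool_pop(&global_pool);
   }
   return e;
 }

 static void free_op_elem(struct WIN *win, struct OP_ELEM *e)
 {
   if (!pool_push(&win->local_pool, e))  /* return to the local pool first */
     pool_push(&global_pool, e);         /* overflow goes to the global pool */
 }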

TODO: We should investigate allocating op/target pools in shared memory and explore whether we can make progress on behalf of other processes.

Epoch States

RMA epoch states are of two forms: window states and per-target states. Window states are states that affect all targets on the window, while per-target states only affect the individual target they are set for.

There are nine window states:

  1. MPIDI_RMA_NONE
  2. MPIDI_RMA_FENCE_ISSUED
  3. MPIDI_RMA_FENCE_GRANTED
  4. MPIDI_RMA_PSCW_ISSUED
  5. MPIDI_RMA_PSCW_GRANTED
  6. MPIDI_RMA_LOCK_ALL_CALLED
  7. MPIDI_RMA_LOCK_ALL_ISSUED
  8. MPIDI_RMA_LOCK_ALL_GRANTED
  9. MPIDI_RMA_PER_TARGET

There are three per-target states:

  1. MPIDI_RMA_LOCK_CALLED
  2. MPIDI_RMA_LOCK_ISSUED
  3. MPIDI_RMA_LOCK_GRANTED

Note that per-target state is used only when window state is either MPIDI_RMA_LOCK_ALL_CALLED or MPIDI_RMA_PER_TARGET.
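
For reference, the states above can be written as C enumerations roughly as follows (a sketch; the actual MPICH definitions may differ):

 /* Window states (sketch). */
 enum {
   MPIDI_RMA_NONE,
   MPIDI_RMA_FENCE_ISSUED,
   MPIDI_RMA_FENCE_GRANTED,
   MPIDI_RMA_PSCW_ISSUED,
   MPIDI_RMA_PSCW_GRANTED,
   MPIDI_RMA_LOCK_ALL_CALLED,
   MPIDI_RMA_LOCK_ALL_ISSUED,
   MPIDI_RMA_LOCK_ALL_GRANTED,
   MPIDI_RMA_PER_TARGET
 };

 /* Per-target states (sketch); meaningful only when the window state
    is MPIDI_RMA_LOCK_ALL_CALLED or MPIDI_RMA_PER_TARGET. */
 enum {
   MPIDI_RMA_LOCK_CALLED,
   MPIDI_RMA_LOCK_ISSUED,
   MPIDI_RMA_LOCK_GRANTED
 };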

Basic routines

1. ISSUE_OPS_TARGET / ISSUE_OPS_WIN / ISSUE_OPS_GLOBAL: Nonblocking calls. They try to issue as many pending operations as possible for one target / one window / all windows, and return the number of operations issued.

Note that ISSUE_OPS_WIN issues operations in a round-robin fashion across targets, to avoid overloading one target with many messages at once.

2. CLEANUP_TARGET / CLEANUP_WIN / CLEANUP_GLOBAL: Nonblocking calls. They try to find completed operations and targets and clean up as many as possible for one target / one window / all windows, returning the number of operations and targets cleaned up. For active target, these functions also clean up empty target elements. Note that for passive target, empty target elements are not cleaned up by these functions; they must be cleaned up by the corresponding packet handler. This is because in passive target the user might also issue a per-target FLUSH operation, in which case we need to know how many flush acknowledgments we are waiting for from a particular target.

3. MAKE_RMA_PROGRESS_TARGET / MAKE_RMA_PROGRESS_WIN / MAKE_RMA_PROGRESS_GLOBAL: Nonblocking calls. They try to issue pending operations and clean up issued operations for one target / one window / all windows as much as possible.

 MAKE_RMA_PROGRESS_GLOBAL () {
   do {
     x = ISSUE_OPS_GLOBAL();
     y = CLEANUP_GLOBAL();
   } while (x+y != 0);
 }

MAKE_RMA_PROGRESS_GLOBAL is called from the progress engine; MAKE_RMA_PROGRESS_WIN is called in multiple-target cases (FENCE/PSCW/LOCK_ALL/FLUSH_ALL); MAKE_RMA_PROGRESS_TARGET is called in single-target cases (LOCK/FLUSH).

4. PROGRESS_WAIT: a blocking call. It keeps poking the progress engine until a completion signal is caught in the progress engine. Note that in the multithreaded case, a thread blocked in this function may yield the CPU to other threads.

5. CLEANUP_WIN_AGGRESSIVE: a blocking call. It makes progress and pokes the progress engine until resources become available.

 CLEANUP_WIN_AGGRESSIVE(resource_type) {
   if (window_state == FENCE_ISSUED) {
     do PROGRESS_WAIT until the resource is available or window_state == FENCE_GRANTED;
     if (resource is available)
         return;
   }
   
   if (window_state == PSCW_ISSUED) {
     do PROGRESS_WAIT until the resource is available or window_state == PSCW_GRANTED;
     if (resource is available)
         return;
   }
   
   if (window_state == LOCK_ALL_ISSUED) {
     do PROGRESS_WAIT until the resource is available or window_state == LOCK_ALL_GRANTED;
     if (resource is available)
         return;
   }
   
   if (window_state == FENCE_GRANTED || window_state == PSCW_GRANTED) {
     pick one Target Element K;   // might differ based on resource type
     K->sync_flag = FLUSH_LOCAL;
     call PROGRESS_WAIT until the resource is available;
     return;
   }
   
   if (window_state == PER_TARGET || window_state == LOCK_ALL_CALLED || window_state == LOCK_ALL_GRANTED) {
     pick one Target Element K in priority order LOCK_GRANTED, LOCK_ISSUED, LOCK_CALLED;  // might differ based on resource type
     if (resource_type == op element)
         K->sync_flag = FLUSH_LOCAL;
     else   // target element is only freed in the packet handler for passive target
         K->sync_flag = FLUSH;
     call PROGRESS_WAIT until the resource is available;
     return;
   }
 }

Note that for multiple-target cases (FENCE/PSCW) we only do a FLUSH_LOCAL, because we can track the total number of RMA_DONE packets on the window (win_ptr->ack_counter) and wait for all of them in the closing synchronization call. We cannot do the same for passive-target epochs, because we need to know how many RMA_DONE_ACK packets to expect from a particular target if the user calls a FLUSH operation.


Algorithm for FENCE

[Figures: Fence-states.jpg, Legends.jpg]

Note that the window holds a request pointer, fence_req, which stores the IBARRIER request. For details on the memory barrier semantics, see RMA + shared memory.

 FENCE() {
   // If a previous request exists, clean it up
   if (win_ptr->fence_req != NULL) {  // A fence has been previously issued
     if (MPI_MODE_NOPRECEDE) {
       decrement ref count of fence_req by 1;  // let the progress engine delete this request
       set win_ptr->fence_req to NULL;
     }
     else {
       do PROGRESS_WAIT until previous IBARRIER has completed;  // fence_req on window is deleted now
       free win_ptr->fence_req and set it to NULL;
       set window state to FENCE_GRANTED;
     }
   }
   
   // Synchronize as needed
   if (MPI_MODE_NOPRECEDE) { // No operations to complete
     if (SHM is allocated) perform a local BARRIER;  // So that we can issue local operations immediately
     do a global IBARRIER, store the request to fence_req on window;
     set window state to FENCE_ISSUED;
   }
   else { // some operations need to be completed
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE packets to all active targets and increment ack_counter;
     do PROGRESS_WAIT until Operation Table is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_DONE_ACK packets from all active targets (decrement ack_counter, wait until it reaches 0);
     do a global IBARRIER;
     set window state to FENCE_ISSUED; // I am done with all my outgoing messages
     do PROGRESS_WAIT until IBARRIER is completed;  // fence_req on window is freed now
     set window state to FENCE_GRANTED; // I know that everyone else is done
   }
   if (MPI_MODE_NOSUCCEED) { // No operations will start after this
     set window state to UNSET;
   }
 }
 RMA_OP() {
   // Issue local operations immediately
   if (SHM is allocated && target is on the same node) {
       issue operation directly
       return;
   }
   
   // Non-local operation
   do {
       if (window state is FENCE_GRANTED) {
           if (#ops <= PIGGYBACK_THRESHOLD)
               queue up operation and break;  // might need to wait for target and op elements to be available
           else
               issue operation and break;  // if issue didn't complete, might need to queue up request
       }
       else if (window state is FENCE_ISSUED) {
            do progress wait till IBARRIER has completed || (we have a target element && an op element);
           if (IBARRIER completed) {
               set window state to FENCE_GRANTED
               continue;
           }
           else {
               queue up operation and break;  // might need to wait for target and op elements to be available
           }
       }
    } while (1);
 }

Note that if the very last FENCE is not called with MPI_MODE_NOSUCCEED, the window state remains FENCE_GRANTED; other RMA synchronizations will transition the window directly to the correct state. If the very last FENCE is called with MPI_MODE_NOPRECEDE, the window state remains FENCE_ISSUED, and other RMA synchronizations will likewise transition the window directly to the correct state; however, fence_req on the window is not freed yet, and will be freed either when we encounter another MPI_WIN_FENCE or when we reach MPI_WIN_FREE.
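
For context, a typical user-level FENCE epoch with the asserts discussed above looks like the following (standard MPI, not MPICH internals; win, buf, count, and target are assumed to be set up already):

 MPI_Win_fence(MPI_MODE_NOPRECEDE, win);  /* opening fence: no ops precede it */
 MPI_Put(buf, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
 MPI_Win_fence(MPI_MODE_NOSUCCEED, win);  /* closing fence: no ops follow it */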


Algorithm for POST-START-COMPLETE-WAIT

Pscw-states.jpg Legends.jpg

The algorithm for PSCW is roughly the same as the algorithm for FENCE, except for the following points:

(1) Everywhere we did an IBARRIER in the FENCE algorithm, we now do ISENDs (on the target) and IRECVs (on the origin) among the group of processes calling PSCW.

(2) In WIN_POST, if SHM is allocated, the targets wait until all local ISENDs have completed; similarly, in WIN_START, if SHM is allocated, the origins wait until all local IRECVs have completed.

(3) Everywhere we waited for the IBARRIER request to complete in the FENCE algorithm, we now wait for an array of requests (from the ISENDs/IRECVs) to complete instead. The request array is allocated on the window. Its size equals the process group size used in PSCW, which can be large if the application uses PSCW with many processes; however, such applications should be using FENCE instead of PSCW.

(From Prof. Gropp's comments) For medium-scale applications, it is not good to always use direct send/recv for synchronization. We should consider algorithms (e.g., tree-based ones) to optimize such cases.

Note that all of these states apply only to origin processes; for target processes, the state is always MPIDI_RMA_NONE.
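
For context, a typical user-level PSCW epoch looks like the following (standard MPI; origin_group/target_group are assumed to have been built with MPI_Comm_group and MPI_Group_incl):

 if (i_am_target) {
   MPI_Win_post(origin_group, 0, win);   /* expose the window */
   MPI_Win_wait(win);                    /* wait until all origins complete */
 } else {
   MPI_Win_start(target_group, 0, win);  /* open the access epoch */
   MPI_Put(buf, count, MPI_DOUBLE, target_rank, 0, count, MPI_DOUBLE, win);
   MPI_Win_complete(win);                /* complete all outstanding ops */
 }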

Algorithm for LOCK-UNLOCK

In this algorithm, a target element might be created in two situations: (1) in the WIN_LOCK call, where the target element is created for the first time (in the LOCK_CALLED state), or (2) in an RMA operation or WIN_UNLOCK call, where the target element must be recreated because it was cleaned up when someone else needed the resource (it is then recreated in the LOCK_GRANTED state).

When the target element is recreated in the second case above, we lose lock information such as the type of lock (SHARED/EXCLUSIVE) and the mode (e.g., MODE_NOCHECK) for this target. The lock type information is not needed once we are in the LOCK_GRANTED state, but losing the lock mode (MODE_NOCHECK) can hurt performance. This is because, when MODE_NOCHECK is set, we do not acquire the lock and directly set the epoch state to LOCK_GRANTED. Since we lost this information in the recreation, we might end up sending an UNLOCK packet even when the target was not explicitly locked; the target must then process the UNLOCK message and send back an acknowledgment to the origin.
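
For reference, this is how a user requests the mode in question (standard MPI); with MPI_MODE_NOCHECK the user asserts that no other process holds or contends for the lock, so the implementation may skip the lock protocol entirely:

 MPI_Win_lock(MPI_LOCK_SHARED, target_rank, MPI_MODE_NOCHECK, win);
 MPI_Put(buf, count, MPI_DOUBLE, target_rank, 0, count, MPI_DOUBLE, win);
 MPI_Win_unlock(target_rank, win);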

Note that we free the target element only after we have sent the RMA_DONE message and received the RMA_DONE_ACK message from that target.

[Figures: Lock-states.jpg, Legends.jpg]

 WIN_LOCK() {
   set window state to PER_TARGET;
   if (target_rank == my_rank) {
     do PROGRESS_WAIT until I got the lock;
     return;
   }
   if (SHM is allocated && target is a local process) {
     send lock request message;
     do PROGRESS_WAIT until receiving lock granted message;
     return;
   }
   create Target Element and queue it up;  // might need to restore target elements if none are available
   if (MPI_MODE_NOCHECK)
     set per-target state to LOCK_GRANTED;
   else
     set per-target state to LOCK_CALLED;
 }
 RMA_OP() {
   // Issue local operations immediately
   if (SHM is allocated && target is on the same node) {
       issue operation directly
       return;
   }
   
   // Non-local operation
   do {
       if (target queue does not exist || target state is LOCK_GRANTED) {
           if (#ops <= PIGGYBACK_THRESHOLD)
               queue up operation and break;  // might need to wait for target and op elements to be available
           else
               issue operation and break;        // if issue didn't complete, might need to queue up request
       }
       else if (target state is LOCK_ISSUED) {
            do progress wait till state becomes LOCK_GRANTED || (we have a target element && an op element);
           if (state is LOCK_GRANTED) {
               continue;
           }
           else {
               queue up operation and break;  // might need to wait for target and op elements to be available
           }
       }
       else if (target state is LOCK_CALLED) {
           if (op list is empty && basic datatype && size <= single_op_threshold)
               queue up operation and break;  // might need to wait for target and op elements to be available
           else {
               issue lock operation;
               set state to LOCK_ISSUED;
               continue;
           }
       }
    } while (1);
 }
 WIN_UNLOCK() {
   if (target element for this target does not exist)
       create Target Element and queue it up;  // might need to restore target elements if none are available
   
   if (per-target state is LOCK_CALLED) {  // single operation optimization
     issue LOCK-OP-UNLOCK to this target;
     do PROGRESS_WAIT until RMA_DONE_ACK is received;
     free this Target Element;
     set window state to UNSET;
     return;
   }
   if (per-target state is LOCK_ISSUED) {
     do PROGRESS_WAIT until lock granted message is received;
     set per-target state to LOCK_GRANTED;
   }
   if (per-target state is LOCK_GRANTED) {
     do MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE+UNLOCK packet to target;
     do PROGRESS_WAIT until operation list for this target is empty; // all operations are completed;
     do PROGRESS_WAIT until I received RMA_DONE_ACK packet from target;
     free this Target Element;
     set window state to UNSET;
   }
 }


Algorithm for LOCK_ALL-UNLOCK_ALL

When the application issues a WIN_LOCK_ALL call, we set the window state to LOCK_ALL_CALLED. No additional messages are issued at this point.

We use two different protocols for LOCK_ALL-UNLOCK_ALL:

(1) Per-target protocol: When the application issues an RMA op to a target, we lazily issue a lock message to that target and set that target's state to LOCK_ISSUED. This protocol has the advantage that no unnecessary lock messages are issued unless the origin talks to a target process. However, it only works while sufficient target element resources are available. Once we run out of target element resources, we cannot free an existing target element, because once it is freed, if we see another RMA operation to that target we cannot tell whether we already issued the lock message to that target or not.

(2) Window protocol: When the application runs out of target element resources in the per-target protocol, we fall back to the window protocol: the window state is changed to LOCK_ALL_ISSUED, a lock operation is issued to every target to which one has not been issued yet, and we wait until the lock acknowledgments from all of these targets arrive. After this, the window state is set to LOCK_ALL_GRANTED, at which point we hold a lock on all targets. In this case, target element resources can be freed as needed. While the window protocol adds more lock messages and more synchronization among processes, it is more scalable with respect to resources.

Note that switching from the per-target protocol to the window protocol must be handled without allocating additional operation or target elements. This can be done by issuing a constant number of lock operations and keeping track of the associated requests on the function stack (in whichever function triggered the state change from LOCK_ALL_CALLED to LOCK_ALL_ISSUED). Once these requests complete, we issue the next set of requests, until all locks are issued and granted.

Also note that when WIN_LOCK_ALL is called with MODE_NOCHECK, we go directly into the window protocol (but do not need to wait for lock-granted messages). The window state is set to LOCK_ALL_GRANTED immediately.
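
For context, a typical user-level LOCK_ALL epoch looks like the following (standard MPI); passing MPI_MODE_NOCHECK moves the window directly to LOCK_ALL_GRANTED, per the note above:

 MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
 MPI_Put(buf, count, MPI_DOUBLE, target_rank, 0, count, MPI_DOUBLE, win);
 MPI_Win_flush(target_rank, win);  /* complete all ops to this target */
 MPI_Win_unlock_all(win);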

[Figures: Lock-all-states.jpg, Legends.jpg]

 LOCK_ALL() {
   if (MPI_MODE_NOCHECK)
     set window state to LOCK_ALL_GRANTED;
   else
     set window state to LOCK_ALL_CALLED;
 }
 RMA_OPS() {
   if (window state is LOCK_ALL_CALLED) {  // per-target protocol
       if (target queue is available) {
           // follow LOCK/UNLOCK protocol
           MPI_WIN_LOCK(target);
           call RMA_OP on that target;
           return;
       }
       else {
           // fallback to window protocol
           change state to LOCK_ALL_ISSUED
           Issue lock to all targets to which it was not issued
           Wait for all lock acks to come back
           change state to LOCK_ALL_GRANTED
           if (#ops <= PIGGYBACK_THRESHOLD)
              queue up operation and break;  // might need to wait for target and op elements to be available
           else
              issue operation and break;        // if issue didn't complete, might need to queue up request
       }
   }
 }
 UNLOCK_ALL () {
   if (window state is LOCK_ALL_CALLED) {  // per-target protocol
        call MAKE_RMA_PROGRESS to issue all operations in active Target Elements; // for targets whose lock is not yet issued/granted, wait for the lock to be granted
       issue (or piggyback) RMA_DONE+UNLOCK to active targets;
       call PROGRESS_WAIT until all operations are completed;
        call PROGRESS_WAIT until RMA_DONE_ACK messages from active targets have arrived;
       free all Target Elements;
       set window state to UNSET;
   }
 
   if (window state is LOCK_ALL_GRANTED) {  // window protocol
     call MAKE_RMA_PROGRESS to issue all operations in Operation Table;
     issue (or piggyback) RMA_DONE+UNLOCK to every process on window;
     call PROGRESS_WAIT until all operations are completed;
      call PROGRESS_WAIT until RMA_DONE_ACK messages from all processes have arrived;
     free all Target Elements;
     set window state to UNSET;
   }
 }


Algorithm for FLUSH

WIN_FLUSH is roughly the same as WIN_UNLOCK, except that it does not issue (or piggyback) an UNLOCK message, only an RMA_DONE message. It also does not change any window state.

When WIN_FLUSH is used with the window state PER_TARGET and there is no target element, it does not send an RMA_DONE packet, because all operations for this target are guaranteed to have completed when the target element was freed.

When WIN_FLUSH is used with the window state LOCK_ALL_CALLED / LOCK_ALL_GRANTED and there is no target element, it sends an RMA_DONE packet and waits for an RMA_DONE_ACK packet, because RMA operations are not guaranteed to have completed on the target.


Performance optimizations

  1. Single short operation optimization: if there is only one operation between WIN_LOCK and WIN_UNLOCK, with a basic datatype and a size smaller than single_op_threshold, we piggyback both the LOCK and UNLOCK messages on this operation. Specifically, we do not issue the lock request packet or the unlock packet; instead, we send a single packet at WIN_UNLOCK time that contains the lock request, the operation data, and the unlock flag. We must wait for the acknowledgment to come back before returning from WIN_UNLOCK.
  2. Piggyback RMA_DONE (UNLOCK) packet: If there are GET operations in the operation list, we move them to the tail and piggyback the RMA_DONE (or UNLOCK) flag on that message. In such cases, the data returned by the GET operation also piggybacks the RMA_DONE_ACK (or UNLOCK_ACK) flag. If there are no GET operations in the operation list, we piggyback the RMA_DONE packet on the last operation and wait for a separate RMA_DONE_ACK packet to arrive. Specifically, we always track the tail of the operation list before reaching the ending synchronization and update it as needed when we encounter a new RMA operation (see the sketch after this list). When the tail is already a GET/GET_ACCUM/CAS/FOP operation, we never need to update it again. When the tail is a PUT/ACC operation and we encounter a GET operation, we update the tail with that GET operation and need not update it afterwards; if we encounter a non-GET operation, we also update the tail with that operation, because we need to guarantee the ordering of operations. (If the application does not care about the RMA_DONE packet, it should be able to provide an MPI info hint telling the runtime that it need not always wait for the RMA_DONE packet to come back.)
  3. Lazy deallocation of target elements: In passive target communication, the target structure holds additional information such as the mode of the lock (e.g., MPI_MODE_NOCHECK). This information lets us decide whether an additional UNLOCK message needs to be sent. However, if the target element is freed, this information is lost and we may have to fall back to a conservative model that sends UNLOCK packets even when they are not needed. This optimization therefore keeps target elements allocated even after all of the associated operations in the queue have been issued and freed. A target element is freed only when we run out of resources or when we close the epoch.
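
The sketch below captures the tail-tracking rule from optimization 2; the names (piggyback_tail, is_get_like) are hypothetical:

 /* Hypothetical sketch: a GET-like tail (GET/GET_ACCUM/CAS/FOP) is kept,
    since its response data can carry the RMA_DONE_ACK flag; any other
    tail is replaced by the newly posted operation to preserve ordering. */
 static void update_piggyback_tail(struct TARGET_ELEM *t, struct OP_ELEM *new_op)
 {
   if (t->piggyback_tail == NULL || !is_get_like(t->piggyback_tail))
     t->piggyback_tail = new_op;
 }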

Note that the piggyback optimization has a risk of deadlock if every process queues up operations and uses up all operation resources. Using the operation local pool avoids this situation: when we use up all operation resources, we call MAKE_RMA_PROGRESS_TARGET to issue the queued operations, and then call PROGRESS_WAIT until they finish and we can get operation resources back from the local pool. The operation free function always returns an operation element to the local pool first; when the local pool is full, it returns the element to the global pool.

A key shortcoming of all of these performance optimizations is that they hold on to resources that would otherwise be freed without user intervention. That is, without these optimizations, a process waiting in the progress engine can reclaim resources within a finite amount of time (assuming other processes in the system are making MPI calls). With these optimizations, this might not be true. For example, if the system has operation elements queued up waiting for an UNLOCK packet (for the single short operation optimization mentioned above), those operation elements will not be freed until the user issues the UNLOCK operation (or a flush operation). To handle such situations, we need to detect cases of resource exhaustion in the implementation and temporarily disable the appropriate optimizations to reclaim resources.

RMA + threads

PROGRESS_WAIT can only happen in the main function calls associated with RMA operations and RMA synchronizations, and we need to make sure those RMA calls work correctly with multithreading. The interleaving situations that need to be considered are as follows:

1. PUT/GET/ACC + PUT/GET/ACC

2. PUT/GET/ACC + WIN_FENCE (no MODE_NOSUCCEED and no MODE_NOPRECEDE)

3. PUT/GET/ACC + WIN_FLUSH

4. WIN_FLUSH + WIN_FLUSH

All other interleaving situations are invalid.

TODO: thread interleaving needs a more careful and detailed definition and clarification.

Enabling multithreading in WIN_FLUSH: To achieve this, we piggyback the FLUSH_DONE (i.e., RMA_DONE) flag on the last packet. Specifically, each thread (say T1) creates a response request on the origin and passes its handle in the last packet. The completion of that response request is handed over to the progress engine, like that of other operations. The response request is completed and freed on the origin only when the FLUSH_DONE_ACK (i.e., RMA_DONE_ACK) packet arrives, which indicates the completion of WIN_FLUSH on thread T1. Note that the thread that creates the response request (T1) and the thread that completes it (say T2) are not necessarily the same; as long as the response request created by T1 gets completed, WIN_FLUSH on T1 can return.
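
A minimal sketch of interleaving 4 (WIN_FLUSH + WIN_FLUSH), assuming MPI was initialized with MPI_THREAD_MULTIPLE, a lock-all epoch is already open on win, and OpenMP provides the threads:

 #include <omp.h>

 #pragma omp parallel num_threads(2)
 {
   int target = omp_get_thread_num();  /* thread 0 flushes rank 0, thread 1 rank 1 */
   MPI_Win_flush(target, win);         /* returns once this thread's response request
                                          completes, regardless of which thread drove
                                          the progress engine */
 }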

RMA + shared memory

When SHM is allocated for an RMA window, we need to add memory barriers at the proper places in the RMA synchronization routines to guarantee the ordering of read/write operations, so that any operations after a synchronization call see the correct data.

There are four kinds of operations involved in the following explanation:

(1) Local loads/stores: any operations that happen outside an RMA epoch and access the process's own window memory.

(2) SHM operations: any operations that happen inside an RMA epoch. They may access any process's window memory, including direct loads/stores as well as RMA operations that are internally implemented as direct loads/stores in the MPI implementation.

(3) PROC_SYNC: synchronization among processes by sending/receiving messages.

(4) MEM_SYNC: a full memory barrier. It ensures the ordering of read/write operations on each process.
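
For illustration, MEM_SYNC corresponds to a full fence. In portable C11 it could be written as below; MPICH internally uses its own atomics primitives, so this is only a stand-in:

 #include <stdatomic.h>

 /* Stand-in for MEM_SYNC: a full fence that orders all earlier loads and
    stores of the calling process before all later ones. */
 #define MEM_SYNC() atomic_thread_fence(memory_order_seq_cst)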


The following explains where memory barriers must be placed in the different RMA synchronization routines.

1. FENCE synchronization

              RANK 0                           RANK 1
     
      (local loads/stores)            (local loads/stores)
     
          WIN_FENCE {                      WIN_FENCE {
              MEM_SYNC                       MEM_SYNC
              PROC_SYNC -------------------- PROC_SYNC
              MEM_SYNC                       MEM_SYNC
          }                              }
   
       (SHM operations)                (SHM operations)
   
          WIN_FENCE {                    WIN_FENCE {
              MEM_SYNC                       MEM_SYNC
              PROC_SYNC -------------------- PROC_SYNC
              MEM_SYNC                       MEM_SYNC
          }                              }
   
     (local loads/stores)             (local loads/stores)

We need MEM_SYNC before and after PROC_SYNC in both the starting and the ending WIN_FENCE, to ensure the ordering between local loads/stores and PROC_SYNC in the starting WIN_FENCE (and vice versa in the ending WIN_FENCE), and the ordering between PROC_SYNC and SHM operations in the starting WIN_FENCE (and vice versa in the ending WIN_FENCE).

In starting WIN_FENCE, the MEM_SYNC before PROC_SYNC essentially exposes previous local loads/stores to other processes; after PROC_SYNC, each process knows that everyone else already exposed their local loads/stores; the MEM_SYNC after PROC_SYNC ensures that my following SHM operations will happen after PROC_SYNC and will see the latest data on other processes.

In ending WIN_FENCE, the MEM_SYNC before PROC_SYNC essentially exposes previous SHM operations to other processes; after PROC_SYNC, each process knows everyone else already exposed their SHM operations; the MEM_SYNC after PROC_SYNC ensures that my following local loads/stores will happen after PROC_SYNC and will see the latest data in my memory region.

2. POST-START-COMPLETE-WAIT synchronization

             RANK 0                        RANK 1
  
                                    (local loads/stores)
          
          WIN_START {                  WIN_POST {
                                           MEM_SYNC
              PROC_SYNC ------------------ PROC_SYNC
              MEM_SYNC
          }                             }
          
        (SHM operations)
  
          WIN_COMPLETE {              WIN_WAIT/TEST {
              MEM_SYNC
              PROC_SYNC ----------------- PROC_SYNC
                                          MEM_SYNC
          }                             }
    
                                    (local loads/stores)

We need MEM_SYNC before PROC_SYNC for WIN_POST and WIN_COMPLETE, and MEM_SYNC after PROC_SYNC in WIN_START and WIN_WAIT/TEST, to ensure the ordering between local loads/stores and PROC_SYNC in WIN_POST (and vice versa in WIN_WAIT/TEST), and the ordering between PROC_SYNC and SHM operations in WIN_START (and vice versa in WIN_COMPLETE).

In WIN_POST, the MEM_SYNC before PROC_SYNC essentially exposes previous local loads/stores to the group of origin processes; after PROC_SYNC, the origin processes know that all target processes have already exposed their local loads/stores; in WIN_START, the MEM_SYNC after PROC_SYNC ensures that the following SHM operations happen after PROC_SYNC and see the latest data on the target processes.

In WIN_COMPLETE, the MEM_SYNC before PROC_SYNC essentially exposes previous SHM operations to the group of target processes; after PROC_SYNC, the target processes know that all origin processes have already exposed their SHM operations; in WIN_WAIT/TEST, the MEM_SYNC after PROC_SYNC ensures that the following local loads/stores happen after PROC_SYNC and see the latest data in my memory region.

3. Passive target synchronization

Note that for passive target synchronization, when a process wants to access its own window memory, it must do those accesses within a passive epoch (turning the local loads/stores into SHM operations) or use MPI_WIN_SYNC before and after the local loads/stores; otherwise those local accesses may be reordered and lead to incorrect results (clarification of this is needed in the MPI specification). See the following for details.

             RANK 0                          RANK 1
   
                                       WIN_LOCK(target=1) {
                                           PROC_SYNC (lock granted)
                                           MEM_SYNC
                                       }
  
                                       (SHM operations)
  
                                       WIN_UNLOCK(target=1) {
                                           MEM_SYNC
                                           PROC_SYNC (lock released)
                                       }
  
        PROC_SYNC -------------------- PROC_SYNC
  
        WIN_LOCK (target=1) {
            PROC_SYNC (lock granted)
            MEM_SYNC
        }
 
        (SHM operations)
 
        WIN_UNLOCK (target=1) {
            MEM_SYNC
            PROC_SYNC (lock released)
        }
 
        PROC_SYNC -------------------- PROC_SYNC
 
                                       WIN_LOCK(target=1) {
                                           PROC_SYNC (lock granted)
                                           MEM_SYNC
                                       }
  
                                       (SHM operations)
  
                                       WIN_UNLOCK(target=1) {
                                           MEM_SYNC
                                           PROC_SYNC (lock released)
                                       }
  

We need MEM_SYNC after PROC_SYNC in WIN_LOCK, and MEM_SYNC before PROC_SYNC in WIN_UNLOCK, to ensure the ordering between SHM operations and PROC_SYNC and vice versa.

In WIN_LOCK, the MEM_SYNC after PROC_SYNC guarantees two things: (a) the following SHM operations happen after the lock is granted; (b) the following SHM operations happen after any PROC_SYNC with the target that occurred before WIN_LOCK was called, which means those SHM operations see the latest data on the target process.

In WIN_UNLOCK, the MEM_SYNC before PROC_SYNC also guarantees two things: (a) the SHM operations happen before the lock is released; (b) the SHM operations happen before any PROC_SYNC with the target after WIN_UNLOCK returns, which means subsequent SHM operations on that target see the latest data.

WIN_LOCK_ALL/WIN_UNLOCK_ALL behave the same as WIN_LOCK/WIN_UNLOCK.

             RANK 0                          RANK 1
 
        WIN_LOCK_ALL
 
        (SHM operations)
 
        WIN_FLUSH(target=1) {
            MEM_SYNC
        }
 
        PROC_SYNC ----------------------- PROC_SYNC
 
                                          WIN_LOCK(target=1) {
                                              PROC_SYNC (lock granted)
                                              MEM_SYNC
                                          }
 
                                          (SHM operations)
 
                                          WIN_UNLOCK(target=1) {
                                              MEM_SYNC
                                              PROC_SYNC (lock released)
                                          }
 
        WIN_UNLOCK_ALL

We need MEM_SYNC in WIN_FLUSH to ensure the ordering between SHM operations and PROC_SYNC.

The MEM_SYNC in WIN_FLUSH guarantees that all SHM operations before the WIN_FLUSH happen before any PROC_SYNC with the target after the WIN_FLUSH, which means SHM operations on the target process after its PROC_SYNC with the origin see the latest data.

The user can also use MPI_WIN_SYNC in passive target synchronization to guarantee the ordering of load/store operations, as the following diagram and example show.

             RANK 0                          RANK 1
   
                                         WIN_SYNC {
                                             MEM_SYNC
                                         }
  
                                       (local loads/stores)
  
                                         WIN_SYNC {
                                             MEM_SYNC
                                         }
  
        PROC_SYNC -------------------- PROC_SYNC
  
        WIN_LOCK (target=1) {
            PROC_SYNC (lock granted)
            MEM_SYNC
        }
 
        (SHM operations)
 
        WIN_UNLOCK (target=1) {
            MEM_SYNC
            PROC_SYNC (lock released)
        }
 
        PROC_SYNC -------------------- PROC_SYNC
 
                                       WIN_SYNC {
                                           MEM_SYNC
                                       }
  
                                       (local loads/stores)
  
                                       WIN_SYNC {
                                           MEM_SYNC
                                       }
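
In user code, the RANK 1 side of the diagram above corresponds to something like the following (standard MPI). MPI_WIN_SYNC must be called within a passive-target epoch, so rank 1 here holds a lock on its own window:

 MPI_Win_lock(MPI_LOCK_SHARED, my_rank, 0, win);
 MPI_Win_sync(win);                     /* MEM_SYNC */
 /* ... local loads/stores on the window memory ... */
 MPI_Win_sync(win);                     /* MEM_SYNC */
 MPI_Win_unlock(my_rank, win);
 MPI_Barrier(comm);                     /* PROC_SYNC with rank 0 */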

Reduce O(P) structure on window

This includes sharing window information within the node and local cache for remote window information.

(From Prof. Gropp's comments)

  1. Detect common arguments (e.g., the same displacement unit or size used on all processes). Possible enhancement: treat this as an integer map, and share the group compression code. Ditto for other data.
  2. Based on CH3 channel needs, determine whether offsets are needed for all processes or just the local process. In the case where no direct RMA is available, there is no need for all processes to hold the offsets.
  3. For data that must be available, consider caching and/or accessing it with one-sided operations (e.g., only hold the window base addresses for the windows that you are accessing; have a way to handle a cache "miss").
  4. Related to 3, store shared information in a single, shared data structure for all processes on the same SMP. Consider extending this approach to other MPI objects.

Datatype in RMA

(From Prof. Gropp's comments)

The current RMA code doesn't interface correctly with the dataloop/datatype code. The problem is that it apparently expects a "flattened" version of the dataloop, which can be terrible for performance. At the least, we should consider the following:

  1. Contiguous transfers must be very fast.
  2. Strided transfers should be fast (e.g., vector types); this probably means special-casing them (see the example after this list).
  3. Datatypes used at the target may be cached and need not be sent every time.
    1. Caches can be shared on nodes.
    2. Caches need to flush data, so it may be necessary to refresh a cached item.
  4. Noncontiguous data should be pipelined if large enough, rather than packed first and then sent in one contiguous lump.
  5. This must be integrated with one-sided support in the interconnect hardware.
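
The strided case in item 2 corresponds to user code like the following (standard MPI): a vector type that the implementation should fast-path rather than flatten element by element:

 MPI_Datatype strided;
 MPI_Type_vector(1024, 1, 8, MPI_DOUBLE, &strided);  /* 1024 doubles, stride 8 */
 MPI_Type_commit(&strided);
 MPI_Put(origin_buf, 1, strided, target_rank, 0, 1, strided, win);
 MPI_Type_free(&strided);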

Ordering

MPICH provides a netmod API (get_ordering) to allow a netmod to expose the ordering of Active-Message-based (AM-based) operations (am_ordering). A netmod may issue some packets via multiple connections in parallel (such as RMA packets), and those packets can arrive unordered. In that case, the netmod should return 0 from the get_ordering function; otherwise, it returns 1. The default value of am_ordering in MPICH is 0. Setting it to 1 may improve runtime performance, because the runtime then does not need to maintain ordering among AM-based operations. An example implementation of the API in the MXM netmod is as follows:

  int MPID_nem_mxm_get_ordering(int *ordering)
  {
    (*ordering) = 1;
    return MPI_SUCCESS;
  }

When am_ordering is 1, the MPICH runtime issues only a single REQ_ACK at the end when it wants to detect remote completion of all operations; when am_ordering is 0, the runtime must issue a REQ_ACK with every issued operation to ensure that any future remote completion detection is correct.