Progress of scalable RMA implementation
Revision as of 17:13, 17 December 2014
Note: Please do not delete words with strikethrough.
- Scalability issues:
  - <s> Global pool for lock requests </s>
  - <s> Global pool for data in RMA operation piggybacked with LOCK </s>
    - Compare performance of defragmentation by MPICH and by OS
    - BGQ does not do defragmentation
  - Streaming ACC operation
  - <s> Scalability problem with derived datatypes </s>
    - <s> Send/recv has the same problem. </s>
    - <s> Fixing it needs flow control in mpich which brings a lot of overhead, so here we will not fix this. </s>
  - <s> Allow user to pass MPI info hints to indicate if they will use passive target on this window, if not, we do not need to allocate lock request / data block pool. </s> (see the info-hint sketch after this list)
  - <s> Manage internal requests? --- currently it is bounded by a CVAR. </s>
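As a concrete illustration of the info-hint item above: MPI-3 already defines the standard window info key "no_locks", which tells the implementation that passive target (MPI_Win_lock / MPI_Win_lock_all) will not be used on the window. A minimal sketch of how a user would pass such a hint is below; whether MPICH uses this particular key to skip allocating the lock request / data block pool is an assumption here, not a statement about the implementation.

<pre>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Win  win;
    void    *base;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* Hint: no passive-target synchronization will be used on this
     * window, so the implementation may skip allocating the lock
     * request / data block pools. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "no_locks", "true");

    MPI_Win_allocate(4096, 1, info, MPI_COMM_WORLD, &base, &win);

    /* ... active-target (fence / PSCW) communication only ... */

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
</pre>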
- Performance profiling
- OSU benchmark: simple latency and bandwidth
- Poking the progress engine worsens the performance; should optimize the progress engine
- Graph500
- Poking the progress engine in operation routines improves the performance
- MXM RC is worse than UD on Fusion
- Commit 1555ef7 MPICH-specific initialization of mxm
- Routine mxm_invoke_callback() spends a lot of time
- Reproduce it in a simple program
- Needs to test more data sets
- OSU benchmark: simple latency and bandwidth
- Channel hooks and Netmod hooks
- Window creation: done
- Synchronization: hardware flush added (performance is not good); other synchronization operations are not done yet
- Issues (TODO):
- PUT_SYNC + FENCE does not guarantee remote completion in an SHM environment (i.e., NOLOCAL); we may need to handle it in MPICH, or MXM should provide a unified interface for remote completion.
- MXM HW flush does not call the MPI progress engine, so while it is blocking in the HW flush we cannot make progress on other SW parts (may deadlock in the multithreaded case?).
- Performance improvement (DONE):
- MXM HW flush internally issues a PUT_SYNC only when there are PUT/GET operations that have been issued and are waiting for remote completion; otherwise it does nothing. This optimization avoids an unnecessary HW flush when no operation is waiting for remote completion.
- A PUT is always waiting for remote completion before the HW flush.
- A GET is waiting for remote completion only until its local completion.
- Possible optimization (TODO):
- The HW flush can be ignored if a SW flush is also issued (i.e., the mixed SW/HW OPs case), because MXM guarantees the ordering of all OPs, including Send/Receive and RDMA, issued on the same connection. CH3 may give the netmod a hint (i.e., a force_flag) in HW flush, so the netmod can ignore the HW flush if strict ordering is supported, as in MXM (a hypothetical sketch follows this section).
- In the GET-only case, we do not need to issue a HW flush in MXM; instead we can wait on a counter that is decremented at GET local completion.
- RMA operations: PUT/GET done
- Possible optimization (TODO):
- In MXM RMA, we only need to return an MPI_Request for Rput/Rget.
- Rebase on mpich/master: done
- Code might need to be cleaned up
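The force_flag idea in the "Possible optimization" item above can be pictured as follows. Everything in this sketch is a hypothetical stand-in (the hook name netmod_hw_flush, the win_target_info struct, and the PUT_SYNC helper are not the real CH3/netmod interface); it only illustrates how CH3 could pass a hint so that a netmod with strict ordering, such as MXM, skips a HW flush that a SW flush already covers.

<pre>
/* Hypothetical sketch only: names and types are stand-ins, not the
 * real CH3/netmod hook interface. */
#include <stdbool.h>

typedef struct {
    int  pending_remote_ops;  /* PUT/GET issued, remote completion pending */
    bool strict_ordering;     /* netmod orders send/recv and RDMA on one connection (e.g., MXM) */
} win_target_info;

/* force_flag: set by CH3 when the HW flush must really be performed.
 * When it is 0 and the netmod guarantees strict ordering, the HW
 * flush can be skipped, because a SW flush issued on the same
 * connection already implies remote completion of earlier RDMA. */
static int netmod_hw_flush(win_target_info *t, int force_flag)
{
    if (!force_flag && t->strict_ordering)
        return 0;   /* the SW flush covers it */

    if (t->pending_remote_ops == 0)
        return 0;   /* nothing is waiting for remote completion */

    /* Otherwise issue a PUT_SYNC and wait for its remote completion
     * (a hypothetical helper would be called here). */
    return 0;
}
</pre>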
- Epoch check in win_free
- Add epoch check in nemesis win_free (DONE)
- Add an error test for the epoch check in win_free for the allocate_shm window (see /errors/rma/win_sync_free_at, and the sketch below) (TODO)
- Merge shm_win_free into nemesis_win_free so that it is called after netmod win_free (TODO)
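A minimal sketch of what the error test mentioned above could look like, assuming it runs on a single node so MPI_Win_allocate_shared succeeds on MPI_COMM_WORLD; the actual test under /errors/rma/ may be structured differently. The idea is to free the window while a passive-target epoch is still open and check that MPI_Win_free reports an error instead of succeeding.

<pre>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Win  win;
    int     *base, rank, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &base, &win);
    MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

    /* Open a passive-target epoch and "forget" to close it. */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);

    err = MPI_Win_free(&win);   /* should fail: epoch still open */
    if (err == MPI_SUCCESS && rank == 0)
        printf("missing epoch check in win_free\n");

    if (err != MPI_SUCCESS) {
        /* clean up properly so the sketch can finalize */
        MPI_Win_unlock(rank, win);
        MPI_Win_free(&win);
    }
    MPI_Finalize();
    return 0;
}
</pre>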
- Miscellaneous TODOs:
- Add new PVARs for RMA (see the MPI_T sketch after this list):
- Do this last
- Support for multi-level request handlers
- Support for network-specific AM orderings
- The window does not store the info hints passed by the user
- Design document
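Once the new RMA PVARs exist, they are visible through the standard MPI tool information interface (MPI_T); the loop below uses only standard MPI_T calls (no MPICH-specific variable names are assumed) and can be used to check that the new variables are exported.

<pre>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num, i;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    /* List every exported performance variable; newly added RMA
     * PVARs should show up here by name. */
    MPI_T_pvar_get_num(&num);
    for (i = 0; i < num; i++) {
        char name[256], desc[256];
        int  name_len = sizeof(name), desc_len = sizeof(desc);
        int  verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dt;
        MPI_T_enum   et;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dt, &et, desc, &desc_len, &bind,
                            &readonly, &continuous, &atomic);
        printf("pvar %d: %s\n", i, name);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
</pre>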
- Very detailed TODOs:
- Completely remove immed_len in the pkt struct? Must the origin type and target type be the same if a basic datatype is specified?
- For the FOP and CAS operations, should the IMMED data size be specified?
- Find the packet header size limit; do not make headers too big
- Make sure aggressive cleanup functions work correctly in all situations
- Make sure runtime works correctly when pool is very small
- Make sure graph500 runs correctly with validation