Communicators and Context IDs
- 1 What Is A Context ID?
- 2 Context ID Mask
- 3 Context Type Suffix
- 4 Context ID API
- 5 When and How Context IDs Are Selected For Communicators
What Is A Context ID?
When MPI receives a message and matches it against MPI_Recv requests, it compares the message's envelope to the MPI_Recv's envelope. The envelope is the triple of (source, tag, communicator). The source and tag are explicit integers, whereas the communicator is a logical construct indicating a particular communication context. In MPICH2 this context is implemented via an additional tag field known as the context ID. It's worth remembering that there is no wildcard matching for communicators.
The MPICH2 context ID is a 16-bit integer field that is structured as follows:
In this crude diagram each character represents a bit. There are three fields of the context ID indicated by letter and color:
- Mask Word Index: the index into the context ID mask (explained below).
- Bit Index: the bit within the mask word that this ID refers to.
- Context Type Suffix: used to indicate different communication contexts within a communicator. For example, user point-to-point messages (MPI_Send/MPI_Recv) occur in a different context than collective messages (MPI_Bcast, etc.). This is also explained further below.
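The three fields can be cross-checked against the ID values quoted elsewhere on this page (bit 31 of word 0 gives prefix 124, bit 31 of word 1 gives prefix 252). The following is a minimal sketch, assuming a 2-bit suffix and a 5-bit bit index (32-bit mask words), with the word index in the remaining high bits. The helper names (`ctx_make`, `ctx_word`, `ctx_bit`, `ctx_suffix`) are hypothetical and are not part of MPICH2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, inferred from the ID values quoted on this page. */
#define CTX_SUFFIX_BITS 2u   /* context type suffix */
#define CTX_BIT_BITS    5u   /* bit index within a 32-bit mask word */

/* Build an ID from (mask word index, bit index, suffix). */
static uint16_t ctx_make(unsigned word, unsigned bit, unsigned suffix)
{
    assert(bit < 32u && suffix < 4u);
    return (uint16_t)((((word << CTX_BIT_BITS) | bit) << CTX_SUFFIX_BITS) | suffix);
}

/* Split an ID back into its three fields. */
static unsigned ctx_word(uint16_t id)   { return id >> (CTX_BIT_BITS + CTX_SUFFIX_BITS); }
static unsigned ctx_bit(uint16_t id)    { return (id >> CTX_SUFFIX_BITS) & 31u; }
static unsigned ctx_suffix(uint16_t id) { return id & 3u; }
```

For example, `ctx_make(1, 31, 0)` yields 252, and `ctx_word(252)` recovers word 1.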
The actual type of a context ID is
MPIR_Context_id_t, which is
FIXME XXX DJG I think that the context_id code is broken because it uses right shifts of a potentially negative number to obtain the Mask Word Index. This will result in the wrong word about half the time when the top bit is set, because a right shift of a negative value typically fills the high bits with ones. We should change MPIR_Context_id_t to a uint16_t.
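The concern in this note can be illustrated in isolation. The sketch below assumes the word index sits above a 5-bit bit index and a 2-bit suffix (so a shift of 7); both function names are hypothetical. The signed variant is shown only for contrast: in C, the result of right-shifting a negative signed value is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper: extract the mask word index from a 16-bit ID.
 * The value is reinterpreted as unsigned first, so the shift cannot
 * sign-extend regardless of the top bit. */
static unsigned word_index_unsigned(int16_t raw_id)
{
    uint16_t id = (uint16_t)raw_id;
    return (unsigned)(id >> 7);
}

/* The pattern the note warns about: shifting the signed value directly.
 * For negative inputs the result is implementation-defined in C and is
 * commonly a negative number (arithmetic shift). */
static int word_index_signed(int16_t raw_id)
{
    return raw_id >> 7;
}
```

Both variants agree for non-negative IDs; they can only diverge once the top bit is set.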
Context ID Mask
The context ID mask is a bit vector that is used to keep track of which context IDs have been allocated. In the current code it is an array of MAX_CONTEXT_MASK (256) 32-bit unsigned ints for a total of 8192 bits, one per context ID prefix. Each process has its own mask, and its state may vary from process to process depending on communicator membership patterns.
Mask Access And Multi-threading
Talk about critical sections, the local context mask, and
TODO finish this section. In the meantime, you can examine the code for a better explanation: src/mpi/comm/commutil.c:332
Problems and Gotchas
There are several issues and things to watch out for when working on the context ID code in
- The current code expects that unsigned int values are 32 bits or larger. The comments imply that it needs exactly 32-bit unsigned ints, but it looks like we lucked out and it should work with larger sizes as well. This needs to be cleaned up in the current code.
- IDs are allocated from the lowest available mask integer index but the highest available bit index within that integer. This leads to a nice looking pattern when the mask is viewed as a hex string via the MPIR_ContextMaskToStr function but a strange ordering of ID values (124, 120, 116, ..., 0, 252, 248, ..., 128, 380, etc).
- While new IDs are allocated in the fashion described just above, the three default communicators (MPI_COMM_WORLD, MPI_COMM_SELF, and MPIR_ICOMM_WORLD) take up bits 0-2 of word 0 (prefixes 0, 4, and 8). In contrast, the first context ID allocated after MPI_Init will be bit 31 of word 0 (ID prefix 124). This works out OK; it's just surprising when you are debugging and get "03fffff8ffffffff..." when you print out the mask field. It wouldn't hurt to change this to something less surprising if we get the time.
- In the threaded version the out-of-IDs check should probably exit either way. If it is out of IDs, we can corrupt the mask and/or leak IDs because the MPIR_Find_context_bit function does manipulate the mask.
- MPI_Comm_split for intercommunicators essentially duplicates the work of MPIR_Get_intercomm_contextid but determines the recvcontext_id in the reverse order. This has no effect on correctness after the function returns, but it can be confusing when working on the code. MPI_Comm_split should probably be converted to use
- The temporary communicator used for connect/accept uses a hard-coded tag of 4095. This has the potential to cause problems and should be explained. An arbitrary integer constant shouldn't exist in this type of code. Instead it should be based off of something such as
Context Type Suffix
The last two bits of the ID are used to indicate different communication contexts within a communicator. Point-to-point and collective communication occur in separate contexts and use a different suffix to form different context IDs.
There are four possible values. What does each one mean?
XXX DJG finish this section
Context ID API
static char *MPIR_ContextMaskToStr(void)
Returns the context mask as a hex string. Useful to dump the state of the context mask while debugging.
static void MPIR_Init_contextid(void)
Sets all of the bits of the context mask to 1 except for bits 0, 1, and 2 of word 0, which are reserved for the predefined communicators.
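The effect of initialization, and the hex dump that the debugging helper produces, can be sketched as follows. This is a standalone illustration rather than the MPICH2 source: `init_context_mask` and `context_mask_to_str` are hypothetical names, and printing eight hex digits per word is an assumption chosen to match the "03fffff8ffffffff..." output quoted on this page.

```c
#include <stdio.h>

#define MAX_CONTEXT_MASK 256   /* number of 32-bit mask words */

static unsigned int context_mask[MAX_CONTEXT_MASK];

/* Mark every ID as available, then reserve bits 0-2 of word 0. */
static void init_context_mask(void)
{
    for (int i = 0; i < MAX_CONTEXT_MASK; i++)
        context_mask[i] = 0xFFFFFFFFu;
    context_mask[0] &= ~0x7u;
}

/* Render the mask as hex, eight digits per word.
 * Returns a pointer to a static buffer that is overwritten on each call. */
static const char *context_mask_to_str(void)
{
    static char buf[MAX_CONTEXT_MASK * 8 + 1];
    for (int i = 0; i < MAX_CONTEXT_MASK; i++)
        snprintf(buf + 8 * i, 9, "%08x", context_mask[i] & 0xFFFFFFFFu);
    return buf;
}
```

Right after initialization the string begins with "fffffff8". Once the six highest bits of word 0 have been handed out, it begins with "03fffff8".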
static int MPIR_Find_context_bit( unsigned int local_mask )
Finds the highest set bit of the lowest nonzero word in the given mask. It clears that bit in the context_mask and returns the found ID prefix.
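The search order described here can be sketched as below. This is an illustration under stated assumptions, not the MPICH2 implementation: the names are hypothetical, the prefix is assumed to be `(word * 32 + bit) << 2` (consistent with the prefixes 124 and 252 quoted on this page), and returning 0 when no bit is available is an assumption of the sketch.

```c
#include <stdint.h>

#define MAX_CONTEXT_MASK 256

static uint32_t context_mask[MAX_CONTEXT_MASK];

/* All IDs free except bits 0-2 of word 0 (the predefined communicators). */
static void init_mask_sketch(void)
{
    for (int i = 0; i < MAX_CONTEXT_MASK; i++)
        context_mask[i] = UINT32_C(0xFFFFFFFF);
    context_mask[0] &= ~UINT32_C(0x7);
}

/* Pick the highest set bit of the lowest nonzero word of local_mask,
 * clear that bit in the global context_mask, and return the ID prefix.
 * Returns 0 if local_mask has no bits set (assumption of this sketch). */
static int find_context_bit(const uint32_t local_mask[])
{
    for (int i = 0; i < MAX_CONTEXT_MASK; i++) {
        if (local_mask[i] == 0)
            continue;
        int bit = 31;
        while (!(local_mask[i] & (UINT32_C(1) << bit)))
            bit--;
        context_mask[i] &= ~(UINT32_C(1) << bit);
        return (i * 32 + bit) << 2;
    }
    return 0;
}
```

Calling `find_context_bit(context_mask)` repeatedly after `init_mask_sketch()` produces 124, 120, 116, and so on, and moves to 252 once word 0 is exhausted.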
int MPIR_Get_contextid(MPID_Comm *comm_ptr, MPIR_Context_id_t *context_id)
Allocates a new context ID prefix collectively over the given communicator comm_ptr. Returns the new context ID in context_id. The core of the algorithm copies the current state of the mask to a local buffer and then performs an NMPI_Allreduce with an MPI_BAND operation to find the intersection of valid context IDs among all participating processes. The result of this reduction is fed to MPIR_Find_context_bit to determine the new context ID prefix.
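To see why the bitwise AND yields an ID that is free everywhere, the reduction can be simulated in a single process. The sketch below is not MPI code: `band_all` stands in for the MPI_BAND allreduce, each row of `masks` plays the role of one process's mask, and all names are hypothetical. The prefix formula `(word * 32 + bit) << 2` is inferred from the values quoted on this page.

```c
#include <stdint.h>
#include <string.h>

#define MASK_WORDS 256

/* Stand-in for the MPI_BAND allreduce: AND together the masks of all
 * simulated processes, leaving only IDs that are free on every one. */
static void band_all(uint32_t masks[][MASK_WORDS], int nprocs,
                     uint32_t out[MASK_WORDS])
{
    memcpy(out, masks[0], MASK_WORDS * sizeof(uint32_t));
    for (int p = 1; p < nprocs; p++)
        for (int i = 0; i < MASK_WORDS; i++)
            out[i] &= masks[p][i];
}

/* Choose the highest set bit of the lowest nonzero word of the
 * intersection, mark it allocated on every simulated process, and
 * return the ID prefix. Returns 0 if no common ID is free. */
static int get_contextid_sim(uint32_t masks[][MASK_WORDS], int nprocs)
{
    uint32_t common[MASK_WORDS];
    band_all(masks, nprocs, common);
    for (int i = 0; i < MASK_WORDS; i++) {
        if (common[i] == 0)
            continue;
        int bit = 31;
        while (!(common[i] & (UINT32_C(1) << bit)))
            bit--;
        for (int p = 0; p < nprocs; p++)
            masks[p][i] &= ~(UINT32_C(1) << bit);
        return (i * 32 + bit) << 2;
    }
    return 0;
}
```

If one process has already used bit 31 of word 0 and another has used bit 30, the intersection's highest free bit is 29, so both receive prefix 116.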
int MPIR_Get_intercomm_contextid( MPID_Comm *comm_ptr, MPIR_Context_id_t *context_id, MPIR_Context_id_t *recvcontext_id)
Called by MPIR_Comm_copy to get context IDs for a new intercommunicator from an old intercommunicator. Note that it returns a pair of IDs, one for sending and one for receiving.
When and How Context IDs Are Selected For Communicators
There are three predefined communicators that reserve context IDs at MPI_Init time:
- MPI_COMM_WORLD (ID prefix 0)
- MPI_COMM_SELF (ID prefix 4)
- MPIR_ICOMM_WORLD (ID prefix 8)
This occurs here in the code: src/mpi/init/initthread.c:206
Allocate a context ID via MPIR_Get_contextid(comm_ptr). This ID is the same across all the disjoint communicators that are created. That is, if MPI_Comm_split is called such that three new communicators are created, the context ID will be the same in all three communicators (although the groups will obviously be different between communicators).
Calls MPIR_Comm_copy, which in turn calls MPIR_Get_contextid over the source communicator. This new context ID is used for the duplicate communicator.
Call MPIR_Get_contextid(local_comm_ptr) to get the context_id for the new communicator. Then the roots of the groups exchange context IDs and broadcast them to the rest of their local groups. This received value serves as the recvcontext_id for the new communicator.
All communicators that result from a single collective split call have the same context IDs (but obviously different groups).
Calls MPIR_Comm_copy, which in turn calls MPIR_Get_intercomm_contextid. Each group generates a context ID via MPIR_Get_contextid. Then the roots exchange that value with each other and broadcast the result to the local group. The value received from the other side becomes the sending context ID (the field named
context_id in the
Allocate a context ID via
MPIR_Get_contextid(comm_ptr) over the connecting communicator. This is the
recvcontext_id for the new intercommunicator.
Then in the root:
- Connect to the port and create a temporary communicator (context ID 4095 (Why?)) to the root on the other side from this connection.
- Exchange global process group size, local communicator size and the context ID determined locally. This is sent via the temporary communicator.
- broadcast the received info on the local communicator
Just the root:
- exchange PG info with the accept side root
- store the received context ID as the context_id for the new intercommunicator.
Just the root:
- synchronize with the remote root
- free the temporary communicator
- barrier over the local communicator
The counterpart to the
connect algorithm above. It is essentially the same except the first step is to accept the connection instead of to initiate it.
This is simply implemented via a PMI_Spawn_multiple followed by an MPIR_Comm_connect/MPIR_Comm_accept under the hood.