Establishing Socket Connections
NOTE: this document is very rough, incomplete, and was created by analyzing existing code rather than writing code to follow this. As such, take any statements in here with a big grain of salt.
Sock conn protocol is a related document, although it was created independently of this one.
Socket Connections
Connections in MPICH2 are established as necessary, providing better scalability and reducing startup time. In addition, this approach reduces the consumption of Unix file descriptors; there are a limited number of these (as few as 1024 in some systems), and while this may seem like a lot, clusters with more nodes than this are becoming common.
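For reference, the per-process descriptor limit can be checked at run time. A minimal sketch (ordinary POSIX, not MPICH2 code):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Print the soft and hard limits on open file descriptors for this
       process; the soft limit is often 1024 by default. */
    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft limit = %llu, hard limit = %llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
        return 0;
    }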
The connection process is best described with a state diagram and the events that cause transitions between states (an illustrative sketch follows the state lists below). When the MPI dynamic process features are included, a connection may be made, closed, and reopened, possibly many times during a computation.
Basic events that change state
- Receive connection request
- Receive close request
- Receive EOF (unexpected close)
Basic States
- Unconnected
- Connected
ToDo: expand the states to include all transitions, including failures and connection requests received when the connection is not in the unconnected state (e.g., while closing or reopening).
Also ToDo: describe related information (e.g., PMI process group info).
Likely additional states include
- wait for connect info
- connect handshake
- wait for close handshake
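As an illustration only (none of these identifiers are the actual MPICH2 names), the basic events and the states listed above could be expressed as a small state machine along these lines:

    /* Illustrative sketch, not the MPICH2 code: hypothetical names that
       combine the basic and likely-additional states with the basic
       events.  The real logic is spread across the progress engine and
       the sockconn routines described below. */
    typedef enum {
        EX_STATE_UNCONNECTED,
        EX_STATE_WAIT_CONNECT_INFO,
        EX_STATE_CONNECT_HANDSHAKE,
        EX_STATE_CONNECTED,
        EX_STATE_WAIT_CLOSE_HANDSHAKE
    } ex_conn_state_t;

    typedef enum {
        EX_EVENT_CONNECT_REQUEST,   /* receive connection request */
        EX_EVENT_CLOSE_REQUEST,     /* receive close request */
        EX_EVENT_EOF                /* unexpected close */
    } ex_conn_event_t;

    static ex_conn_state_t ex_transition(ex_conn_state_t s, ex_conn_event_t e)
    {
        switch (s) {
        case EX_STATE_UNCONNECTED:
            if (e == EX_EVENT_CONNECT_REQUEST)
                return EX_STATE_CONNECT_HANDSHAKE;
            break;
        case EX_STATE_CONNECTED:
            if (e == EX_EVENT_CLOSE_REQUEST)
                return EX_STATE_WAIT_CLOSE_HANDSHAKE;
            if (e == EX_EVENT_EOF)
                return EX_STATE_UNCONNECTED;
            break;
        default:
            break;
        }
        return s;   /* remaining (state, event) pairs are part of the ToDo above */
    }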
A common problem is that of two processes each opening a connection to the other. The socket code assumes that sockets are bidirectional, so only one socket is needed for each pair of connected processes, not one socket for each member of the pair.
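The resolution has to be deterministic so that both processes agree on which connection survives. A hypothetical sketch of that tie-break (the actual test used on the accept side appears in the walkthrough below):

    #include <string.h>

    /* Hypothetical sketch, not the MPICH2 code: when two processes connect
       to each other at the same time, both sides must apply the same rule,
       based only on the (process group id, rank) of the two processes, to
       decide whose connection is kept. */
    static int accept_side_wins(const char *my_pg_id, int my_rank,
                                const char *remote_pg_id, int remote_rank,
                                int connection_already_pending)
    {
        int cmp = strcmp(my_pg_id, remote_pg_id);

        if (!connection_already_pending)
            return 1;                     /* no race: accept the connection */
        if (cmp < 0)
            return 1;                     /* smaller pg id wins the race */
        if (cmp == 0 && my_rank < remote_rank)
            return 1;                     /* same pg: smaller rank wins */
        return 0;                         /* refuse; the connection we
                                             initiated will be used instead */
    }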
ToDo: refactor the states and state machine into a clear set of VC connection states and connection states.
There are three related objects used during a connection event: the connection itself (a structure specific to the communication method, sockets in the case of this note), the virtual connection, and the process group to which the virtual connection belongs. Note that the reference counts on the VC and the process group are independent of whether there is a connection; the reference count on the VC indicates how many communicators refer to that VC, and the reference count on the process group indicates how many VCs are part of some communicator. Thus, none of the connection operations (whether open or close) changes the reference count on either a VC or a process group.
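A rough sketch of how the three objects relate (the field and type names here are illustrative, not the actual MPICH2 declarations):

    /* Illustrative sketch only; the real structures live in the ch3 and
       sock-channel headers listed near the end of this note. */
    struct ex_process_group {
        char *pg_id;                     /* identifies the process group */
        int   size;                      /* one VC per process in the group */
        int   refcount;                  /* # of VCs that are part of some communicator */
        struct ex_virt_conn *vc_table;
    };

    struct ex_virt_conn {
        int   state;                     /* VC_STATE_UNCONNECTED, _CONNECTING, ... */
        int   refcount;                  /* # of communicators that refer to this VC */
        struct ex_process_group *pg;     /* owning process group */
        int   pg_rank;                   /* rank of this VC within pg */
        struct ex_sock_conn *conn;       /* NULL unless a socket connection exists */
    };

    struct ex_sock_conn {                /* specific to the sock method */
        int   state;                     /* CONN_STATE_CONNECTING, _OPEN_CSEND, ... */
        struct ex_virt_conn *vc;         /* may be NULL until the open handshake
                                            identifies the VC (accept side) */
        /* plus the MPIDU_Sock object, packet buffer, pg_id buffer, ... */
    };

Opening or closing an ex_sock_conn changes only vc->conn and conn->vc; neither reference count moves, which is the independence described above.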
The following describes the process of establishing a connection, based on analyzing the code (note that this should have been documented first rather than created in an ad hoc fashion).
Connect side
Note that the VC must already exist (all VCs are created with the associated process group).
An attempt to send a message detects that the VC is in state
VC_STATE_UNCONNECTED (e.g., in ch3_istartmsg). Then:

  save the message in a request added to the send queue (SendQ_enqueue(vc))
  VC_post_connect(vc)
    set vc->state to VC_STATE_CONNECTING
    get the connection string from the PG, get connection info from the string
    Connection_alloc(&conn)
      allocates space for the Connection_t and the pgid
      we need the pgid to identify the VC (??)
      (note: instead of the pgid, connections should simply point at the PG object)
      (note: this step should initialize the conn fields)
    Sock_post_connect_ifaddr(conn)
      creates the socket, sets sock opts, allocates the internal sock
      structure, adds the socket to the poll list
      executes the system connect(sock)
      sets the sock state to CONNECTED_RW, CONNECTING, or (on error) DISCONNECTED
      returns the socket object (MPIDU_Sock *)
    conn->state = CONN_STATE_CONNECTING
    init conn fields (note: move into alloc)

(At this point, wait for the connect to become ready, which will cause a
SOCK_OP_CONNECT event. Thus:)

  in ch3_progress, in Handle_sock_event
    if (event->op == MPIDU_SOCK_OP_CONNECT)
      # Note that when we get a connection request, we don't yet know
      # what VC this connection is for.  We get that information by
      # being sent the pg_id for the process group and the rank of the
      # VC within that process group.
      Sockconn_handle_connect_event()
        (note that this routine checks for event errors; not the right place)
        if (conn->state == CONN_STATE_CONNECTING)
          conn->state = CONN_STATE_OPEN_CSEND
          initialize a packet contained within the conn structure to
          PKT_SC_OPEN_REQ; send the length of the pg_id and the pg_rank
          (note: for hetero, these need to be in fixed byte order and length)
          connection_post_send_pkt_and_pgid
            also sends the pgid itself
            (note: perhaps this and related routines should be general ch3 routines)
            (note: should move formation of all data to send into a single routine)
            Sock_post_writev() for this (pkt + pg_id)
      return to handle_sock_event

    if (event->op == MPIDU_SOCK_OP_WRITE)
      if (!conn->send_active)   (assumes finishing the connection write)
        Sockconn_handle_connwrite()
          if (conn->state == CONN_STATE_OPEN_CSEND)
            conn->state = CONN_STATE_OPEN_CRECV
            connection_post_recv_pkt
      return to handle_sock_event

    if (event->op == MPIDU_SOCK_OP_READ)
      if (pkt_type == PKT_SC_OPEN_RESP)
        if (pkt->ack is true)
          conn->state = CONN_STATE_CONNECTED
          vc->state   = VC_STATE_CONNECTED
          connection_post_recv_pkt(conn)
          connection_post_sendq_req(conn)
            if we had enqueued a send of a request, start it
        else
          # Close the connection because it was closed on the other
          # end (?), probably because it lost the head-to-head
          # connection race.
          conn->state = CONN_STATE_CLOSING
          Sock_post_close(conn->sock)
          conn->vc = NULL
          # note: the vc state itself is unchanged (we are discarding this
          # connection, not the associated vc)
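The data sent for the open request (the packet plus the pg_id string, posted with Sock_post_writev) could look roughly like the following sketch; the names and the exact packet layout are hypothetical, not the MPIDI_CH3 packet definitions:

    #include <string.h>
    #include <sys/uio.h>

    /* Hypothetical sketch: a small fixed-format packet (type, pg_id length,
       pg_rank) followed by the pg_id string itself, gathered into a single
       writev.  The real code fills an MPIDI_CH3 packet inside the conn
       structure and posts it with MPIDU_Sock_post_writev. */
    struct ex_open_req_pkt {
        int type;            /* e.g. PKT_SC_OPEN_REQ */
        int pg_id_len;       /* strlen(pg_id) + 1; fixed byte order for hetero */
        int pg_rank;         /* rank of the connecting process within its pg */
    };

    static int ex_post_open_req(int fd, const char *pg_id, int pg_rank,
                                struct ex_open_req_pkt *pkt)
    {
        struct iovec iov[2];

        pkt->type      = 0;  /* placeholder for PKT_SC_OPEN_REQ */
        pkt->pg_id_len = (int) strlen(pg_id) + 1;
        pkt->pg_rank   = pg_rank;

        iov[0].iov_base = pkt;
        iov[0].iov_len  = sizeof(*pkt);
        iov[1].iov_base = (void *) pg_id;
        iov[1].iov_len  = (size_t) pkt->pg_id_len;

        /* In the real code the write is posted non-blocking and completes
           later, generating the SOCK_OP_WRITE event handled above. */
        return (int) writev(fd, iov, 2);
    }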
Accept side
As part of initialization, an accept is issued on a listener socket:

  CH3I_Progress_init sets up the listener socket
    (note: this code should be moved into the ch3u_connect_sock file and the
    global variables made local (static) there)
    This posts a socket state for OP_ACCEPT.

When a connect call is made to the listener on this process, the following
sequence starts:

  Progress_handle_sock_event
    if (event->op == MPIDU_SOCK_OP_ACCEPT)
      Sockconn_handle_accept_event()
        allocate a connection (same routine as on the connect side)
        MPIDU_Sock_accept( conn )
          executes accept, sets sock opts
          (note: there should be a common routine for setting the sock opts
          on connect and accept)
          initializes the sock and associated poll structures
        initialize conn fields
        conn->state = OPEN_LRECV_PKT
        connection_post_recv_pkt
          Sock_post_read adds this buffer to the pending reads on this FD
      return from handle
      (note: no code checks for the conn state to be OPEN_LRECV_PKT)

  Progress_handle_sock_event
    if (event->op == MPIDU_SOCK_OP_READ)
      if (conn->state not recognized)   (note: bad code style)
        Sockconn_handle_conn_event( conn )   (conn comes from user_ptr in the event)
          if (conn->pkt is PKT_SC_OPEN_REQ)
            (added check that conn->state == OPEN_LRECV_PKT)
            conn->state = OPEN_LRECV_DATA
            (read the process group id)
            Sock_post_read(pg_id, pkt->pg_id_len)   (a non-blocking read)
      return to handle_sock_event

  Progress_handle_sock_event
    if (event->op == MPIDU_SOCK_OP_READ)
      if (conn->state == OPEN_LRECV_DATA)
        Sockconn_handle_connopen_event(conn)
          The conn->pg_id field is now set.
          (find the corresponding process group; we are guaranteed to find the pg)
          MPIDI_PG_Find(conn->pg_id, &pg)
          (the connection pkt still contains the pg_rank for this connection)
          find the corresponding virtual connection (note that on an accept
          operation, we don't know until this point which vc this connection
          request is for)
          MPIDI_PG_Get_vc(pg, pg_rank, &vc)
          (at this point, we need to check for head-to-head connections, since
          we may already be attempting to form this VC, having originated a
          connection from this side)
          if (vc->conn == NULL || (mypg < pg) ||
              (pg == mypg && myrank < pg_rank of conn))
            not head to head, or the winner of the head-to-head race; continue
            with the connection:
              the vc state is now initialized to VC_STATE_CONNECTING
              vc->conn is set to this connection, and the associated sock is
              also set
              conn->vc = vc
          In all cases, return an ack:
            conn->state = OPEN_LSEND   (note: even when refusing the connection)
            conn->pkt = MPIDI_CH3I_PKT_SC_OPEN_RESP
            pkt.ack = true if accepting, false otherwise
            Sock_post_write(pkt)

    if (event->op == MPIDU_SOCK_OP_WRITE)
      if (conn->state == OPEN_LSEND)
        finished sending the response packet
        if (conn->pkt.ack is true)
          (note: this should use the same code as the connect branch)
          conn->state = CONN_STATE_CONNECTED
          connection_post_recv_pkt
          connection_post_sendq_req
          vc->state = VC_STATE_CONNECTED
        else
          conn->state = CONN_STATE_CLOSING
          Sock_post_close(conn->sock)
            this primarily enqueues a SOCK_OP_CLOSE event
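The "sets up the listener socket" step at the start of this walkthrough amounts to ordinary POSIX listener setup; a generic sketch (not the MPIDU_Sock code) is:

    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Generic sketch: create a socket, bind it to an ephemeral port, listen,
       and make it non-blocking so the progress engine can poll it for
       accept events.  In MPICH2 the resulting port becomes part of the
       connection string that other processes obtain (e.g., through PMI). */
    static int ex_setup_listener(unsigned short *port_out)
    {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);
        int flags, fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = 0;                /* let the OS pick a port */

        flags = fcntl(fd, F_GETFL, 0);
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            listen(fd, SOMAXCONN) < 0 ||
            getsockname(fd, (struct sockaddr *) &addr, &len) < 0 ||
            fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            close(fd);
            return -1;
        }
        *port_out = ntohs(addr.sin_port);        /* advertised to other processes */
        return fd;                               /* added to the poll set */
    }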
On a close sock event (ToDo: this is not complete, since additional events clearly occur before this point is reached):
  case SOCK_OP_CLOSE:
    Sockconn_handle_close_event( conn )
      conn->vc->ch.state = STATE_UNCONNECTED
      Handle_connection(vc, EVENT_TERMINATED)
        (EVENT_TERMINATED is the only event type; we should integrate this
        with the other connection state change events)
        switch (vc->state)
          case VC_STATE_CLOSE_ACKED:
            (this must be set because the default case generates an error)
Note also that mpid_finalize.c contains some vc close code (most of this, including all code that performs state changes, should not be in that file).
Unresolved questions. There are at least three sets of states:
- conn->state - defined in ch3/channels/*/include/mpidi_ch3_impl.h
- vc->state - defined in ch3/include/mpidpre.h
- vc->ch.state - defined in ch3/channels/*/include/mpidi_ch3_pre.h
Why is there a separate vc state and vc channel state? Are the vc->ch.state values really different from vc->state, and how do we ensure that changes to vc->state and vc->ch.state are made consistently?
OLD
Here is a guess at the state transitions in the current implementation.
On the accept side:

  CONN_STATE_OPEN_LRECV_PKT
  VC_STATE_CONNECTING
  CONN_STATE_OPEN_LSEND
  Enqueuing accept connection
  CONN_STATE_CONNECTED
  VC_STATE_CONNECTED
  Dequeueing accept connection
  VC_STATE_LOCAL_CLOSE
  VC_STATE_CLOSE_ACKED
  VC_STATE_CLOSE_ACKED   (yes, the state was set twice)  (? perhaps a separate vc?)
  CONN_STATE_CLOSING
  VC_STATE_UNCONNECTED
  VC_STATE_INACTIVE
  CONN_STATE_CLOSED

  (this appears to be the connection used to establish the original
  intercomm; the connection is closed for some reason)

  (then a ch3_istartmsg causes the following)

  posting connect and enqueuing request
  VC_STATE_CONNECTING
  CONN_STATE_CONNECTING
  CONN_STATE_OPEN_CSEND
  CONN_STATE_OPEN_CRECV
  CONN_STATE_OPEN_LRECV_DATA