Establishing Socket Connections

From Mpich
Revision as of 16:14, 10 November 2012 by Balaji (talk | contribs)

NOTE: this document is very rough, incomplete, and was created by analyzing existing code rather than writing code to follow this. As such, take any statements in here with a big grain of salt.

Sock conn protocol is a related document, although it was created independently from this document.

Socket Connections

Connections in MPICH are established as necessary, providing better scalability and reducing startup time. In addition, this approach reduces the consumption of Unix file descriptors; there are a limited number of these (as few as 1024 in some systems), and while this may seem like a lot, clusters with more nodes than this are becoming common.
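
To make the file descriptor concern concrete, the following is a minimal sketch of checking whether a fully connected job would fit under the per-process descriptor limit. The helper name `fds_sufficient` and its parameters are hypothetical and are not part of MPICH; only `getrlimit` and `RLIMIT_NOFILE` are standard POSIX.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper, not part of MPICH.
 * Returns 1 if one socket per peer plus `reserved` other descriptors fits
 * under the soft RLIMIT_NOFILE limit, 0 if it does not, and -1 if the
 * limit cannot be queried. */
int fds_sufficient(unsigned long npeers, unsigned long reserved)
{
    struct rlimit rl;
    rlim_t limit;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;                      /* could not read the limit */
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                       /* no soft limit configured */

    limit = rl.rlim_cur;
    if ((rlim_t)npeers > limit)
        return 0;                       /* peers alone exceed the limit */
    /* Subtract first so that npeers + reserved cannot overflow. */
    return (rlim_t)reserved <= limit - (rlim_t)npeers;
}
```

With a soft limit of 1024, a call such as `fds_sufficient(2000, 16)` would return 0, which is the situation that establishing connections only as needed is meant to avoid.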

The connection process is best described with a state diagram and events that cause transitions between states. When the MPI dynamic process features are included, a connection may be made, closed, and reopened, possibly many times, during a computation.

Basic events that change state

  1. Receive connection request
  2. Receive close request
  3. Receive EOF (unexpected close)

Basic States

  1. Unconnected
  2. Connected
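
The two basic states and three basic events above can be sketched as a small transition function. This is an illustration only: the type and function names are hypothetical, and the treatment of a connection request that arrives while already connected is an assumption, since the note leaves that case open.

```c
#include <assert.h>

/* Hypothetical names for the basic states and events listed above. */
typedef enum { ST_UNCONNECTED, ST_CONNECTED } basic_state_t;
typedef enum { EV_CONNECT_REQ, EV_CLOSE_REQ, EV_EOF } basic_event_t;

/* Returns the state that follows `s` when event `ev` is received. */
basic_state_t basic_next_state(basic_state_t s, basic_event_t ev)
{
    assert(s == ST_UNCONNECTED || s == ST_CONNECTED);

    switch (ev) {
    case EV_CONNECT_REQ:
        /* Assumption: a request received while connected leaves the
         * connection in place. */
        return ST_CONNECTED;
    case EV_CLOSE_REQ:
    case EV_EOF:
        /* An orderly close and an unexpected EOF both end up unconnected. */
        return ST_UNCONNECTED;
    }
    return s;   /* unknown event: no change */
}
```

Because a connection can be reopened, `ST_UNCONNECTED` is not a terminal state in this sketch; the same function is applied again when the next connection request arrives.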

ToDo: expand the states to include all transitions, including failure and connection requests received when the connection is not in the unconnected state (e.g., while closing or reopening).

Also ToDo: describe related information (e.g., PMI process group info).

Likely additional states include

  1. wait for connect info
  2. connect handshake
  3. wait for close handshake

A common problem arises when two processes each open a connection to the other at the same time. The socket code assumes that sockets are bidirectional, so each pair of connected processes needs only one socket, not one socket for each member of the pair.

ToDo: refactor the states and state machine into a clear set of VC connection states and connection states.

There are three related objects used during a connection event. They are the connection itself (a structure specific to the communication method, sockets in the case of this note), the virtual connection, and the process group to which the virtual connection belongs. Note that the reference counts on the VC and the process group are independent of whether there is a connection; the reference counts on the VC indicate how many communicators refer to that VC; the reference count on the process group indicates how many VCs are part of some communicator. Thus, none of the connection operations (whether open or close) change the reference count on either a VC or a process group.
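
The relationship among the three objects can be sketched as follows. The struct and function names here are hypothetical stand-ins rather than the actual MPICH definitions; the point is only that attaching or detaching a connection leaves both reference counts untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the three objects. */
typedef struct pg   { int ref_count; } pg_t;      /* process group */
typedef struct conn { int fd; } conn_t;           /* socket connection */
typedef struct vc {
    int     ref_count;  /* number of communicators referring to this VC */
    pg_t   *pg;         /* process group the VC belongs to */
    conn_t *conn;       /* NULL while no connection exists */
} vc_t;

/* Opening a connection: only the conn pointer changes. */
void vc_attach_conn(vc_t *vc, conn_t *c)
{
    assert(vc != NULL && c != NULL);
    vc->conn = c;
}

/* Closing a connection: the VC and its process group stay alive. */
void vc_detach_conn(vc_t *vc)
{
    assert(vc != NULL);
    vc->conn = NULL;
}
```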

The following describes the process of establishing a connection, based on analyzing the code (note that this should have been documented first rather than created in an ad hoc fashion).

Connect side

Note that the VC must already exist (all VCs are created with the associated process group).

An attempt to send a message detects that the VC is in state
VC_STATE_UNCONNECTED (e.g., in ch3_istartmsg).  Then:
    save message in a request added to the send queue (SendQ_enqueue(vc))
       set vc->state to VC_STATE_CONNECTING
       get connection string from PG, get connection info from string
           allocates space for Connection_t and pgid 
                we need the pgid to identify the VC (??)
                (note: instead of pgid, connections should simply 
                 point at the PG object)
                (note: this step should initialize the conn fields)
           creates the socket, sets sock opts, allocates internal sock
           structure, adds the socket to the poll list
           execute system connect(sock)
           set sock state to CONNECTED_RW, CONNECTING, or (on error) 
           returns the socket object (MPIDU_Sock *)
       conn->state = CONN_STATE_CONNECTING
       init conn fields (note: move into allow)
(at this point, wait for the connect to become ready, which will cause a 
SOCK_OP_CONNECT event.  Thus:)

in ch3_progress, in 
    if (event->op == MPIDU_SOCK_OP_CONNECT)
         # Note that when we get a connection request, we don't yet know
         # what VC this connection is for.  We get that information 
         # by being sent the pg_id for the process group and the
         # rank of the VC within that process group.
         (note that this routine checks for event error; not the right place)
         if (conn->state == CONN_STATE_CONNECTING)
              conn->state = CONN_STATE_OPEN_CSEND
              initialize a packet contained within the conn structure to
              PKT_SC_OPEN_REQ, send length of pg_id and pg_rank
              (note: for hetero, these need to be in fixed byteorder, length)
                  also sends the pgid itself
                  (note: perhaps this and related routines should be a 
                  general ch3 routine)
              (note: should move formation of all data to send into a 
               single routine)
              Sock_post_writev() for this (pkt + pg_id)
              return to handle_sock_event

     if (event->op == MPIDU_SOCK_OP_WRITE) 
        if (!conn->send_active) 
           (assumes finishing connection write)
               if (state == CONN_STATE_OPEN_CSEND) 
                   conn->state = CONN_STATE_OPEN_CRECV
                   return to handle_sock_event

     if (event->op == MPIDU_SOCK_OP_READ)
          if (pkt_type == PKT_SC_OPEN_RESP)
             if (pkt->ack is true)
                 conn->state = CONN_STATE_CONNECTED
                 vc->state = VC_STATE_CONNECTED
                     If we had enqueued a send of a request, start it
             else
                 # Close connection because this was closed on the
                 # other end (?), probably because it lost the
                 # head-to-head connection race.
                 conn->state = CONN_STATE_CLOSING
                 conn->vc = NULL 
                 # note: vc state itself is unchanged (discarding this
                 # connection, not the associated vc)
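
The connect-side progression described above can be summarized as a transition function over the connection state. This is a sketch based on the outline, not the MPICH code: the enum values are local stand-ins that reuse the state names from this note, `sock_op_t` and `connect_side_next` are hypothetical, and the check that a read arrives in `CONN_STATE_OPEN_CRECV` is an assumption (the outline only tests the packet type).

```c
#include <assert.h>

/* Local stand-ins that reuse the connection state names from this note. */
typedef enum {
    CONN_STATE_CONNECTING,
    CONN_STATE_OPEN_CSEND,
    CONN_STATE_OPEN_CRECV,
    CONN_STATE_CONNECTED,
    CONN_STATE_CLOSING
} conn_state_t;

/* Hypothetical names for the three socket events used on the connect side. */
typedef enum { OP_CONNECT, OP_WRITE, OP_READ } sock_op_t;

/* `ack` is the ack field of a received PKT_SC_OPEN_RESP; it is ignored
 * for events other than OP_READ. */
conn_state_t connect_side_next(conn_state_t s, sock_op_t op, int ack)
{
    switch (op) {
    case OP_CONNECT:    /* connect() completed: send the open request */
        if (s == CONN_STATE_CONNECTING)
            return CONN_STATE_OPEN_CSEND;
        break;
    case OP_WRITE:      /* open request written: wait for the response */
        if (s == CONN_STATE_OPEN_CSEND)
            return CONN_STATE_OPEN_CRECV;
        break;
    case OP_READ:       /* response read: accepted, or lost the race */
        if (s == CONN_STATE_OPEN_CRECV)
            return ack ? CONN_STATE_CONNECTED : CONN_STATE_CLOSING;
        break;
    }
    return s;           /* event does not apply in this state */
}
```

Note that only the connection state is modeled here; per the outline, a refused connection moves to closing while the VC state is left unchanged.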

Accept side

As part of initialization, an accept is issued on a listener socket.
CH3I_Progress_init sets up the listener socket (note: this code should be 
moved into the ch3u_connect_sock file and global variables 
made local (static) there).  This posts a socket state for OP_ACCEPT.  When
a connect call is made to the listener on this process, the following starts:

    if (event->op == MPIDU_SOCK_OP_ACCEPT)
            Allocate connection (same routine as in connect)
            MPIDU_Sock_accept( conn )
                executes accept, set sock opts 
                (note: there should be a common routine for setting the
                 sock opts on connect and accept)
                 initialize sock and associated poll structures
            initialize conn fields
            conn->state = OPEN_LRECV_PKT
                    adds this buffer to pending reads on this FD
         return from handle
         (note: no code checks for the conn state to be OPEN_LRECV_PKT.)

    if (event->op == MPIDU_SOCK_OP_READ) 
        if (conn->state not recognized) (note: bad code style)
            Sockconn_handle_conn_event( conn ) (conn comes from user_ptr in 
                if (conn->pkt is PKT_SC_OPEN_REQ)
                    (added check that conn state == OPEN_LRECV_PKT)
                    conn->state = OPEN_LRECV_DATA
                    (read the process group id)
                       (a non-blocking read)
                    return to handle_sock_event

     if (event->op == MPIDU_SOCK_OP_READ)
         if (conn->state == OPEN_LRECV_DATA)
                 The conn->pg_id field is now set.
                 (find the corresponding process group.  We are 
                  guaranteed to find the pg)
             the connection pkt still contains the pg_rank for this
             Find the corresponding virtual connection (note that on
             an accept operation, we don't know until this point the
             vc for this connection request)

             (at this point, we need to check for head-to-head connections, 
              since we may already be attempting to form this VC, having 
              originated a connection from this side).
             if (vc->conn == NULL || (mypg < pg) || 
                 (pg == mypg && myrank < pg_rank of conn) ) 
                 not head to head OR winner of head-to-head.  
                 Continue with connection
                 VC state is now initialized to VC_STATE_CONNECTING
                 vc->conn is set to this connection, and the associated
                 sock is also set
                 conn->vc = vc
             In all cases, return an ack:
             conn->state = OPEN_LSEND (note, even when refusing connection)
             conn->pkt = MPIDI_CH3I_PKT_SC_OPEN_RESP
                   pkt.ack = true if accepting, false otherwise

     if (event->op == MPIDU_SOCK_OP_WRITE)
         if (conn->state == OPEN_LSEND)
             finished sending response packet.
             if (conn.pkt->ack is true)
                 (note: this should use the same code as the connect branch)
                 conn->state = CONN_STATE_CONNECTED
                 vc->state = VC_STATE_CONNECTED
             else
                 conn->state = CONN_STATE_CLOSING
                    This primarily enqueues a SOCK_OP_CLOSE event
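
The head-to-head test on the accept side can be isolated into a small predicate. The following is a sketch under stated assumptions: the function name and parameters are hypothetical, and ordering process groups by `strcmp` on their ids is an assumption, since the outline only writes `mypg < pg` without saying how process groups are compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical predicate, not the MPICH routine.
 * Returns true if the incoming connection should be accepted (ack = true),
 * false if it should be refused because this side's own outgoing
 * connection takes precedence. */
bool accept_incoming(bool vc_has_conn,
                     const char *my_pgid, int my_rank,
                     const char *peer_pgid, int peer_rank)
{
    int cmp;

    assert(my_pgid != NULL && peer_pgid != NULL);

    if (!vc_has_conn)
        return true;                    /* not head to head */

    cmp = strcmp(my_pgid, peer_pgid);   /* assumption: order pgs by id */
    if (cmp < 0)
        return true;                    /* mypg < pg */
    return cmp == 0 && my_rank < peer_rank;
}
```

For two distinct processes that connect to each other simultaneously, the comparison is antisymmetric, so exactly one side accepts the incoming connection and the other refuses it. That leaves a single bidirectional socket for the pair.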

On a close sock event: (to do; this isn't ready since it is clear that additional events occur before we'd get to this point)

   case SOCK_OP_CLOSE:
       Sockconn_handle_close_event( conn )
           conn->vc->ch.state = STATE_UNCONNECTED
               (EVENT_TERMINATED is the only event type,
                we should integrate this with the other connection 
                state change events)
               switch (vc->state)
                  VC_STATE_CLOSE_ACKED: (this must be set because the
                                         default generates an error)

Note also that mpid_finalize.c contains some vc close code (most of this, including all code that performs state changes, should not be in this file).

Unresolved questions. There are at least three sets of states:

conn->state - defined in ch3/channels/*/include/mpidi_ch3_impl.h 
vc->state - defined in ch3/include/mpidpre.h 
vc->ch.state - defined in  ch3/channels/*/include/mpidi_ch3_pre.h

Why is there a separate vc state and vc channel state? Are the vc->ch.state values really different from vc->state, and how do we ensure that changes to vc->state and vc->ch.state are made consistently?


Here is a guess at the state transitions in the current implementation.

On accept side:

Enqueuing accept connection
Dequeueing accept connection
    VC_STATE_CLOSE_ACKED   (yes, state was set twice) (? perhaps separate vc?)
   ( this appears to be the connection used to establish the original 
     intercomm; the connection is closed for some reason)    

   ( then a ch3_istartmsg causes the following )
   posting connect and enqueuing request