Difference between revisions of "The Progress Engine"

From Mpich
Jump to: navigation, search
(Nonblocking Collective Schedule List)
Line 14: Line 14:
 
== Send Progress ==
 
== Send Progress ==
 
== LMT Progress ==
 
== LMT Progress ==
== Nonblocking Collective Schedule List ==
+
== Nonblocking Collective Progress ==
The Schedule List
+
The Schedule List of nonblocking collective schedules is a
 
two-level linked list: the first level is a linked list of <code>struct MPIDU_Sched</code>; and each <code>struct MPIDU_Sched</code> is an array of <code>struct MPIDU_Sched_entry</code>. Codes are located in <code>mpid_sched.c</code>. The list is defined as a static variable <code>all_schedules</code> in this file.
 
two-level linked list: the first level is a linked list of <code>struct MPIDU_Sched</code>; and each <code>struct MPIDU_Sched</code> is an array of <code>struct MPIDU_Sched_entry</code>. Codes are located in <code>mpid_sched.c</code>. The list is defined as a static variable <code>all_schedules</code> in this file.
  

Revision as of 15:54, 13 May 2014

The progress engine is a series of lists for MPI internal tasks such as: send, receive, collectives, and so on.

Algorithm Pseudo Code

do {
  /* make progress on receiving */
  /* make progress on sending */
  /* make progress on LMTs */
  /* make progress on NBC schedules */
} while (is_blocking)

Receive Progress

Send Progress

LMT Progress

Nonblocking Collective Progress

The Schedule List of nonblocking collective schedules is a two-level linked list: the first level is a linked list of struct MPIDU_Sched; and each struct MPIDU_Sched is an array of struct MPIDU_Sched_entry. Codes are located in mpid_sched.c. The list is defined as a static variable all_schedules in this file.

Progress-engine.png

Data Structures & APIs

Following are the data structures used in the progress engine and the APIs to access the data structures.

The Schedule List

Functions that Directly Access all_schedules

MPIDU_Sched_are_pending

Read only. Check if all_schedules is empty.

MPID_Sched_next_tag

Read only. Check the tags in all_schedule.

MPID_Sched_start

Append an entry to the schedule list.

MPL_DL_APPEND(all_schedules.head, s);
MPIDU_Sched_progress_state

Process the schedule list, delete a schedule if all its entries have been processed.

/* process the list */
MPL_DL_FOREACH_SAFE(state->head, s, tmp) {
    for (i = s->idx; i < s->num_entries; ++i) {
        /* process entries */
    }
    if (s->idx == s->num_entries) {
        MPL_DL_DELETE(state->head, s);
    }
}

MPIDU_Sched_progress_state is called by MPIDU_Sched_progress

The Schedule Object

MPID_Sched_create

Create a new schedule object.

MPIDU_Sched_add_entry

Add an entry to a schedule.

The Schedule Entries

Each schedule has an array of entries. The data structure of an entry is as follows:

struct MPIDU_Sched_entry {
    enum MPIDU_Sched_entry_type type;
    enum MPIDU_Sched_entry_status status;
    int is_barrier;
    union {
        struct MPIDU_Sched_send send;
        struct MPIDU_Sched_recv recv;
        struct MPIDU_Sched_reduce reduce;
        struct MPIDU_Sched_copy copy;
        /* nop entries have no args */
        struct MPIDU_Sched_cb cb;
    } u;
};

type is used for handling different situations in the progress engine. Different types are called differently depend on which type they are (see #MPIDU_Sched_start_entry and #MPIDU_Sched_progress_state_2 ). The following is a list of entry types:

enum MPIDU_Sched_entry_type {
    MPIDU_SCHED_ENTRY_INVALID_LB = 0,
    MPIDU_SCHED_ENTRY_SEND,
    MPIDU_SCHED_ENTRY_RECV,
    MPIDU_SCHED_ENTRY_REDUCE,
    MPIDU_SCHED_ENTRY_COPY,
    MPIDU_SCHED_ENTRY_NOP,
    MPIDU_SCHED_ENTRY_CB,
    MPIDU_SCHED_ENTRY_INVALID_UB
};

status is used for handling different stages of a schedule entry. SEND and RECV entries will change their status from NOT_STARTED to STARTED, then to COMPLETE. REDUCE, COPY and CB entries will change directly from NOT_STARTED to COMPLETE in MPIDU_Sched_start_entry.

enum MPIDU_Sched_entry_status {
    MPIDU_SCHED_ENTRY_STATUS_NOT_STARTED = 0,
    MPIDU_SCHED_ENTRY_STATUS_STARTED,
    MPIDU_SCHED_ENTRY_STATUS_COMPLETE,
    MPIDU_SCHED_ENTRY_STATUS_FAILED, /* indicates a failure occurred while executing the entry */
    MPIDU_SCHED_ENTRY_STATUS_INVALID /* indicates an invalid entry, or invalid status value */
};

is_barrier is used to control the calling order of the entries in a schedule. The entries marked as is_barrier will not surpass its predecessor. It will control the behavior of a entry in function MPIDU_Sched_continue and MPIDU_Sched_progress_state.

MPIDU_Sched_start_entry

All types are called inside MPIDU_Sched_start_entry. It is called in the following order:

MPID_Sched_start
MPIDU_Sched_continue
MPIDU_Sched_start_entry

For entries with a SCHEDULE/COPY/CB type, their status will be changed from NOT_STARTED to COMPLETE after the call.

For entries with a SEND/RECV type, their status will be changed from NOT_STARTED to STARTED.

MPIDU_Sched_progress_state

Only SEND and RECV are called inside MPIDU_Sched_progress_state because only the entries in these two types has three status instead of two, as mention before.

MPID_Sched_barrier

MPID_Sched_barrier mark its predecessor entry's is_barrier as 1.

Algorithm

The progress engine is a part of non-blocking calls to overlap computation and communication. In order to achieve overlap, a non-blocking calls only adds an schedule to the progress engine and returns immediately.

The question is: when is the progress engine called? See following examples:

  • MPI_Comm_idup

The code snippet is like this:

for (i = 0; i < NUM_ITER; i++)
    MPI_Comm_idup(MPI_COMM_WORLD, &comms[i], &req[i])
MPI_Waitall(NUM_ITER, req, MPI_STATUSES_INGORE);

When MPI_Comm_idup is called, it registers a callback funtion gcn_helper in the progress engine. This function is not called until MPI_Waitall is called. The call stack is:

gcn_helper
MPIDU_Sched_start_entry
MPIDU_Sched_continue
MPIDU_Sched_progress_state
MPIDU_Sched_progress
MPIDI_CH3I_Progress
MPIR_Waitall_impl
MPI_Waitall

Reference