The Progress Engine

Revision as of 15:50, 13 May 2014

The progress engine maintains a series of lists of pending MPI-internal tasks, such as sends, receives, and collectives.

Algorithm Pseudo Code

do {
  /* make progress on receiving */
  /* make progress on sending */
  /* make progress on LMTs */
  /* make progress on NBC schedules */
} while (is_blocking);
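
The loop above can be sketched as a small, self-contained C program fragment. This is only an illustration, not MPICH code: the names toy_engine and toy_progress are hypothetical, the four counters stand in for the real queues, and the "stop when nothing is left" exit condition is an assumption added so that the blocking case terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters standing in for the real work queues. */
struct toy_engine {
    int pending_recvs;
    int pending_sends;
    int pending_lmts;
    int pending_scheds;
};

/* Retire one unit of work from a queue, if any is left. */
static void drain_one(int *pending)
{
    if (*pending > 0)
        --*pending;
}

/* Run the progress loop; returns the number of passes made.
 * A non-blocking call makes exactly one pass. A blocking call keeps
 * looping until no work is left (assumed exit condition). */
int toy_progress(struct toy_engine *e, bool is_blocking)
{
    int passes = 0;
    bool work_left;

    assert(e != NULL);
    do {
        drain_one(&e->pending_recvs);   /* make progress on receiving */
        drain_one(&e->pending_sends);   /* make progress on sending */
        drain_one(&e->pending_lmts);    /* make progress on LMTs */
        drain_one(&e->pending_scheds);  /* make progress on NBC schedules */
        ++passes;
        work_left = e->pending_recvs + e->pending_sends +
                    e->pending_lmts + e->pending_scheds > 0;
    } while (is_blocking && work_left);
    return passes;
}
```

A non-blocking caller would invoke toy_progress(&e, false) to poke each queue once, while a blocking caller passes true and returns only when the queues are empty.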

Nonblocking Collective Schedule List

The schedule list is a two-level structure: the first level is a linked list of struct MPIDU_Sched, and each struct MPIDU_Sched holds an array of struct MPIDU_Sched_entry. The code is located in mpid_sched.c, where the list is defined as the static variable all_schedules.
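
As a rough sketch of this two-level layout, the following C fragment builds a doubly linked list of schedules, each owning an array of entries. All toy_ names are hypothetical simplifications, not the MPICH definitions; the real code uses the MPL_DL_ list macros rather than hand-written links.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the MPICH structures. */
struct toy_entry {
    int type;                   /* placeholder for the entry type */
};

struct toy_sched {
    size_t num_entries;         /* second level: array of entries */
    struct toy_entry *entries;
    struct toy_sched *next;     /* first level: list links */
    struct toy_sched *prev;
};

struct toy_sched_list {
    struct toy_sched *head;
};

/* Append a schedule at the tail of the first-level list. */
void toy_list_append(struct toy_sched_list *list, struct toy_sched *s)
{
    assert(list != NULL && s != NULL);
    s->next = NULL;
    if (list->head == NULL) {
        s->prev = NULL;
        list->head = s;
        return;
    }
    struct toy_sched *tail = list->head;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = s;
    s->prev = tail;
}

/* Non-zero when at least one schedule is on the list. */
int toy_are_pending(const struct toy_sched_list *list)
{
    return list->head != NULL;
}

/* Sum the entry counts of all schedules on the list. */
size_t toy_count_entries(const struct toy_sched_list *list)
{
    size_t total = 0;
    for (const struct toy_sched *s = list->head; s != NULL; s = s->next)
        total += s->num_entries;
    return total;
}
```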

Progress-engine.png

Data Structures & APIs

The following are the data structures used in the progress engine and the APIs that access them.

The Schedule List

Functions that Directly Access all_schedules

MPIDU_Sched_are_pending

Read-only. Checks whether all_schedules is empty.

MPID_Sched_next_tag

Read-only. Checks the tags in all_schedules.

MPID_Sched_start

Appends a schedule to the schedule list.

MPL_DL_APPEND(all_schedules.head, s);

MPIDU_Sched_progress_state

Processes the schedule list and deletes a schedule once all of its entries have been processed.

/* process the list */
MPL_DL_FOREACH_SAFE(state->head, s, tmp) {
    for (i = s->idx; i < s->num_entries; ++i) {
        /* process entries */
    }
    if (s->idx == s->num_entries) {
        MPL_DL_DELETE(state->head, s);
    }
}

MPIDU_Sched_progress_state is called by MPIDU_Sched_progress.
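
The deletion-while-iterating pattern in the loop above can be sketched without the MPL_DL_ macros as follows. The toy_ names are hypothetical, and the budget parameter is an assumption introduced only so that a schedule can remain unfinished after one pass; the real function advances entries based on request completion.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sched {
    int idx;                    /* index of the first unprocessed entry */
    int num_entries;
    struct toy_sched *next;
    struct toy_sched *prev;
};

struct toy_state {
    struct toy_sched *head;
};

/* Remove s from the doubly linked list rooted at st->head. */
static void toy_unlink(struct toy_state *st, struct toy_sched *s)
{
    if (s->prev != NULL)
        s->prev->next = s->next;
    else
        st->head = s->next;
    if (s->next != NULL)
        s->next->prev = s->prev;
    s->next = s->prev = NULL;
}

/* One pass over the list: process at most `budget` entries of each
 * schedule and remove every schedule that has no entries left.
 * Returns the number of schedules removed. */
int toy_progress_state(struct toy_state *st, int budget)
{
    int removed = 0;
    struct toy_sched *s, *tmp;

    assert(st != NULL && budget >= 0);
    for (s = st->head; s != NULL; s = tmp) {
        tmp = s->next;          /* saved first, so unlinking s is safe */
        for (int i = 0; i < budget && s->idx < s->num_entries; ++i)
            ++s->idx;           /* stand-in for processing one entry */
        if (s->idx == s->num_entries) {
            toy_unlink(st, s);
            ++removed;
        }
    }
    return removed;
}
```

Saving the successor before touching the current node is the same idea that the _SAFE variant of the list iteration macro provides.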

The Schedule Object

MPID_Sched_create

Create a new schedule object.

MPIDU_Sched_add_entry

Add an entry to a schedule.

The Schedule Entries

Each schedule has an array of entries. The data structure of an entry is as follows:

struct MPIDU_Sched_entry {
    enum MPIDU_Sched_entry_type type;
    enum MPIDU_Sched_entry_status status;
    int is_barrier;
    union {
        struct MPIDU_Sched_send send;
        struct MPIDU_Sched_recv recv;
        struct MPIDU_Sched_reduce reduce;
        struct MPIDU_Sched_copy copy;
        /* nop entries have no args */
        struct MPIDU_Sched_cb cb;
    } u;
};

type determines how the progress engine handles an entry. Entries are processed differently depending on their type (see the MPIDU_Sched_start_entry and MPIDU_Sched_progress_state subsections under The Schedule Entries). The following is the list of entry types:

enum MPIDU_Sched_entry_type {
    MPIDU_SCHED_ENTRY_INVALID_LB = 0,
    MPIDU_SCHED_ENTRY_SEND,
    MPIDU_SCHED_ENTRY_RECV,
    MPIDU_SCHED_ENTRY_REDUCE,
    MPIDU_SCHED_ENTRY_COPY,
    MPIDU_SCHED_ENTRY_NOP,
    MPIDU_SCHED_ENTRY_CB,
    MPIDU_SCHED_ENTRY_INVALID_UB
};

status tracks the stage a schedule entry is in. SEND and RECV entries change their status from NOT_STARTED to STARTED, and then to COMPLETE. REDUCE, COPY, and CB entries change directly from NOT_STARTED to COMPLETE in MPIDU_Sched_start_entry.

enum MPIDU_Sched_entry_status {
    MPIDU_SCHED_ENTRY_STATUS_NOT_STARTED = 0,
    MPIDU_SCHED_ENTRY_STATUS_STARTED,
    MPIDU_SCHED_ENTRY_STATUS_COMPLETE,
    MPIDU_SCHED_ENTRY_STATUS_FAILED, /* indicates a failure occurred while executing the entry */
    MPIDU_SCHED_ENTRY_STATUS_INVALID /* indicates an invalid entry, or invalid status value */
};

is_barrier controls the order in which the entries of a schedule are called. An entry marked with is_barrier will not overtake its predecessor. The flag affects how an entry is handled in MPIDU_Sched_continue and MPIDU_Sched_progress_state.

MPIDU_Sched_start_entry

Entries of all types are handled inside MPIDU_Sched_start_entry. It is reached through the following chain of calls:

MPID_Sched_start
MPIDU_Sched_continue
MPIDU_Sched_start_entry

For entries of type REDUCE/COPY/CB, the status changes from NOT_STARTED to COMPLETE after the call.

For entries of type SEND/RECV, the status changes from NOT_STARTED to STARTED.

MPIDU_Sched_progress_state

Only SEND and RECV entries are handled inside MPIDU_Sched_progress_state, because only entries of these two types have three statuses instead of two, as mentioned before.
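
The two-step and three-step status transitions described in the last two subsections can be sketched like this. The toy_ names are hypothetical, request_done is an assumed stand-in for testing the underlying request, and treating NOP like the other immediately completing types is an assumption, since the text above does not mention it.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_SEND, TOY_RECV, TOY_REDUCE, TOY_COPY, TOY_NOP, TOY_CB };
enum toy_status { TOY_NOT_STARTED, TOY_STARTED, TOY_COMPLETE };

struct toy_entry {
    enum toy_type type;
    enum toy_status status;
};

/* Start an entry: SEND/RECV only become STARTED, every other type
 * finishes its work inside the call and becomes COMPLETE. */
void toy_start_entry(struct toy_entry *e)
{
    assert(e != NULL && e->status == TOY_NOT_STARTED);
    switch (e->type) {
    case TOY_SEND:
    case TOY_RECV:
        e->status = TOY_STARTED;    /* completed later by the progress pass */
        break;
    case TOY_REDUCE:
    case TOY_COPY:
    case TOY_NOP:                   /* assumption: NOP completes immediately */
    case TOY_CB:
        e->status = TOY_COMPLETE;
        break;
    }
}

/* Progress pass: only a STARTED entry (that is, SEND/RECV) can move on,
 * and only once its request has finished. */
void toy_progress_entry(struct toy_entry *e, int request_done)
{
    assert(e != NULL);
    if (e->status == TOY_STARTED && request_done)
        e->status = TOY_COMPLETE;
}
```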

MPID_Sched_barrier

MPID_Sched_barrier sets is_barrier to 1 on the entry that precedes it.
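
One way to picture the effect of is_barrier is the sketch below. It assumes that a barrier-marked entry stops later entries from starting until it and everything before it have completed; that reading, the toy_ names, and the is_async flag (1 for SEND/RECV-like entries, 0 for entries that complete when started) are all assumptions for illustration, not the MPICH implementation.

```c
#include <assert.h>
#include <stddef.h>

enum toy_status { TOY_NOT_STARTED, TOY_STARTED, TOY_COMPLETE };

struct toy_entry {
    int is_async;               /* 1: SEND/RECV-like, 0: completes on start */
    int is_barrier;
    enum toy_status status;
};

struct toy_sched {
    size_t idx;                 /* first entry that is not yet complete */
    size_t num_entries;
    struct toy_entry *entries;
};

/* Mark the most recently added entry as a barrier. */
void toy_barrier(struct toy_sched *s)
{
    assert(s != NULL && s->num_entries > 0);
    s->entries[s->num_entries - 1].is_barrier = 1;
}

/* Start entries in order and return how many were started by this call. */
size_t toy_continue(struct toy_sched *s)
{
    size_t started = 0;

    assert(s != NULL);
    for (size_t i = s->idx; i < s->num_entries; ++i) {
        struct toy_entry *e = &s->entries[i];
        if (e->status == TOY_NOT_STARTED) {
            e->status = e->is_async ? TOY_STARTED : TOY_COMPLETE;
            ++started;
        }
        if (i == s->idx && e->status == TOY_COMPLETE)
            ++s->idx;
        /* Stop at a barrier while it, or anything before it, is outstanding. */
        if (e->is_barrier && (e->status != TOY_COMPLETE || s->idx != i + 1))
            break;
    }
    return started;
}
```

In this sketch, two asynchronous entries followed by a barrier both start on the first call, while anything added after the barrier waits until both have completed.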

Algorithm

The progress engine is what allows non-blocking calls to overlap computation and communication. To achieve this overlap, a non-blocking call only adds a schedule to the progress engine and returns immediately.

The question is: when is the progress engine called? See the following example:

  • MPI_Comm_idup

The code snippet looks like this:

for (i = 0; i < NUM_ITER; i++)
    MPI_Comm_idup(MPI_COMM_WORLD, &comms[i], &req[i]);
MPI_Waitall(NUM_ITER, req, MPI_STATUSES_IGNORE);

When MPI_Comm_idup is called, it registers a callback function, gcn_helper, in the progress engine. This function is not invoked until MPI_Waitall is called. The call stack is:

gcn_helper
MPIDU_Sched_start_entry
MPIDU_Sched_continue
MPIDU_Sched_progress_state
MPIDU_Sched_progress
MPIDI_CH3I_Progress
MPIR_Waitall_impl
MPI_Waitall
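
The "register now, run at wait time" behaviour shown by this call stack can be imitated with a small stand-alone sketch. Nothing here is MPI or MPICH API: toy_icall, toy_waitall, and toy_mark_done are hypothetical names, and the fixed-size callback table is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CB 8

typedef void (*toy_cb_fn)(void *arg);

/* Hypothetical engine: a table of callbacks waiting to be run. */
struct toy_engine {
    size_t count;
    toy_cb_fn fn[TOY_MAX_CB];
    void *arg[TOY_MAX_CB];
};

/* Non-blocking call: records the callback and returns immediately.
 * Returns 0 on success, -1 if the table is full. */
int toy_icall(struct toy_engine *e, toy_cb_fn fn, void *arg)
{
    assert(e != NULL && fn != NULL);
    if (e->count == TOY_MAX_CB)
        return -1;
    e->fn[e->count] = fn;
    e->arg[e->count] = arg;
    ++e->count;
    return 0;
}

/* Wait call: drives the engine, which is when the callbacks finally run.
 * Returns the number of callbacks invoked. */
size_t toy_waitall(struct toy_engine *e)
{
    assert(e != NULL);
    size_t ran = e->count;
    for (size_t i = 0; i < e->count; ++i)
        e->fn[i](e->arg[i]);
    e->count = 0;
    return ran;
}

/* Example callback: flags the int it is given as done. */
void toy_mark_done(void *arg)
{
    *(int *)arg = 1;
}
```

Calling toy_icall several times leaves every flag untouched; the flags flip only inside toy_waitall, mirroring how gcn_helper runs under MPI_Waitall rather than under MPI_Comm_idup.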

Reference

  • Making_MPICH_Thread_Safe#The_Progress_Engine