This document presents topology-aware communicator splitting in MPI. The feature provides a portable way to create communicators that ensure locality without extending the MPI standard.
MPI_Comm_split_type takes two arguments that control how a communicator is split: a split type and a key. The split type indicates the criterion to split the communicator by, and processes that match under that criterion belong to the same newly created communicator. The key only controls the rank ordering within the new communicator. For example, the split type MPI_COMM_TYPE_SHARED indicates that the communicator should be partitioned into new communicators whose processes share the same shared memory domain. This approach, however, is agnostic to hardware components such as the cache hierarchy.
To facilitate creating communicators at different levels of the shared memory hierarchy, we support user-level hints passed via the info argument to MPI_Comm_split_type. The hints cover on-node as well as off-node hardware components to partition the communicator by.
Topology discovery and CPU binding information are obtained using the hwloc topology library, which is designed for portable topology and cpuset queries across platforms. The hardware topology and the CPU binding are queried through hwloc during MPI_Init and cached for later use.
On-node comm_split refers to splitting the communicator by on-node hardware components such as NUMA or L1 cache domains. The user specifies the hardware component type as a hint using the info key shmem_topo. The values currently supported for this info key for non-I/O hardware components are listed as follows (the values are case sensitive).
We also support I/O device hints such as pci, where the PCI device id must be passed as part of the value. The communicator is then split so that the processes bound to the non-I/O object attached to the PCI device form a new communicator. The supported I/O device hints have the formats described below.
- pci:<bus_id> for processes close to a PCI device identified by its bus id.
- ib[<id>], where id uniquely identifies the InfiniBand device on the node. The id is optional; when it is omitted, all processes "close" to an InfiniBand device are grouped together into a new communicator. InfiniBand devices are generally numbered starting from 0, so a valid hint could be "ib0".
- gpu[<id>] for GPU devices, with the same format as for InfiniBand devices.
- en[<id>] for Ethernet devices, with the same format as for PCI and GPU devices. The id is optional; an example of a valid id is "p22s0f0".
- hfi[<id>] for Intel Omni-Path devices. A valid hint would be of the form "hfi1_0". The id is optional.
- The MPI standard dictates that info hints match across processes, so the same key-value pair must be passed by all processes participating in the comm_split. If the hints do not match, a node-wide communicator is returned.
- The hint values listed above are considered "legal" hints. If a hint that is not in the list is passed, a node-wide communicator is returned.
- A hint in the above list may be passed but not be found in the hardware substrate. Alternatively, the process invoking comm_split may have a bind set that covers hardware resources that are not a subset of the resources covered by the user-provided hint. In either case, the comm_split call returns a NULL communicator.
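The I/O hint formats above can be illustrated with a small classifier. This is a minimal sketch in plain C, not the code used inside the MPI library: the names hint_kind, io_hint, and parse_io_hint are hypothetical, prefixes are assumed to be matched case-sensitively, and the id itself is not validated against the hardware.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a shmem_topo value; illustration only. */
typedef enum { HINT_NOT_IO, HINT_PCI, HINT_IB, HINT_GPU, HINT_EN, HINT_HFI } hint_kind;

typedef struct {
    hint_kind kind;
    const char *id; /* points into the input string; "" when the optional id is omitted */
} io_hint;

/* Classify a hint value by the formats pci:<bus_id>, ib[<id>], gpu[<id>],
 * en[<id>] and hfi[<id>]. Anything else is reported as HINT_NOT_IO. */
static io_hint parse_io_hint(const char *value)
{
    static const struct { const char *prefix; hint_kind kind; } optional_id[] = {
        { "ib", HINT_IB }, { "gpu", HINT_GPU }, { "en", HINT_EN }, { "hfi", HINT_HFI },
    };
    io_hint out = { HINT_NOT_IO, "" };

    if (value == NULL)
        return out;

    /* pci:<bus_id> : the bus id is mandatory, so a bare "pci:" is rejected. */
    if (strncmp(value, "pci:", 4) == 0) {
        if (value[4] != '\0') {
            out.kind = HINT_PCI;
            out.id = value + 4;
        }
        return out;
    }

    /* ib, gpu, en, hfi : the id after the prefix is optional. */
    for (size_t i = 0; i < sizeof optional_id / sizeof optional_id[0]; i++) {
        size_t n = strlen(optional_id[i].prefix);
        if (strncmp(value, optional_id[i].prefix, n) == 0) {
            out.kind = optional_id[i].kind;
            out.id = value + n;
            return out;
        }
    }
    return out;
}
```

With this sketch, parse_io_hint("ib0") yields HINT_IB with id "0", and parse_io_hint("gpu") yields HINT_GPU with an empty id. A value that matches none of the I/O prefixes comes back as HINT_NOT_IO, which a caller would then check against the non-I/O values before falling back to a node-wide communicator.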
Off-node comm_split refers to creating sub-communicators that may span nodes for shared off-node network components. This component is a work in progress.
Use Case Example
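As a minimal sketch of how an on-node hint might be passed, the program below groups the processes close to an InfiniBand device. It assumes an MPI library that honors the shmem_topo info key as described in this document, uses the standard split type MPI_COMM_TYPE_SHARED, and takes "ib0" as the hint value; the variable name near_ib is only illustrative, and the device is assumed to exist on the node.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Build the info object carrying the topology hint.
     * Every participating process must pass the same key-value pair. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "shmem_topo", "ib0");

    /* Split by shared memory domain, refined by the hint.
     * A key of 0 keeps the relative rank order of the parent communicator. */
    MPI_Comm near_ib;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, info, &near_ib);
    MPI_Info_free(&info);

    if (near_ib == MPI_COMM_NULL) {
        /* Per the notes above: the device was not found, or this process's
         * binding is not covered by the hinted hardware resources. */
        printf("world rank %d: no communicator for hint ib0\n", world_rank);
    } else {
        int rank, size;
        MPI_Comm_rank(near_ib, &rank);
        MPI_Comm_size(near_ib, &size);
        printf("world rank %d: rank %d of %d in the ib0 communicator\n",
               world_rank, rank, size);
        MPI_Comm_free(&near_ib);
    }

    MPI_Finalize();
    return 0;
}
```

Checking for MPI_COMM_NULL before using the result matters here, since a legal hint that cannot be satisfied on a given process returns a NULL communicator rather than a node-wide one.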