Frequently Asked Questions

From Mpich
Revision as of 19:37, 2 January 2010 by Balaji (talk | contribs) (Q: What is MPICH2?)


General Information

Q: What is MPICH2?

A: MPICH2 is a freely available, portable implementation of MPI, the Standard for message-passing libraries. It implements MPI-1, MPI-2, MPI-2.1 and MPI-2.2.

Q: What does MPICH stand for?

A: MPI stands for Message Passing Interface. The CH comes from Chameleon, the portability layer used in the original MPICH to provide portability to the existing message-passing systems.

Q: Can MPI be used to program multicore systems?

A: There are two common ways to use MPI with multicore processors or multiprocessor nodes:

  1. Use one MPI process per core (here, a core is defined as a program counter and some set of arithmetic, logic, and load/store units).
  2. Use one MPI process per node (here, a node is defined as a collection of cores that share a single address space). Use threads or compiler-provided parallelism to exploit the multiple cores. OpenMP may be used with MPI; the loop-level parallelism of OpenMP may be used with any implementation of MPI (you do not need an MPI that supports MPI_THREAD_MULTIPLE when threads are used only for computational tasks). This is sometimes called the hybrid programming model.

Building MPICH2

Q: What is the difference between the MPD and SMPD process managers?

A: MPD is the default process manager for MPICH2 on Unix platforms. It is written in Python. SMPD is the primary process manager for MPICH2 on Windows. It is also used for running on a combination of Windows and Linux machines. It is written in C.

Q: Do I have to configure/make/install MPICH2 each time for each compiler I use?

A: No, in many cases you can build MPICH2 using one set of compilers and then use the libraries (and compilation scripts) with other compilers. However, this depends on the compilers producing compatible object files. Specifically, the compilers must

  1. Support the same basic datatypes with the same sizes. For example, the C compilers should use the same sizes for long long and long double.
  2. Map the names of routines in the source code to names in the object files in the same way. This can be a problem for Fortran and C++ compilers, though you can often force the Fortran compilers to use the same name mapping. More specifically, most Fortran compilers map names in the source code into all lower-case with one or two underscores appended to the name. To use the same MPICH2 library with all Fortran compilers, those compilers must make the same name mapping. There is one exception to this that is described below.
  3. Perform the same layout for C structures. The C language does not specify how structures are laid out in memory. For 100% compatibility, all compilers must follow the same rules. However, if you do not use any of the datatypes intended for MPI_MINLOC or MPI_MAXLOC (such as MPI_DOUBLE_INT), and you do not rely on the MPICH2 library to set the extent of a type created with MPI_Type_struct or MPI_Type_create_struct, you can often ignore this requirement.
  4. Require the same additional runtime libraries. Not all compilers will implement the same version of Unix, and some routines that MPICH2 uses may be present in only some of the run time libraries associated with specific compilers.

The above may seem like a stringent set of requirements, but in practice, many systems and compiler sets meet these needs, if for no other reason than that any software built with multiple libraries will have requirements similar to those of MPICH2 for compatibility.

If your compilers are completely compatible, down to the runtime libraries, you may use the compilation scripts (mpicc etc.) by either specifying the compiler on the command line, e.g.

mpicc -cc=icc -c foo.c

or with the environment variables MPICH_CC etc. (this example assumes C shell syntax):

setenv MPICH_CC icc
mpicc -c foo.c

If the compiler is compatible except for the runtime libraries, then this same format works as long as a configuration file that describes the necessary runtime libraries is created and placed into the appropriate directory (the "sysconfdir" directory in configure terms). See the installation manual for more details.

In some cases, MPICH2 is able to build the Fortran interfaces in a way that supports multiple mappings of names from the Fortran source code to the object file. This is done by using the "multiple weak symbol" support in some environments. For example, when using gcc under Linux, this is the default.

Q: How do I configure to use the Absoft Fortran compilers?

A: You can find build instructions on the Absoft web site at the bottom of the page

Q: When I configure MPICH2, I get a message about FD_ZERO and the configure aborts.

A: FD_ZERO is part of the support for the select calls (see "man select" or "man 2 select" on Linux and many other Unix systems). What this means is that your system (probably a Mac) has a broken version of the select call and related data types. This is an OS bug; the only repair is to update the OS to get past this bug. This test was added specifically to detect this error; if there was an easy way to work around it, we would have included it (we don't just implement FD_ZERO ourselves because we don't know what else is broken in this implementation of select).

If this configure works with gcc but not with xlc, then the problem is with the include files that xlc is using; since this is an OS call (even if emulated), all compilers should be using consistent if not identical include files. In this case, you may need to update xlc.

Q: When I use the g95 Fortran compiler on a 64-bit platform, some of the tests fail.

A: The g95 compiler incorrectly defines the default Fortran integer as a 64-bit integer while defining Fortran reals as 32-bit values (the Fortran standard requires that INTEGER and REAL be the same size). This was apparently done to allow a Fortran INTEGER to hold the value of a pointer, rather than requiring the programmer to select an INTEGER of a suitable KIND. To force the g95 compiler to correctly implement the Fortran standard, use the -i4 flag. For example, set the environment variable F90FLAGS before configuring MPICH2:

setenv F90FLAGS "-i4"

G95 users should note that there are (at this writing) two distributions of g95 for 64-bit Linux platforms. One uses 32-bit integers and reals (and conforms to the Fortran standard) and one uses 64-bit integers and 32-bit reals. We recommend using the one that conforms to the standard. Note that the standard specifies the ratio of sizes, not the absolute sizes, so a Fortran 95 compiler that used 64 bits for both INTEGER and REAL would also conform to the Fortran standard; however, such a compiler would need to use 128 bits for DOUBLE PRECISION quantities.

Q: Make fails with errors such as these:
sock.c:8:24: mpidu_sock.h: No such file or directory
In file included from sock.c:9:
mpidpre.h: No such file or directory
In file included from sock.c:9:
error: syntax error before "MPID_VCRT"
../../../../include/mpiimpl.h:1150: warning: no semicolon at end of struct or union

A: Check whether you have set the environment variable CPPFLAGS. CPPFLAGS holds flags for the C preprocessor, not for the C++ compiler. If it is set, unset it and use CXXFLAGS instead. Then rerun configure and make.

Q: When building the ssm channel, I get this error:
mpidu_process_locks.h:234:2: error: #error *** No atomic memory operation specified to implement busy locks ***

A: The ssm channel does not work on all platforms because it uses special interprocess locks (often written in assembly) that may not work with some compilers or machine architectures. It works on Linux with gcc, Intel, and Pathscale compilers on various Intel architectures. It also works in Windows and Solaris environments.

Q: When using the Intel Fortran 90 compiler (version 9), the make fails with errors when compiling statements that reference MPI_ADDRESS_KIND.

A: Check the output of the configure step. If configure claims that ifort is a cross compiler, the likely problem is that programs compiled and linked with ifort cannot be run because of a missing shared library. Try to compile and run the following program (named conftest.f90):

program conftest
integer, dimension(10) :: n
end program conftest

If this program fails to run, then either your installation of ifort has an error or you need to add additional values to your environment variables (such as LD_LIBRARY_PATH). Check your installation documentation for the ifort compiler. Users of version 9 of ifort have reported problems of this kind.

If you do not need Fortran 90, you can configure with --disable-f90.

Q: The build fails when I use parallel make.

A: Parallel make (often invoked with make -j4) will cause several job steps in the build process to update the same library file (libmpich.a) concurrently. Unfortunately, neither the ar nor the ranlib programs correctly handle this case, and the result is a corrupted library. For now, the solution is to not use a parallel make when building MPICH2.

Compiling MPI Programs

Q: I get compile errors saying "SEEK_SET is #defined but must not be for the C++ binding of MPI".

A: This is really a problem in the MPI-2 standard. And good or bad, the MPICH2 implementation has to adhere to it. The root cause of this error is that both stdio.h and the MPI C++ interface use SEEK_SET, SEEK_CUR, and SEEK_END. You can try adding:

#undef SEEK_SET
#undef SEEK_END
#undef SEEK_CUR

before mpi.h is included, or add the definition


to the command line (this will cause the MPI versions of SEEK_SET etc. to be skipped).

Q: I get compile errors saying "error C2555: 'MPI::Nullcomm::Clone' : overriding virtual function differs from 'MPI::Comm::Clone' only by return type or calling convention".

A: This is caused by buggy C++ compilers not implementing part of the C++ standard. To work around this problem, add the definition:


to the CXXFLAGS variable or add a:


before including mpi.h

Running MPI Programs

Q: How do I pass environment variables to the processes of my parallel program?

A: The specific method depends on the process manager and version of mpiexec that you are using. See the appropriate specific section.

Q: How do I pass environment variables to the processes of my parallel program when using the mpd, hydra or gforker process manager?

A: By default, all the environment variables in the shell where mpiexec is run are passed to all processes of the application program. (The one exception is LD_LIBRARY_PATH when using MPD and the mpd's are being run as root.) This default can be overridden in many ways, and individual environment variables can be passed to specific processes using arguments to mpiexec. A synopsis of the possible arguments can be listed by typing:

mpiexec -help

and further details are available in the Users Guide.

Q: What determines the hosts on which my MPI processes run?

A: Where processes run, whether by default or by specifying them yourself, depends on the process manager being used.

If you are using the gforker process manager, then all MPI processes run on the same host where you are running mpiexec.

If you are using the mpd process manager, which is the default, then many options are available. Before you run mpiexec, you will have started, or will have had started for you, a ring of processes called mpd's (multi-purpose daemons). It is likely, but not necessary, that each mpd will be running on a separate host. You can find out what this ring of hosts consists of by running the program mpdtrace. One of the mpd's will be running on the "local" machine, the one where you will run mpiexec. The default placement of MPI processes, if one runs

mpiexec -n 10 a.out

is to start the first MPI process (rank 0) on the local machine and then to distribute the rest around the mpd ring one at a time. If there are more processes than mpd's, then wraparound occurs. If there are more mpd's than MPI processes, then some mpd's will not run MPI processes. Thus any number of processes can be run on a ring of any size. While one is doing development, it is handy to run only one mpd, on the local machine. Then all the MPI processes will run locally as well.

The first modification to this default behavior is the -1 option to mpiexec (not a great argument name). If -1 is specified, as in

mpiexec -1 -n 10 a.out

then the first application process will be started by the first mpd in the ring after the local host. (If there is only one mpd in the ring, then this will be on the local host.) This option is for use when a cluster of compute nodes has a "head node" where commands like mpiexec are run but not application processes.

If an mpd is started with the --ncpus option, then when it is its turn to start a process, it will start several application processes rather than just one before handing off the task of starting more processes to the next mpd in the ring. For example, if the mpd is started with

mpd --ncpus=4

then it will start as many as four application processes, with consecutive ranks, when it is its turn to start processes. This option is for use in clusters of SMP's, when the user would like consecutive ranks to appear on the same machine. (In the default case, the same number of processes might well run on the machine, but their ranks would be different.)

(A feature of the --ncpus=[n] argument is that it has the above effect only until all of the mpd's have started n processes at a time once; afterwards each mpd starts one process at a time. This is in order to balance the number of processes per machine to the extent possible.)

Other ways to control the placement of processes are by direct use of arguments to mpiexec. See the Users Guide.

Q: My output does not appear until the program exits.

A: Output to stdout and stderr may not be written from your process immediately after a printf or fprintf (or PRINT in Fortran) because, under Unix, such output is buffered unless the program believes that the output is to a terminal. When the program is run by mpiexec, the C standard I/O library (and normally the Fortran runtime library) will buffer the output. For C programmers, you can either call fflush(stdout) to force the output to be written or you can turn off buffering by calling:

#include <stdio.h>
setvbuf( stdout, NULL, _IONBF, 0 );

on each stream (stdout in this example) whose output you want sent immediately to your terminal or file.

There is no standard way to either change the buffering mode or to flush the output in Fortran. However, many Fortran compilers include an extension to provide this function. For example, in g77,

call flush()

can be used. The xlf compiler supports

call flush_(6)

where the argument is the Fortran logical unit number (here 6, which is often the unit number associated with PRINT). With the G95 Fortran 95 compiler, set the environment variable G95_UNBUFFERED_6 to cause output to unit 6 to be unbuffered.

Q: Fortran programs using stdio fail when using g95.

A: By default, g95 does not flush output to stdout. This also appears to cause problems for standard input. If you are using the Fortran logical units 5 and 6 (or the * unit) for standard input and output, set the environment variable G95_UNBUFFERED_6 to yes.

Q: How do I run MPI programs in the background when using the default MPD process manager?

A: To run MPI programs in the background when using MPD, you need to redirect stdin from /dev/null. For example:

mpiexec -n 4 a.out < /dev/null &

Q: How do I use MPICH2 with slurm?

A: To use MPICH2 with slurm, you have to configure MPICH2 with the following options:

./configure --with-pmi=slurm --with-pm=no

In addition, if your slurm installation is not in the default location, you will need to pass the actual installation location using:

./configure --with-pmi=slurm --with-pm=no --with-slurm=[path_to_slurm_install]

Q: All my processes get rank 0.

A: This problem occurs when there is a mismatch between the process manager (PM) used and the process management interface (PMI) with which the MPI application is compiled.

MPI applications use process managers to launch them as well as get information such as their rank, the size of the job, etc. MPICH2 specified an interface called the process management interface (PMI) that is a set of functions that MPICH2 internals (or the internals of other parallel programming models) can use to get such information from the process manager. However, this specification did not include a wire protocol, i.e., how the client-side part of the PMI would talk to the process manager. Thus, many groups implemented their own PMI library in ways that were not compatible with each other with respect to the wire protocol (the interface is still common and as specified). Some examples of PMI library implementations are: (a) simple PMI (MPICH2's default PMI library), (b) smpd PMI (for Linux/Windows compatibility; will be deprecated soon) and (c) slurm PMI (implemented by the slurm developers).

MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun and probably many other process managers use the simple PMI wire protocol. So, as long as the MPI application is linked with the simple PMI library, you can use any of these process managers interchangeably. The simple PMI library is what you are linked to by default when you build MPICH2 using the default options.

srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm, it links with the slurm PMI library. Only srun is compatible with this slurm PMI library, so only that can be used. The slurm folks came out with their own "mpiexec" executable, which essentially wraps around srun, so that uses the slurm PMI as well.

So, in some sense, mpiexec or srun is just a user interface for you to talk in the appropriate PMI wire protocol. If you have a mismatch, the MPI processes will not be able to detect their rank, the job size, etc., so all processes think they are rank 0.

Q: How do I control which ports MPICH2 uses?

A: The MPICH_PORT_RANGE environment variable allows you to specify the range of TCP ports to be used by the process manager and the MPICH2 library. Set this variable before starting your application with mpiexec. The format of this variable is <low>:<high>. For example, to allow the job launcher and MPICH2 to use ports only between 10000 and 10100, if you're using the bash shell, you would use:

export MPICH_PORT_RANGE=10000:10100

Q: Why does my MPI program run much slower when I use more processes?

A: The default channel in MPICH2 (starting with the 1.1 series) is ch3:nemesis. This channel uses busy polling in order to improve intranode shared-memory communication performance. The downside to this is that performance will generally take a dramatic hit if you oversubscribe your nodes. Oversubscription is the case where you run more processes on a node than there are cores on the node. In this scenario, you have a few choices:

  1. Just don't run with more processes than you have cores available. This may not be an option depending on what you are trying to accomplish.
  2. Run your code on more nodes or on the same number of nodes but with larger per-node core counts. That is, your job size should not exceed the total core count for the system on which you are running your job. Again, this may not be an option for you, since you might not have access to additional computers.
  3. Configure your MPICH2 installation with --with-device=ch3:sock. This will use the older ch3:sock channel that does not busy poll. This channel will be slower for intra-node communication, but it will perform much better in the oversubscription scenario.

Debugging MPI Programs

Q: How do I use Totalview with MPICH2?

A: Totalview allows multiple levels of debugging for MPI programs. If you need to debug your application without any information from the MPICH2 stack, you just need to compile your program with mpicc -g (or mpif77 -g, etc) and run your application as:

 totalview mpiexec -a -f machinefile ./foo

The "-a" is a totalview specific option that is not interpreted by mpiexec.

Totalview also allows you to peek into the internals of the MPICH2 stack to query information that might sometimes be helpful for debugging. To allow MPICH2 to expose such information, you need to configure MPICH2 as:

 ./configure --enable-debuginfo