Difference between revisions of "Frequently Asked Questions"

From Mpich
Jump to: navigation, search
(Running MPI Programs)
Line 297: Line 297:
The second reason is that your /etc/hosts files might not be consistent. Unless you are not using an external DNS server, the /etc/hosts file on every machine should contain the correct IP information about all hosts in the system.
The second reason is that your /etc/hosts files might not be consistent. Unless you are not using an external DNS server, the /etc/hosts file on every machine should contain the correct IP information about all hosts in the system.
== Debugging MPI Programs ==
== Debugging MPI Programs ==

Revision as of 14:27, 11 November 2010


General Information

Q: What is MPICH2?

A: MPICH2 is a freely available, portable implementation of MPI, the Standard for message-passing libraries. It implements MPI-1, MPI-2, MPI-2.1 and MPI-2.2.

Q: What does MPICH stand for?

A: MPI stands for Message Passing Interface. The CH comes from Chameleon, the portability layer used in the original MPICH to provide portability to the existing message-passing systems.

Q: Can MPI be used to program multicore systems?

A: There are two common ways to use MPI with multicore processors or multiprocessor nodes:

  1. Use one MPI process per core (here, a core is defined as a program counter and some set of arithmetic, logic, and load/store units).
  2. Use one MPI process per node (here, a node is defined as a collection of cores that share a single address space). Use threads or compiler-provided parallelism to exploit the multiple cores. OpenMP may be used with MPI; the loop-level parallelism of OpenMP may be used with any implementation of MPI (you do not need an MPI that supports MPI_THREAD_MULTIPLE when threads are used only for computational tasks). This is sometimes called the hybrid programming model.

MPICH2 automatically recognizes multicore architectures and optimizes communication for such platforms. No special configure option is required.

Building MPICH2

Q: What are process managers?

A: Process managers are basically external (typically distributed) agents that spawn and manage parallel jobs. These process managers communicate with MPICH2 processes using a predefined interface called as PMI (process management interface). Since the interface is (informally) standardized within MPICH2 and its derivatives, you can use any process manager from MPICH2 or its derivatives with any MPI application built with MPICH2 or any of its derivatives, as long as they follow the same wire protocol. There are three known implementations of the PMI wire protocol: "simple", "smpd" and "slurm". By default, MPICH2 and all its derivatives use the "simple" PMI wire protocol, but MPICH2 can be configured to use "smpd" or "slurm" as well.

For example, MPICH2 provides several different process managers such as Hydra, MPD, Gforker and Remshell which follow the "simple" PMI wire protocol. MVAPICH2 provides a different process manager called "mpirun" that also follows the same wire protocol. OSC mpiexec follows the same wire protocol as well. You can mix and match an application built with any MPICH2 derivative with any process manager. For example, an application built with Intel MPI can run with OSC mpiexec or MVAPICH2's mpirun or MPICH2's Gforker.

MPD has been the traditional default process manager for MPICH2 till the 1.2.x series of MPICH2. Starting the 1.3.x series, Hydra is the default process manager.

SMPD is another process manager distributed with MPICH2 that uses the "smpd" PMI wire protocol. This is mainly used for running MPICH2 on Windows or a combination of UNIX and Windows machines. This will be deprecated in the future releases of MPICH2 in favour or Hydra. MPICH2 can be configured with SMPD using:

 ./configure --with-pm=smpd --with-pmi=smpd

SLURM is an external process manager that uses MPICH2's PMI interface as well. MPICH2 can be configured to use SLURM's PMI using:

 ./configure --with-pm=none --with-pmi=slurm

Once configured with slurm, no internal process manager is built for MPICH2; the user is expected to use SLURM's launch models (such as srun).

Q: Do I have to configure/make/install MPICH2 each time for each compiler I use?

A: No, in many cases you can build MPICH2 using one set of compilers and then use the libraries (and compilation scripts) with other compilers. However, this depends on the compilers producing compatible object files. Specifically, the compilers must

  1. Support the same basic datatypes with the same sizes. For example, the C compilers should use the same sizes for long long and long double.
  2. Map the names of routines in the source code to names in the object files in the object file in the same way. This can be a problem for Fortran and C++ compilers, though you can often force the Fortran compilers to use the same name mapping. More specifically, most Fortran compilers map names in the source code into all lower-case with one or two underscores appended to the name. To use the same MPICH2 library with all Fortran compilers, those compilers must make the same name mapping. There is one exception to this that is described below.
  3. Perform the same layout for C structures. The C langauge does not specify how structures are layed out in memory. For 100\% compatibility, all compilers must follow the same rules. However, if you do not use any of the MPI_MIN_LOC or MPI_MAX_LOC datatypes, and you do not rely on the MPICH2 library to set the extent of a type created with MPI_Type_struct or MPI_Type_create_struct, you can often ignore this requirement.
  4. Require the same additional runtime libraries. Not all compilers will implement the same version of Unix, and some routines that MPICH2 uses may be present in only some of the run time libraries associated with specific compilers.

The above may seem like a stringent set of requirements, but in practice, many systems and compiler sets meet these needs, if for no other reason than that any software built with multiple libraries will have requirements similar to those of MPICH2 for compatibility.

If your compilers are completely compatible, down to the runtime libraries, you may use the compilation scripts (mpicc etc.) by either specifying the compiler on the command line, e.g.

mpicc -cc=icc -c foo.c

or with the environment variables MPICH_CC etc. (this example assume a c-shell syntax):

setenv MPICH_CC icc
mpicc -c foo.c

If the compiler is compatible except for the runtime libraries, then this same format works as long as a configuration file that describes the necessary runtime libraries is created and placed into the appropriate directory (the "sysconfdir" directory in configure terms). See the installation manual for more details.

In some cases, MPICH2 is able to build the Fortran interfaces in a way that supports multiple mappings of names from the Fortran source code to the object file. This is done by using the "multiple weak symbol" support in some environments. For example, when using gcc under Linux, this is the default.

Q: How do I configure to use the Absoft Fortran compilers?

A: You can find build instructions on the Absoft web site at the bottom of the page http://www.absoft.com/Products/Compilers/Fortran/Linux/fortran95/MPich_Instructions.html

Q: When I configure MPICH2, I get a message about FDZERO and the configure aborts.

A: FD_ZERO is part of the support for the select calls (see ``man select or ``man 2 select on Linux and many other Unix systems) . What this means is that your system (probably a Mac) has a broken version of the select call and related data types. This is an OS bug; the only repair is to update the OS to get past this bug. This test was added specifically to detect this error; if there was an easy way to work around it, we would have included it (we don't just implement FD_ZERO ourselves because we don't know what else is broken in this implementation of select).

If this configure works with gcc but not with xlc, then the problem is with the include files that xlc is using; since this is an OS call (even if emulated), all compilers should be using consistent if not identical include files. In this case, you may need to update xlc.

Q: When I use the g95 Fortran compiler on a 64-bit platform, some of the tests fail.

A: The g95 compiler incorrectly defines the default Fortran integer as a 64-bit integer while defining Fortran reals as 32-bit values (the Fortran standard requires that INTEGER and REAL be the same size). This was apparently done to allow a Fortran INTEGER to hold the value of a pointer, rather than requiring the programmer to select an INTEGER of a suitable KIND. To force the g95 compiler to correctly implement the Fortran standard, use the -i4 flag. For example, set the environment variable F90FLAGS before configuring MPICH2:

setenv F90FLAGS "-i4"

G95 users should note that there (at this writing) are two distributions of g95 for 64-bit Linux platforms. One uses 32-bit integers and reals (and conforms to the Fortran standard) and one uses 32-bit integers and 64-bit reals. We recommend using the one that conforms to the standard (note that the standard specifies the ratio of sizes, not the absolute sizes, so a Fortran 95 compiler that used 64 bits for both INTEGER and REAL would also conform to the Fortran standard. However, such a compiler would need to use 128 bits for DOUBLE PRECISION quantities).

Q: Make fails with errors such as these:
sock.c:8:24: mpidu_sock.h: No such file or directory
In file included from sock.c:9:
mpidpre.h: No such file or directory
In file included from sock.c:9:
error: syntax error before "MPID_VCRT"
../../../../include/mpiimpl.h:1150: warning: no semicolon at end of struct or union

A: Check if you have set the envirnoment variable CPPFLAGS. If so, unset it and use CXXFLAGS instead. Then rerun configure and make.

Q: When building the ssm channel, I get this error:
mpidu_process_locks.h:234:2: error: \#error *** No atomic memory operation specified to implement busy locks ***

A: The ssm channel does not work on all platforms because they use special interprocess locks (often assembly) that may not work with some compilers or machine architectures. It works on Linux with gcc, Intel, and Pathscale compilers on various Intel architectures. It also works in Windows and Solaris environments.

This channel is now deprecated. Please use the ch3:nemesis channel, which is more portable and performs better than ssm.

Q: When using the Intel Fortran 90 compiler (version 9), the make fails with errors in compiling statement that reference MPI_ADDRESS_KIND.

A: Check the output of the configure step. If configure claims that ifort is a cross compiler, the likely problem is that programs compiled and linked with ifort cannot be run because of a missing shared library. Try to compile and run the following program (named conftest.f90):

program conftest
integer, dimension(10) :: n

If this program fails to run, then the problem is that your installation of ifort either has an error or you need to add additional values to your environment variables (such as LD_LIBRARY_PATH). Check your installation documentation for the ifort compiler. See http://softwareforums.intel.com/ISN/Community/en-US/search/SearchResults.aspx?q=libimf.so for an example of problems of this kind that users are having with version 9 of ifort.

If you do not need Fortran 90, you can configure with --disable-f90.

Q: The build fails when I use parallel make.

A: Parallel make (often invoked with make -j4) will cause several job steps in the build process to update the same library file (libmpich.a) concurrently. Unfortunately, neither the ar nor the ranlib programs correctly handle this case, and the result is a corrupted library. For now, the solution is to not use a parallel make when building MPICH2.

Q: I get a configure error saying "Incompatible Fortran and C Object File Types!"

A: This is a problem with the default compilers available on Mac OS: it provides a it provides a 32-bit C compiler and a 64-bit Fortran compiler (or the other way around). These two are not compatible with each other. Consider installing the same architecture compilers. Alternatively, if you do not need to build Fortran programs, you can disable it with the configure option --disable-f77 --disable-f90.

Compiling MPI Programs

Q: I get compile errors saying "SEEK_SET is #defined but must not be for the C++ binding of MPI".

A: This is really a problem in the MPI-2 standard. And good or bad, the MPICH2 implementation has to adhere to it. The root cause of this error is that both stdio.h and the MPI C++ interface use SEEK_SET, SEEK_CUR, and SEEK_END. You can try adding:

#undef SEEK_SET
#undef SEEK_END
#undef SEEK_CUR

before mpi.h is included, or add the definition


to the command line (this will cause the MPI versions of SEEK_SET etc. to be skipped).

Q: I get compile errors saying "error C2555: 'MPI::Nullcomm::Clone' : overriding virtual function differs from 'MPI::Comm::Clone' only by return type or calling convention".

A: This is caused by buggy C++ compilers not implementing part of the C++ standard. To work around this problem, add the definition:


to the CXXFLAGS variable or add a:


before including mpi.h

Running MPI Programs

Q: How do I pass environment variables to the processes of my parallel program?

A: The specific method depends on the process manager and version of mpiexec that you are using. See the appropriate specific section.

Q: How do I pass environment variables to the processes of my parallel program when using the mpd, hydra or gforker process manager?

A: By default, all the environment variables in the shell where mpiexec is run are passed to all processes of the application program. (The one exception is LD_LIBRARY_PATH when using MPD and the mpd's are being run as root.) This default can be overridden in many ways, and individual environment variables can be passed to specific processes using arguments to mpiexec. A synopsis of the possible arguments can be listed by typing:

mpiexec -help

and further details are available in the Users Guide here: http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs.

Q: What determines the hosts on which my MPI processes run?

A: Where processes run, whether by default or by specifying them yourself, depends on the process manager being used.

If you are using the Hydra process manager, the host file can contain the number of processes you want on each node. More information on this can be found here.

If you are using the gforker process manager, then all MPI processes run on the same host where you are running mpiexec.

If you are using the mpd process manager, which is the default, then many options are available. If you are using mpd, then before you run mpiexec, you will have started, or will have had started for you, a ring of processes called mpd's (multi-purpose daemons), each running on its own host. It is likely, but not necessary, that each mpd will be running on a separate host. You can find out what this ring of hosts consists of by running the program mpdtrace. One of the mpd's will be running on the ``local machine, the one where you will run mpiexec. The default placement of MPI processes, if one runs

mpiexec -n 10 a.out

is to start the first MPI process (rank 0) on the local machine and then to distribute the rest around the mpd ring one at a time. If there are more processes than mpd's, then wraparound occurs. If there are more mpd's than MPI processes, then some mpd's will not run MPI processes. Thus any number of processes can be run on a ring of any size. While one is doing development, it is handy to run only one mpd, on the local machine. Then all the MPI processes will run locally as well.

The first modification to this default behavior is the -1 option to mpiexec (not a great argument name). If -1 is specified, as in

mpiexec -1 -n 10 a.out

then the first application process will be started by the first mpd in the ring after the local host. (If there is only one mpd in the ring, then this will be on the local host.) This option is for use when a cluster of compute nodes has a ``head node where commands like mpiexec are run but not application processes.

If an mpd is started with the --ncpus option, then when it is its turn to start a process, it will start several application processes rather than just one before handing off the task of starting more processes to the next mpd in the ring. For example, if the mpd is started with

mpd --ncpus=4

then it will start as many as four application processes, with consecutive ranks, when it is its turn to start processes. This option is for use in clusters of SMP's, when the user would like consecutive ranks to appear on the same machine. (In the default case, the same number of processes might well run on the machine, but their ranks would be different.)

(A feature of the --ncpus=[n] argument is that it has the above effect only until all of the mpd's have started n processes at a time once; afterwards each mpd starts one process at a time. This is in order to balance the number of processes per machine to the extent possible.)

Other ways to control the placement of processes are by direct use of arguments to mpiexec. See the Users Guide here: http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

Q: My output does not appear until the program exits.

A: Output to stdout may not be written from your process immediately after a printf or fprintf (or PRINT in Fortran) because, under Unix, such output is buffered unless the program believes that the output is to a terminal. When the program is run by mpiexec, the C standard I/O library (and normally the Fortran runtime library) will buffer the output. For C programmers, you can either use a call fflush(stdout) to force the output to be written or you can set no buffering by calling:

#include <stdio.h>
setvbuf( stdout, NULL, _IONBF, 0 );

on each file descriptor (stdout in this example) which you want to send the output immediately to your terminal or file.

There is no standard way to either change the buffering mode or to flush the output in Fortran. However, many Fortrans include an extension to provide this function. For example, in g77,

call flush()

can be used. The xlf compiler supports

call flush_(6)

where the argument is the Fortran logical unit number (here 6, which is often the unit number associated with PRINT). With the G95 Fortran 95 compiler, set the environment variable G95_UNBUFFERED_6 to cause output to unit 6 to be unbuffered.

In C stderr is not buffered. So, alternatively, you can write your output to stderr instead.

Q: Fortran programs using stdio fail when using g95.

A: By default, g95 does not flush output to stdout. This also appears to cause problems for standard input. If you are using the Fortran logical units 5 and 6 (or the * unit) for standard input and output, set the environment variable G95_UNBUFFERED_6 to yes.

Q: How do I run MPI programs in the background when using the default MPD process manager?

A: To run MPI programs in the background when using MPD, you need to redirect stdin from /dev/null. For example:

mpiexec -n 4 a.out < /dev/null &

Q: How do I use MPICH2 with slurm?

A: To use MPICH2 with slurm, you have to configure MPICH2 with the following options:

./configure --with-pmi=slurm --with-pm=no

In addition, if your slurm installation is not in the default location, you will need to pass the actual installation location using:

./configure --with-pmi=slurm --with-pm=no --with-slurm=[path_to_slurm_install]
Q: All my processes get rank 0.

A: This problem occurs when there is a mismatch between the process manager (PM) used and the process management interface (PMI) with which the MPI application is compiled.

MPI applications use process managers to launch them as well as get information such as their rank, the size of the job, etc. MPICH2 specified an interface called the process management interface (PMI) that is a set of functions that MPICH2 internals (or the internals of other parallel programming models) can use to get such information from the process manager. However, this specification did not include a wire protocol, i.e., how the client-side part of the PMI would talk to the process manager. Thus, many groups implemented their own PMI library in ways that were not compatible with each other with respect to the wire protocol (the interface is still common and as specified). Some examples of PMI library implementations are: (a) simple PMI (MPICH2's default PMI library), (b) smpd PMI (for linux/windows compatibility; will be deprecated soon) and (c) slurm PMI (implemented by the slurm guys).

MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun and probably many other process managers use the simple PMI wire protocol. So, as long as the MPI application is linked with the simple PMI library, you can use any of these process managers interchangeably. Simple PMI library is what you are linked to by default when you build MPICH2 using the default options.

srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm, it links with the slurm PMI library. Only srun is compatible with this slurm PMI library, so only that can be used. The slurm folks came out with their own "mpiexec" executable, which essentially wraps around srun, so that uses the slurm PMI as well.

So, in some sense, mpiexec or srun is just a user interface for you to talk in the appropriate PMI wire protocol. If you have a mismatch, the MPI process will not be able to detect their rank, the job size, etc., so all processes think they are rank 0.

Q: How do I control which ports MPICH2 uses?

A: The MPICH_PORT_RANGE environment variable allows you to specify the range of TCP ports to be used by the process manager and the MPICH2 library. Set this variable before starting your application with mpiexec. The format of this variable is <low>:<high>. For example, to allow the job launcher and MPICH2 to use ports only between 10000 and 10100, if you're using the bash shell, you would use:

export MPICH_PORT_RANGE=10000:10100

Q: Why does my MPI program run much slower when I use more processes?

A: The default channel in MPICH2 (starting with the 1.1 series) is ch3:nemesis. This channel uses busy polling in order to improve intranode shared-memory communication performance. The downside to this is that performance will generally take a dramatic hit if you oversubscribe your nodes. Oversusbscription is the case where you run more processes on a node than there are cores on the node. In this scenario, you have a few choices:

  1. Just don't run with more processes than you have cores available. This may not be an option depending on what you are trying to accomplish.
  2. Run your code on more nodes or on the same number of nodes but with larger per-node core counts. That is, your job size should not exceed the total core count for the system on which you are running your job. Again, this may not be an option for you, since you might not have access to additional computers.
  3. Configure your MPICH2 installation with --with-device=ch3:sock. This will use the older ch3:sock channel that does not busy poll. This channel will be slower for intra-node communication, but it will perform much better in the oversubscription scenario.

Q: My MPI program aborts with an error saying it cannot communicate with other processes

A: There are two possible reasons for this. The first is firewalls. Make sure the firewalls are turned off on all machines. If you don't want to control the firewalls, see above on how you can open a few ports in the firewall, and ask MPICH2 to use those ports.

The second reason is that your /etc/hosts files might not be consistent. Unless you are not using an external DNS server, the /etc/hosts file on every machine should contain the correct IP information about all hosts in the system.

Debugging MPI Programs

Q: How do I use Totalview with MPICH2?

A: Totalview allows multiple levels of debugging for MPI programs. If you need to debug your application without any information from the MPICH2 stack, you just need to compile your program with mpicc -g (or mpif77 -g, etc) and run your application as:

 totalview mpiexec -a -f machinefile ./foo

The "-a" is a totalview specific option that is not interpreted by mpiexec.

Totalview also allows you to peep into the internals of the MPICH2 stack to query information that might sometimes be helpful for debugging. To allow MPICH2 to expose such information, you need to configure MPICH2 as:

 ./configure --enable-debuginfo


Q: Why am I getting so many unexpected messages?

A: If a process is receiving too many unexpected messages,your application may fail with a message similar to this:

 Failed to allocate memory for an unexpected message. 261894 unexpected messages queued.

Receiving too many unexpected messages is a common bug people hit. This is often caused by some processes "running ahead" of others.

First, what is an unexpected message? An unexpected message is a message which has been received by the MPI library for which a receive hasn't been posted (i.e., the program has not called a receive function like MPI_Recv or MPI_Irecv). What happens is that for small messages ("small" being determined by the particular MPI library and/or interconnect you're using) when a process receives an unexpected message, the library stores a copy of the message internally. When the application posts a receive matching the unexpected message, the data is copied out of the internal buffer and the internal buffer is freed. The problem occurs when a process has to store too many unexpected messages: eventually it will run out of memory. This problem may happen even if the program has a matching receive for every message being sent to it.

Consider a program where one process receives a message then does some computation on it repeatedly in a loop, and another process sends messages to the other process, also in a loop. Because the first process spends some time processing each message it'll probably run slower than second, so the second process will end up sending messages faster than the first process can receive them.

The receiving process will end up with all of these unexpected messages because it hasn't been able to post the receives fast enough. Note that this can happen even if you're using blocking sends: Remember that MPI_Send returns when the send buffer is free to be reused, and not necessarily when the receiver has received the message.

Another place where you can run into this problem of unexpected messages is with collectives. People often think of collective operations as synchronizing operations. This is not always the case. Consider reduce. A reduction operation is typically performed in a tree fashion, where each process receives messages from it's children performs the operation and sends the result to it's parent. If a process which happens to be a leaf of the tree calls MPI_Reduce before it's parent process does, it will result in an unexpected message at the parent. Note also that the leaf process may return from MPI_Reduce before it's parent even calls MPI_Reduce. Now look what happens if you have a loop with MPI_Reduce in it. Because the non-leaf nodes have to receive messages from several children perform a calculation and send the result, they will run slower than the leaf nodes which only have to send a single message, so you may end up with a "unexpected message storm" similar to the one above.

So how can you fix the problem? See if you can rearrange your code to get rid of loops like the ones described above. Otherwise, you'll probably have to introduce some synchronization between the processes which may affect performance. For loops with collectives, you can add an MPI_Barrier in the loop. For loops with sends/receives you can use synchronous sends (MPI_Ssend, and friends) or have the sender wait for an explicit ack message from the receiver. Of course you could optimize this where you're not doing the synchronization in every iteration of the loop, e.g., call MPI_Barrier every 100th iteration.

Q: My MPD ring won't start, what's wrong?

A: MPD is a temperamental piece of software and can fail to work correctly for a variety of reasons. Most commonly, however, networking configuration problems are the root cause. In general, we recommend Using the Hydra Process Manager instead of MPD. Hydra is the default process manager starting MPICH2-1.3. If you need to continue using MPD for some reason, here are some suggestions to try out.

If you see error messages that look like any of the following, please try the troubleshooting steps listed in Appendix A of the MPICH2 Installer's Guide:

% mpdboot -n 2 -f  mpd.hosts
mpdboot_hostname (handle_mpd_output 374): failed to ping mpd on node1; recvd output={}

[1]+  Done                    mpd

Case 1: start mpd on n1, OK.
Case 2: start “mpd --trace" on n2, stuck at:
            n2_60857: ENTER mpd_sockpair in
/home/sl/mpi/bin/mpdlib.py at line 213; ARGS=()
Case 3: start "mpdboot -n 2 -f mpd.hosts" on n1, OK. mpdtrace on each
node shows "n1, n2"
           mpdringtest OK.
           mpiexec -n 2 date   // Fail, mpiexec_n1 (mpiexec 392): no msg recvd from mpd when expecting ack of request

Case 4: start "mpdboot -n 2 -f mpd.hosts" on n2, failed with nothing output.

In particular, make sure to try the steps that involve the mpdcheck utility.

If you are using MPD on CentOS Linux and mpdboot hangs and then elicits the following messages upon CTRL-C:

^[[CTraceback (most recent call last):
 File "/nfs/home/atchley/projects/mpich2-mx-1.2.1..6/build/shower/bin/mpdboot", line 476, in ?
 File "/nfs/home/atchley/projects/mpich2-mx-1.2.1..6/build/shower/bin/mpdboot", line 347, in mpdboot
 File "/nfs/home/atchley/projects/mpich2-mx-1.2.1..6/build/shower/bin/mpdboot", line 385, in handle_mpd_output
   for line in fd.readlines():    # handle output from shells that echo stuff

then you should see #974. This is a known issue that we have yet to resolve.