|
This is a short introduction to the Message Passing
Interface (MPI) system that was designed to enable parallel
programming by communication on distributed-memory machines. MPI has become
a standard for multiple-processor programming of code that runs on a
variety of machines, from small Beowulf installations to shared-memory
high-performance machines and hybrid clusters such as the ones
at HPCVL. Due to the complexity of the system, it is of course
impossible to give a detailed guide or even tutorial to MPI-programming in
the framework of a "Frequently Asked
Questions" file. This is not an introduction to MPI
programming. References and links for further details are given.
Frequently Asked Questions:
What is the Message
Passing Interface (MPI) ?
What kind of system uses MPI?
How is MPI used?
Give me an example.
How is MPI implemented on the
HPCVL machines?
How do I compile MPI code on HPCVL?
How do I run MPI programs on HPCVL
computers?
Where can I learn details about
MPI?
Are there any tools to help me with
MPI programming?
It doesn't work. Where can I get
help?
Answers:
What is the Message
Passing Interface (MPI) ?
The Message Passing Interface is a
communication system that was designed by a group of
researchers to supply programmers with a standard for
distributed-memory parallel programming that is portable
and usable on a variety of platforms. The standardization
process was initiated at a Workshop in 1992, and a draft
of the standard was available a year later. Version 1.0
was released in the Summer of 1994. The current version of
MPI is 2.0.
At present, the system is implemented in the form of
"bindings" for Fortran, C, and C++. MPI-1
(i.e. the original version 1.0 of the standard) consisted
of about 100 functions/subroutines which form the
MPI core and were available for Fortran and
C. MPI-2 approximately doubles the number of routines,
adding new features (which we will not discuss in this
FAQ), and introducing C++ bindings.
The number of essential MPI routines is quite limited. In our
experience, well-functioning MPI programs can be written with a
dozen or so. The routines are supplied in the form of
libraries, header files and usually scripts for easy
compilation. Almost any hardware/OS platform supplies a native
MPI version these days, and public-domain versions are
available as well.
Here is an overview of the main components of MPI-1:
- Bindings for C and Fortran
- Point-to-point communication
- Collective operations
- Process groups and communication domains
- User-defined data types and packing
- Process topologies
- Management, Inquiry, and Profiling
MPI-2 adds to this:
- C++ bindings
- Dynamic processes
- Parallel I/O
- One-sided operations
What kind
of system uses MPI?
MPI was designed to handle distributed-memory systems,
i.e. clusters. This does not mean that its usage is
restricted to such systems. In fact, there are good
arguments for MPI running even better on shared-memory
machines. It will of course also be usable on hybrids such
as the HPCVL Sunfire cluster.
Since MPI provides a means to enable communication between
different CPUs, it does not depend on shared-memory
architectures, as is the case for multi-threading systems such
as OpenMP. On the other hand, it can make use of shared memory
for fast amd improved communication.
Note that the status of MPI as a distributed-memory
system implies that multiple processes are started from the
beginning and run, usually on different CPUs to
completion. These processes do not have anything in common, and
each has its own memory space. Any information exchange
requires communication of data, for which MPI was designed.
How is MPI used?
MPI is a set of subroutines which are used explicitly
to communicate between processes. As such, MPI programs are
truly "multi-processing". Parallelisation can
not be done automatically or semi-automatically as in
"multi-threading" programs. Instead, function and
subroutine calls have to be inserted into the code and form
an integral part of the program. Often it is beneficial to
alter the algorithm of the code with respect to the serial
version.
The need to include the parallelism explicitly in the
program is both a curse and a blessing: while it means more
work and requires more planning than multi-threading, it also
often leads to more reliable and scalable code since the
behaviour of the latter is in the hands of the
programmer. Well-written MPI codes can be made to scale for
thousands of CPUs.
To create an MPI program, you need to:
- Include appropriate header files for the definitions of
variables and data structures. These are called
mpif.h, mpi.h, and mpi++.h for
Fortran, C, and C++, respectively.
- Program the communication between processes in the form
of calls to the MPI communication routines. These are
commonly of the form MPI_* for Fortran and C, and
MPI::* for C++.
- Bind in the proper libraries at the linking stage of
program compilation. This is usually done with the
-lmpi option of the compiler/linker.
MPI programs also usually need a special runtime
environment to be executed properly. This is commonly
supplied by the machine vendor and is machine specific.
Give me an example
The working principle of MPI is perhaps best illustrated on
the grounds of a programming
example. The following program,
written in Fortran 90 computes the sum of all square-roots of
integers from 0 up to a specific limit m:
module mpi
include 'mpif.h'
end module mpi
module cpuids
integer::myid,totps, ierr
end module cpuids
program example02
use mpi
use cpuids
call mpiinit
call demo02
call mpi_finalize(ierr)
stop
end
subroutine mpiinit
use mpi
use cpuids
call mpi_init( ierr )
call mpi_comm_rank(mpi_comm_world,myid,ierr)
call mpi_comm_size(mpi_comm_world,totps,ierr)
return
end
subroutine demo02
use mpi
use cpuids
integer:: m, i
real*8 :: s, mys
if(myid.eq.0) then
write(*,*)'how many terms?'
read(*,*) m
end if
call mpi_bcast(m,1,mpi_integer,0,mpi_comm_world,ierr)
mys=0.0d0
do i=myid,m,totps
mys=mys+dsqrt(dfloat(i))
end do
write(*,*)'rank:', myid,'mys=',mys, ' m:',m
s=0.0d0
call mpi_reduce(mys,s,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierr)
if(myid.eq.0) then
write(*,*)'total sum: ', s
end if
return
end
Some of the common tasks that need to be performed in every MPI
program are done in the subroutine mpiinit in this
program. Namely, we need to call the routine mpi_init to
prepare the usage of MPI. This has to be done before any other MPI
routine is called. The two routine calls to mpi_comm_size
and call mpi_comm_rank determine how many processes are
running and what is the unique ID number of the present, i.e. the
calling process. Both pieces of information are essential. The
results are stored in the variables totps and myid,
respectively. Note that these variables appear in a module
cpuids so that they may be accessed from all routines that
"use" that module.
The main work in the example is done in the subroutine
demo02. Note that this routine does use the module
cpuids. The first operation is to determine the maximum
integer m in the sum by requesting input from the user. The
if-clause "if(myid.eq.0) then" serves to restrict
this I/O operation to only one process, the so-called "root
process", usually chosen to be the one with rank
(i.e. unique ID number) zero.
After this initial operation, communication has become necessary,
since only one process has the right value of m. This is
done by a call to the MPI collective operation routine
mpi_bcast. This call has the effect of
"broadcasting" the integer m. This call needs to
be made by all processes, and after they have done so, all
of them know m.
The sum over the square root is then executed on each process in a
slightly different manner. Each term is added to a local variable
mys. A stride of totps (the number of processes) in
the do-loop ensures that each process adds different terms to its
local sum, by skipping all others. For instance, if there are 10
processes, process 0 will add the square-roots of 0,10,20,30,...,
while process 7 will add the square-roots of 7,17,27,37,...
After the sums have been completed, further communication is
necessary, since each process only has computed a partial, local
sum. We need to collect these local sums into one total, and we do
so by calling mpi_reduce. The effect of this call is to
"reduce" a value local to each process to a variable that
is local to only one process, usually the root process. We
can do this in various ways, but in our case we choose to sum the
values up by specifying mpi_sum in the function
call. Afterwards, the total sum resides in the variable s,
which is printed out by the root process.
The last operation done in our example is finalizing MPI usage by a
call to mpi_finalize, which is necessary for proper program
completion.
In this simple example, we have distributed the tasks of computing
many square roots among processes, each of which only did a part of
the work. We used communication to exchange information about the
tasks that needed to be performed, and to collect results. This
mode of programming is called "task parallel". Often it
is necessary to distribute large amounts of data among processes as
well, leading to "data parallel" programs. Of course, the
distinction is not always clear.
How is MPI implemented on the HPCVL machines?
While MPI itself is a portable, platform independent
standard, much like a programming language, the actual
implementation is necessarily platform dependent since it has
to take into account the architecture of the machine or
cluster in question.
HPCVL machines are mostly shared-memory machines that form
a cluster. The Sun MPI implementation makes use of this
structure by exploiting the rapid access to a common memory
space for MPI communication. This is done by means of a
so-called "shared-memory layer". As a result,
communication between CPUs on the same shared-memory node are
orders of magnitude faster and more reliable than between
nodes. Our cluster is therefore configured in to preferably
schedule processes within a single node.
The MPI libraries, headers, etc. reside in the
/opt/SUNWhpc directory subtree. It is usually
not necessary to include the proper directories in
the PATH variable as this is done by default by the
system. The /opt/SUNWhpc directory includes
several versions (6, 7, and 8) of the
Sun ClusterTools (CT) environment which
enables the compiling and running of multi-process programs.
The current default version of ClusterTools is 8.1
which uses OpenMPI as its MPI implementation. For
compatibility, we keep older versions of ClusterTools, such as
the Sun MPI based CT6. This will not be discontinued in
the future.
If possible run MPI jobs using
the default version of ClusterTools. This version is
based on OpenMPI which
is an OpenSource version of MPI and therefore offers a greater
degree of compatibility with other platforms.
How do I compile MPI code on HPCVL?
The compilation of MPI programs requires a few compiler
options to direct the compiler to the location of header files
and libraries. Since these switches are always the same, they
have been collected in a macro to avoid unnecessary
typing. The macro is has an mp or mpi prefix
before the normal compiler name, for ClusterTools 6 or
ClusterTools 7 and higher, repsectively. The commands are mpf90
or mpi90 for Fortran 90 compiler,
and mpcc, mpicc, mpCC, and
mpiCC for the C and C++ compilers,
respectively. For instance, if a serial C program is compiled by
cc -xO3 -c test.c the corresponding parallel
(MPI) program is compiled by
mpcc -xO3 -c test_mpi.c [ClusterTools 6]
mpicc -xO3 -c test_mpi.c [ClusterTools 8.1]
In the linking stage, the specification of the
MPI libraries is required for ClusterTools 6. This is done
via the -lmpi option.
For ClusterTools 7 and higher this inclusion is automatic. For example,
the above MPI program should be linked with something like
this:
mpcc -o test_mpi.exe test_mpi.o -lmpi [ClusterTools 6]
mpicc -o test_mpi.exe test_mpi.o [ClusterTools 8.1]
Compiling and linking may also be combined.
How do I run MPI programs on HPCVL computers?
To run MPI programs, a special Runtime Environment is
required. This differs from platform to platform. On our Sun
machines, which run Solaris, it is
called ClusterTools or CT. This
includes commands for the control of multi-process jobs. The
most important ones are:
- mpinfo (only CT6) gives general
information about the
cluster or machine. The options -N and -P are
commonly used to obtain more details about the nodes and
partitions of a cluster, respectively. (Nodes are
physical machines in a cluster, partitions are logical groups
of CPUs).
- mprun (CT6) and mpirun (CT7 and higher)
are used to start a multi-process run of a program. They are
required to run MPI programs. The most
commonly used command line option is -np to specify the
number of processes to be started on the default
partition. Note that CT7 understands the -n option as well,
but CT6 does not.
For instance, the following line will start the
program test_mpi.exe with 9 processes:
mprun -np 9 test_mpi.exe [ClusterTools 6]
mpirun -n 9 test_mpi.exe [ClusterTools 8.1]
- mpps (only CT6) is the multi-process equivalent to the Unix
command ps, and used to produce a list of multi-process
jobs. Without options, only the jobs of the current user that were
started from the current shell, and are running in the default
partition are shown. (A partition is a logical group
of processors, defined by the system). Option available include
-e to show all jobs, -f to give a longer description
including all processes, starting times, etc., -A to include
all partitions.
- mpkill (only CT6), followed by the JID (job-ID, determined
through an mpps call) terminates a multi-process job and
sends a "kill signal" to all its processes. The option
-9 will force the termination (just as it does in the Unix
kill command).
The mprun and mpirun commands offer
additional options that are sometimes useful or required. Most
tend to interfere with the scheduling of jobs in a multi-user
environment such as ours and should be used with
caution. Please consult the man pages for details.
Finally, the -x sge option (only CT6) is used
to indicate that the node list for the processes is supplied
by the Gridengine scheduling software. This option is used
when MPI programs are executed via Gridengine, and appears
normally inside of Gridengine submission scripts. If -x
sge is specified, the number of processes does not have to
be specified anymore, since it will be determined by the
Gridengine. A command line looks like this:
mprun -x sge program.exe
Note that the usage of Gridengine is mandatory for
production jobs on our
system. This option is therefore used frequently. For a
details about Gridengine and jobs submission on the Sunfire
Cluster,
go here.
Note that the usage of the -x sge option
is not supported if you use ClusterTools 8, as
the mpirun command detects usage through Grid Engine
automatically. In this case, the -n option is
not used either, bringing the command line down to a simple
mpirun program.exe
Where can I learn details about
MPI?
As already pointed out, this FAQ is not an introduction to MPI
programming. The standard reference text on MPI is:
Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and
Jack Dongarra: MPI - The Complete Reference (2nd
edition), The MIT Press, Cambridge, Massachusetts, 2000; 2
volumes, ISBN 0-262-69215-5/0-262-69213-3
This text specifies all MPI routines and concepts, and includes a
large number of examples. Most people will find it sufficient for
all their needs.
A quite good online tutorial for MPI programming can be found at the
Maui HPCC site.
For Sun and Solaris specific questions, including CRE and Sun MPI,
visit the Sun Documentation Site
and use their Search Engine to look for "Sun MPI",
"CRE", and "ClusterTools".
There is also an official MPI
webpage which contains the standards documents for MPI and
gives access to the MPI Forum.
HPCVL also organizes Workshops on a regular basis, and one of them
is devoted to basic MPI programming. They are announced on our web site. We might see you
there sometime soon.
Are there
any tools helping me with MPI programming?
Standard debugging and profiling tools such as Sun Studio are
designed for serial or multi-threaded programs. They do not handle
multi-process runs very well.
Quite often, the best way to check the performance of an
MPI program is timing it by insertion of suitable
routines. MPI supplies a "wall-clock" routine called
MPI_WTIME(), that lets you determine how much actual time was
spent in a specific segment of your code. An other method is
calling the subroutines ETIME and DTIME, which can give you
information about the actual CPU time used. However, it is
advisable to carefully read the documentation before using
them with MPI programs. In this case, go
to docs.sun.com and search
for "Sun Studio 12: Fortran Library Reference".
HPCVL also provides a package called the HPCVL Working
Template (HWT), which was created by Gang Liu and has now
reached version 5.3. The HWT provides 3 main functionalities:
- Maintenance of multiple versions of the same code from a single
source file. This is very useful, if your MPI code is based on a
serial code that you want to convert.
- Automatic Relative Debugging which allows you to use
pre-existing code (for example the serial version of your program)
as a reference when checking the correctness of your MPI code.
- Simple Timing which is needed to determine bottlenecks
for parallelization, to optimize code, and to check its scaling
properties.
The HWT is based on libraries and script files. It is easy to use
and portable (written largely in Fortran). Fortran, C, C++, and any
mixture thereof are supported, as well as MPI and OpenMP for
parallelism. Documentation of the HWT
is available. The package is installed on the Sunfire cluster in
/usr/local/hwt.
It doesn't
work. Where can I get help?
Most of the Sun specific issues addressed in this FAQ are
documented
at http://docs.sun.com. The
search engine provides a reliable means to find specific
documents. You can also call or send email
to one of our
support staff, who include several scientific
programmers. Keep in mind that we support many people at any
given time, so we cannot do the coding for you. But we can do
my best to help you get your code ready for multi-processor
machines.
|