Lecture 7 - Message Passing
- Announcements
- Homework/Lab 4
- What about distributed memory?
- Message Passing
- MPI
- CS Colloquium this week: Computer Science faculty talk about
their research.
- Another summer opportunity: work with Mary on the lab this
summer.
What is Jacobi iteration?
So far we have seen three ways to create a parallel program:
- Let the compiler do whatever it can completely automatically
- Create threads explicitly using pthreads
- Specify parallel sections using OpenMP
These all suffer from one significant limitation - the cooperating
threads must be able to communicate through shared variables.
How can we design and run parallel programs to work when there is no
shared memory available?
Message Passing - processors have local memory, communicate by
sending and receiving messages, no direct access to off-processor
memory
- Characteristics:
- Locality - each processor accesses only its local memory
- Explicit parallelism - messages are sent and received
explicitly - programmer controls all parallelism. The compiler
doesn't do it.
- Cooperation - every send must have a matching receive in
order for the communication to take place. Beware of
deadlock! One-sided communication is possible, but doesn't really
fit the pure message-passing model.
- Advantages:
- Hardware - many current clusters of workstations and
supercomputers fit well into the message passing model.
- Functionality - full access to parallelism. Don't have
to rely on a compiler. Programmer can decide when parallelism makes
sense. Also a disadvantage - burden on the programmer! If the
compiler can do it for you, let it!
- Performance - data locality is important - especially in
a multi-level memory hierarchy which includes off-processor data.
Superlinear speedup - cache effects. Communication is often MUCH
more expensive than computation.
Unix programs can use fork() to create new processes. CS 432
veterans remember this from their Cow Shell projects.
The Unix system call fork() duplicates a process. The child is
a copy of the parent - in execution at the same point, the statement
after the return from fork().
The return value tells you whether you are the child or the parent:
0 is returned to the child, a positive value (the child's process ID) is returned to the parent, and -1 means the fork failed (process limit reached, permission denied).
Example C program:
pid_t pid = fork();
if (pid > 0) {
  /* parent: pid is the child's process ID */
  /* ... parent stuff ... */
}
else if (pid == 0) {
  /* child: fork() returned 0 to us */
  /* ... child stuff ... */
}
else {
  perror("fork");  /* fork() failed */
}
A more complete program that uses fork() along with three other
system calls (wait(), getpid(), and getppid()) is
here:
forking.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
Processes created using fork() do not share context, and must
allocate shared memory explicitly, or rely on a form of message
passing to communicate.
We will not consider the Unix fork() method of multiprocessing
much, but here's an example to show you that in fact, even global
variables are not shared by processes that are created with fork():
what_shared.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
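The idea, roughly, as a minimal sketch (not the handout itself - the handout is more complete):
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;   /* a global variable */

int main() {
  pid_t pid = fork();
  if (pid == 0) {
    counter = 100;   /* the child modifies its own copy */
    printf("child sees counter = %d\n", counter);
    exit(0);
  }
  wait(NULL);   /* parent waits for the child to finish */
  printf("parent sees counter = %d\n", counter);   /* still 0 */
  return 0;
}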
Remember that the advantage of using processes such as these instead
of threads is that the processes could potentially be running on
different systems. But if they are going to cooperate, they will need
to communicate:
- Two processes on the same system can communicate through a named
or an unnamed pipe.
- Two processes on different systems must communicate across a
network - most commonly this is done using sockets.
Sockets and pipes provide only a very rudimentary interprocess
communication. Each "message" sent through a pipe or across a
socket has a unique sender and unique receiver and is really nothing
more than a stream of bytes. The sender and receiver must add any
structure to these communications.
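As a taste of how low-level this is, here is a minimal pipe between a parent and a child (a hypothetical sketch, not one of the course handouts) - nothing but raw bytes goes across:
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int fd[2];
  char buf[32];
  pipe(fd);   /* fd[0] is the read end, fd[1] is the write end */
  if (fork() == 0) {
    close(fd[0]);
    write(fd[1], "hello", 6);   /* child sends 6 raw bytes (including '\0') */
    close(fd[1]);
    return 0;
  }
  close(fd[1]);
  read(fd[0], buf, sizeof(buf));   /* parent reads whatever bytes show up */
  printf("parent got: %s\n", buf);
  close(fd[0]);
  wait(NULL);
  return 0;
}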
Here's a very simplistic example of two processes that can communicate
over raw sockets. It is included mainly to show you that you don't
want to be doing this if you can help it:
socket.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
For many applications, this primitive interface is unreasonable. We
want something at a higher level. Message passing libraries have
evolved to meet this need.
- Message passing is done through a set of library routines.
Don't want to deal with the hardware directly - no writing to DMA,
making TCP/IP calls, or even creating sockets by hand.
- Examples: P4, PVM, MPL, MPI, MPI-2, etc etc etc
- Common Characteristics:
- Process Management - start and stop processes, query
number of procs or PID.
- Point-to-Point Communications - send/receive between
processes
- Collective Communication - broadcast, scatter, gather,
synchronization
- Terminology:
- Buffering - copy into a buffer somewhere (in library, hardware)
- Blocking communication - wait for some "event" to complete a
communication routine
- Nonblocking communication - "post" the operation and return immediately; completion must be checked later.
- Synchronous communication - special case of blocking - send
does not return until corresponding receive completes.
- Asynchronous communication - often used interchangeably with nonblocking; the communication proceeds independently of the caller
Basic send and receive operations
P0: send(addr,len,dest,tag)
P1: receive(addr,max_len,src,tag,rec_len)
These are basic components in any message-passing implementation.
There may be others introduced by a specific library.
- addr is the address of the send/receive buffer
- len is the length of the sent message
- max_len is the size of the receive buffer (to avoid overflow)
- rec_len is the length of the message actually received
- dest identifies destination of a message being sent
- src identifies desired sender of a message being received (or where it actually came from if "any source" is specified)
- tag is a user-defined identifier that restricts receipt to matching messages
Simple semantics:
- send completes when send buffers are ready for reuse,
after message received or at least copied into system buffers
- receive completes when the receive buffer's value is ready
to use
But beware of deadlock when using blocking routines!!
Proc 0 Proc 1
-------------------------------------
bsend(to 1) bsend(to 0)
brecv(from 1) brecv(from 0)
If both processors' send buffers cannot be copied into system buffers,
or if the calls are strictly synchronous, the calls will block until
the corresponding receive call is made... Neither can proceed...deadlock...
Possible solutions - reorder to guarantee matching send/receive pairs, or use nonblocking routines...
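Here is a sketch of the reordering fix, written with the MPI calls we will meet below and assuming exactly two processes; rank 0 sends first while rank 1 receives first, so the send/receive pairs always match:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, other, val, recvd;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  other = 1 - rank;   /* the other process: assumes exactly ranks 0 and 1 */
  val = rank + 100;
  if (rank == 0) {
    /* rank 0 sends first, then receives */
    MPI_Send(&val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&recvd, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
  } else {
    /* rank 1 receives first, then sends - the pairs match, so no deadlock */
    MPI_Recv(&recvd, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
  }
  printf("rank %d received %d\n", rank, recvd);
  MPI_Finalize();
  return 0;
}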
- send or receive calls return immediately - but how
do we know when it's done? When can we use the value?
- can overlap computation and communication
- must call a wait routine to ensure communication has
completed before destroying send buffer or using receive buffer
Example:
Proc 0 Proc 1
-------------------------------------
nbsend(to 1) nbsend(to 0)
nbrecv(from 1) nbrecv(from 0)
compute...... compute......
waitall waitall
use result use result
Note the potential overlap of computation and communication
Deadlock less likely but still must be careful - burden of avoiding
deadlock is on the programmer in a message passing model of parallel
computation
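In MPI terms (the calls are introduced below), the pattern above looks roughly like this sketch, again assuming exactly two processes; the values and the placeholder computation are made up for illustration:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, other, outval, inval;
  MPI_Request reqs[2];
  MPI_Status statuses[2];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  other = 1 - rank;   /* assumes exactly two processes */
  outval = rank * 100;
  /* post both operations - these calls return immediately */
  MPI_Isend(&outval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(&inval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
  /* compute...... (anything that does not touch outval or inval) */
  MPI_Waitall(2, reqs, statuses);
  /* use result - only now is inval safe to read (and outval safe to reuse) */
  printf("rank %d received %d\n", rank, inval);
  MPI_Finalize();
  return 0;
}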
- communication among a group of processors
- group can be all or a subset of processors in a computation
- collective routines are blocking
- types of collective operations
- synchronization/barrier - wait until all processors have
reached a given point
- data movement - broadcast (i.e. error condition,
distribute read-in values), scatter/gather (exchange boundary on a
finite element problem, for example), all-to-all (extreme case of
scatter/gather)
- reductions - collect data from all participating
processors and operate on it (i.e. add, multiply, min, max)
Using collective communication operators provided is generally better
than trying to do it yourself. The message passing library can often
do it more efficiently by taking advantage of low-level functionality.
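As a sketch of what this buys you: with the MPI calls introduced in the next section, a broadcast plus a sum reduction is one library call each, and the library picks the communication pattern. The values here are made up for illustration:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, n = 0, localval, total;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) n = 42;   /* e.g., a value read in by rank 0 */
  /* broadcast: every process now has n */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  localval = rank * n;   /* some per-process contribution */
  /* reduction: sum the contributions onto rank 0 */
  MPI_Reduce(&localval, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("total = %d\n", total);
  MPI_Finalize();
  return 0;
}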
- motivated by the lack of a good standard
- everyone had their own library
- PVM demonstrated that a portable library was feasible
- portability and efficiency were conflicting goals
- MPI-1 released in 1994, many implementations (free and
proprietary) have since become available
- provides C and Fortran interfaces (125 functions in the
standard), recently C++ as well.
- parallelism is explicit - the programmer must identify
parallelism and implement a parallel algorithm using MPI constructs
- MPI-2 has been finalized and implementations are coming
- Rank - unique identifier for a process
- values are 0...n-1 when n procs are used.
- specify source and destination of messages
- conditional execution
- Group - a set of processes, associated with a communicator.
- processes in a group can take part in a collective
communication, for example
- often use the predefined communicator specifying the group of
all processes in the computation: MPI_COMM_WORLD
- communicator ensures safe communication within a group - avoid
potential conflicts with other messages
- Application Buffer - application-space memory holding data to
be sent or data that has been received
- System Buffer - system space used to hold pending messages
You already saw and executed a simple MPI program back in the first
lab. Here's that code:
mpihello.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
Remember that you will need to run this with mpirun if you want
to get more than one process!
MPI calls and constructs in the simple program:
- #include <mpi.h> - the standard MPI header file
- MPI_Init(int *argc, char ***argv) - MPI initialization; pass pointers to main's argc and argv
- MPI_COMM_WORLD - the global communicator. Use for MPI_Comm args in most situations
- MPI_Abort(MPI_Comm comm, int rc) - MPI Abort function
- MPI_Comm_size(MPI_Comm comm, int *numprocs) - returns
the number of processes in a given communicator in numprocs
- MPI_Comm_rank(MPI_Comm comm, int *pid) - returns the
rank of the current processor in the given communicator
- MPI_Finalize() - clean up MPI
The model of parallelism is very different from what we have seen.
All of our processes exist for the life of the program. We are not
allowed to make MPI calls before MPI_Init() or after
MPI_Finalize(). We need to think in terms of a number of copies of
the same program all starting up at MPI_Init().
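Putting those calls together, the overall structure is roughly the following sketch (not necessarily identical to the mpihello handout):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int numprocs, rank;
  MPI_Init(&argc, &argv);   /* every copy of the program starts here */
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Hello from process %d of %d\n", rank, numprocs);
  MPI_Finalize();   /* no MPI calls are allowed after this */
  return 0;
}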
- MPI_Send/MPI_Recv - standard blocking calls (may have
system buffer)
- MPI_Isend/MPI_Irecv - standard nonblocking calls
- MPI_Ssend/MPI_Issend - synchronous blocking/nonblocking
send
- MPI_Bsend/MPI_Ibsend - buffered blocking/nonblocking
send - programmer allocates message buffer with MPI_Buffer_attach
- MPI_Rsend/MPI_Irsend - ready mode send - matching
receive must have been posted previously
- MPI_Sendrecv - combines a send and a receive into a single
blocking call, avoiding the ordering problems described above
- wait calls for nonblocking communications: MPI_Wait,
MPI_Waitall, MPI_Waitsome, MPI_Waitany
- also: MPI_Probe and MPI_Test calls
A simple MPI program that sends a single message:
mpimsg.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
- MPI_Status status - structure which contains additional info
following a receive
- MPI_Send(void *buf,int count,MPI_Datatype type,int
dest,int tag,MPI_Comm comm) - blocking send - does not return until
the send buffer is safe to reuse (the message may have been copied
into a system buffer; the matching receive need not have completed).
Sends count copies of data of type type located in buf to the
process with rank dest
- MPI_Recv(void *buf,int count,MPI_Datatype type,int
src,int tag,MPI_Comm comm,MPI_Status *status) - blocking receive - does
not return until the message has been received. src may be a
specific rank or MPI_ANY_SOURCE, which matches any sender.
- MPI_Datatype examples: MPI_CHAR, MPI_INT,
MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED
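A bare-bones use of these two calls, in the spirit of the mpimsg handout but not taken from it - rank 0 sends one int to rank 1:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    value = 17;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest 1, tag 0 */
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
  }
  MPI_Finalize();
  return 0;
}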
A slightly more interesting MPI program that sends one message from
each process:
mpiring.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
- MPI_Request request - structure which contains info
needed by nonblocking calls
- MPI_Isend(void *buf,int count,MPI_Datatype type,int
dest,int tag,MPI_Comm comm,MPI_Request *req) - nonblocking send -
returns immediately. buf cannot be modified until a wait
function is called using this req
- MPI_Irecv(void *buf,int count,MPI_Datatype type,int
source,int tag,MPI_Comm comm, MPI_Request *req) - nonblocking
receive - returns immediately. buf cannot be used until a wait
function is called using this req
- MPI_Wait(MPI_Request *req, MPI_Status *status) - wait
for completion of the operation that had req as its request argument.
Additional info, such as the actual source of a message received with MPI_ANY_SOURCE, is returned in status
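For example, a receive posted with MPI_ANY_SOURCE and completed with MPI_Wait might look like the following sketch (assuming exactly two processes; the tag and values are made up):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value;
  MPI_Request req;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* post the receive, then wait; value is not safe to use until the wait returns */
    MPI_Irecv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, &status);
    printf("rank 0 got %d from rank %d\n", value, status.MPI_SOURCE);
  } else {
    value = rank + 10;
    MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}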