Lecture 7 - Message Passing
- Announcements
- Homework/Lab 4
- What about distributed memory?
- Message Passing
- MPI
- CS Colloquium this week: Computer Science faculty talk about
their research.
- Another summer opportunity: work with Mary on the lab this
summer.
What is Jacobi iteration?
So far we have seen three ways to create a parallel program:
- Let the compiler do whatever it can completely automatically
- Create threads explicitly using pthreads
- Specify parallel sections using OpenMP
These all suffer from one significant limitation - the cooperating
threads must be able to communicate through shared variables.
How can we design and run parallel programs to work when there is no
shared memory available?
Message Passing - processors have local memory, communicate by
sending and receiving messages, no direct access to off-processor
memory
- Characteristics:
- Locality - each processor accesses only its local memory
- Explicit parallelism - messages are sent and received
explicitly - programmer controls all parallelism. The compiler
doesn't do it.
- Cooperation - every send must have a matching receive in
order for the communication to take place. Beware of
deadlock! One-sided communication is possible, but doesn't really
fit the pure message-passing model.
- Advantages:
- Hardware - many current clusters of workstations and
supercomputers fit well into the message passing model.
- Functionality - full access to parallelism. Don't have
to rely on a compiler. Programmer can decide when parallelism makes
sense. Also a disadvantage - burden on the programmer! If the
compiler can do it for you, let it!
- Performance - data locality is important - especially in
a multi-level memory hierarchy which includes off-processor data.
Superlinear speedup - cache effects. Communication is often MUCH
more expensive than computation.
Unix programs can use fork() to create new processes. CS 432
veterans remember this from their Cow Shell projects.
The Unix system call fork() duplicates a process. The child is
a copy of the parent - in execution at the same point, the statement
after the return from fork().
The return value tells you whether you are the child or the parent:
0 is returned to the child, a positive value (the child's process ID) is returned to the parent, and -1 means the fork failed (process limit reached, permission denied).
Example C program:
pid_t pid = fork();
if (pid > 0) {
  /* parent: pid is the child's process ID */
  /* ... parent stuff ... */
}
else if (pid == 0) {
  /* child: fork() returned 0 to us */
  /* ... child stuff ... */
}
else {
  perror("fork");  /* fork() failed */
}
A more complete program that uses fork() along with three other
system calls (wait(), getpid(), and getppid()) is
here:
forking.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
Processes created using fork() do not share context, and must
allocate shared memory explicitly, or rely on a form of message
passing to communicate.
We will not consider the Unix fork() method of multiprocessing
much, but here's an example to show you that in fact, even global
variables are not shared by processes that are created with fork():
what_shared.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
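The idea, roughly, as a minimal sketch (not the handout itself - the handout is more complete):
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;   /* a global variable */

int main() {
  pid_t pid = fork();
  if (pid == 0) {
    counter = 100;   /* the child modifies its own copy */
    printf("child sees counter = %d\n", counter);
    exit(0);
  }
  wait(NULL);   /* parent waits for the child to finish */
  printf("parent sees counter = %d\n", counter);   /* still 0 */
  return 0;
}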
Remember that the advantage of using processes such as these instead
of threads is that the processes could potentially be running on
different systems. But if they are going to cooperate, they will need
to communicate:
- Two processes on the same system can communicate through a named
or an unnamed pipe.
- Two processes on different systems must communicate across a
network - most commonly this is done using sockets.
Sockets and pipes provide only a very rudimentary interprocess
communication. Each "message" sent through a pipe or across a
socket has a unique sender and unique receiver and is really nothing
more than a stream of bytes. The sender and receiver must add any
structure to these communications.
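As a taste of how low-level this is, here is a minimal pipe between a parent and a child (a hypothetical sketch, not one of the course handouts) - nothing but raw bytes goes across:
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int fd[2];
  char buf[32];
  pipe(fd);   /* fd[0] is the read end, fd[1] is the write end */
  if (fork() == 0) {
    close(fd[0]);
    write(fd[1], "hello", 6);   /* child sends 6 raw bytes (including '\0') */
    close(fd[1]);
    return 0;
  }
  close(fd[1]);
  read(fd[0], buf, sizeof(buf));   /* parent reads whatever bytes show up */
  printf("parent got: %s\n", buf);
  close(fd[0]);
  wait(NULL);
  return 0;
}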
Here's a very simplistic example of two processes that can communicate
over raw sockets. It is included mainly to show you that you don't
want to be doing this if you can help it:
socket.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
For many applications, this primitive interface is unreasonable. We
want something at a higher level. Message passing libraries have
evolved to meet this need.
- Message passing is done through a set of library routines.
Don't want to deal with the hardware directly - no writing to DMA,
making TCP/IP calls, or even creating sockets by hand.
- Examples: P4, PVM, MPL, MPI, MPI-2, etc etc etc
- Common Characteristics:
- Process Management - start and stop processes, query
number of procs or PID.
- Point-to-Point Communications - send/receive between
processes
- Collective Communication - broadcast, scatter, gather,
synchronization
- Terminology:
- Buffering - copy into a buffer somewhere (in library, hardware)
- Blocking communication - wait for some "event" to complete a
communication routine
- Nonblocking communication - "post" the operation and return immediately; completion must be checked later.
- Synchronous communication - special case of blocking - send
does not return until corresponding receive completes.
- Asynchronous communication - often used interchangeably with nonblocking; the communication proceeds independently of the caller
Basic send and receive operations
P0: send(addr,len,dest,tag)
P1: receive(addr,max_len,src,tag,rec_len)
These are basic components in any message-passing implementation.
There may be others introduced by a specific library.
- addr is the address of the send/receive buffer
- len is the length of the sent message
- max_len is the size of the receive buffer (to avoid overflow)
- rec_len is the length of the message actually received
- dest identifies destination of a message being sent
- src identifies desired sender of a message being received (or where it actually came from if "any source" is specified)
- tag is a user-defined identifier that restricts receipt to matching messages
Simple semantics:
- send completes when send buffers are ready for reuse,
after message received or at least copied into system buffers
- receive completes when the receive buffer's value is ready
to use
But beware of deadlock when using blocking routines!!
Proc 0 Proc 1
-------------------------------------
bsend(to 1) bsend(to 0)
brecv(from 1) brecv(from 0)
If both processors' send buffers cannot be copied into system buffers,
or if the calls are strictly synchronous, the calls will block until
the corresponding receive call is made... Neither can proceed...deadlock...
Possible solutions - reorder to guarantee matching send/receive pairs, or use nonblocking routines...
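Here is a sketch of the reordering fix, written with the MPI calls we will meet below and assuming exactly two processes; rank 0 sends first while rank 1 receives first, so the send/receive pairs always match:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, other, val, recvd;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  other = 1 - rank;   /* the other process: assumes exactly ranks 0 and 1 */
  val = rank + 100;
  if (rank == 0) {
    /* rank 0 sends first, then receives */
    MPI_Send(&val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&recvd, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
  } else {
    /* rank 1 receives first, then sends - the pairs match, so no deadlock */
    MPI_Recv(&recvd, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
  }
  printf("rank %d received %d\n", rank, recvd);
  MPI_Finalize();
  return 0;
}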
- send or receive calls return immediately - but how
do we know when it's done? When can we use the value?
- can overlap computation and communication
- must call a wait routine to ensure communication has
completed before destroying send buffer or using receive buffer
Example:
Proc 0 Proc 1
-------------------------------------
nbsend(to 1) nbsend(to 0)
nbrecv(from 1) nbrecv(from 0)
compute...... compute......
waitall waitall
use result use result
Note the potential overlap of computation and communication
Deadlock less likely but still must be careful - burden of avoiding
deadlock is on the programmer in a message passing model of parallel
computation
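In MPI terms (the calls are introduced below), the pattern above looks roughly like this sketch, again assuming exactly two processes; the values and the placeholder computation are made up for illustration:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, other, outval, inval;
  MPI_Request reqs[2];
  MPI_Status statuses[2];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  other = 1 - rank;   /* assumes exactly two processes */
  outval = rank * 100;
  /* post both operations - these calls return immediately */
  MPI_Isend(&outval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(&inval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
  /* compute...... (anything that does not touch outval or inval) */
  MPI_Waitall(2, reqs, statuses);
  /* use result - only now is inval safe to read (and outval safe to reuse) */
  printf("rank %d received %d\n", rank, inval);
  MPI_Finalize();
  return 0;
}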
- communication among a group of processors
- group can be all or a subset of processors in a computation
- collective routines are blocking
- types of collective operations
- synchronization/barrier - wait until all processors have
reached a given point
- data movement - broadcast (i.e. error condition,
distribute read-in values), scatter/gather (exchange boundary on a
finite element problem, for example), all-to-all (extreme case of
scatter/gather)
- reductions - collect data from all participating
processors and operate on it (i.e. add, multiply, min, max)
Using collective communication operators provided is generally better
than trying to do it yourself. The message passing library can often
do it more efficiently by taking advantage of low-level functionality.
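As a sketch of what this buys you: with the MPI calls introduced in the next section, a broadcast plus a sum reduction is one library call each, and the library picks the communication pattern. The values here are made up for illustration:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, n = 0, localval, total;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) n = 42;   /* e.g., a value read in by rank 0 */
  /* broadcast: every process now has n */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  localval = rank * n;   /* some per-process contribution */
  /* reduction: sum the contributions onto rank 0 */
  MPI_Reduce(&localval, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("total = %d\n", total);
  MPI_Finalize();
  return 0;
}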
- motivated by the lack of a good standard
- everyone had their own library
- PVM demonstrated that a portable library was feasible
- portability and efficiency were conflicting goals
- MPI-1 released in 1994, many implementations (free and
proprietary) have since become available
- provides C and Fortran interfaces (125 functions in the
standard), recently C++ as well.
- parallelism is explicit - the programmer must identify
parallelism and implement a parallel algorithm using MPI constructs
- MPI-2 has been finalized and implementations are coming
- Rank - unique identifier for a process
- values are 0...n-1 when n procs are used.
- specify source and destination of messages
- conditional execution
- Group - a set of processes, associated with a communicator.
- processes in a group can take part in a collective
communication, for example
- often use the predefined communicator specifying the group of
all processes in the computation: MPI_COMM_WORLD
- communicator ensures safe communication within a group - avoid
potential conflicts with other messages
- Application Buffer - application-space memory holding data to
be sent or data that has been received
- System Buffer - system space used to hold pending messages
You already saw and executed a simple MPI program back in the first
lab. Here's that code:
mpihello.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
Remember that you will need to run this with mpirun if you want
to get more than one process!
MPI calls and constructs in the simple program:
- #include <mpi.h> - the standard MPI header file
- MPI_Init(int *argc, char ***argv) - MPI initialization; pass pointers to main's argc and argv
- MPI_COMM_WORLD - the global communicator. Use for MPI_Comm args in most situations
- MPI_Abort(MPI_Comm comm, int rc) - MPI Abort function
- MPI_Comm_size(MPI_Comm comm, int *numprocs) - returns
the number of processes in a given communicator in numprocs
- MPI_Comm_rank(MPI_Comm comm, int *pid) - returns the
rank of the current processor in the given communicator
- MPI_Finalize() - clean up MPI
The model of parallelism is very different from what we have seen.
All of our processes exist for the life of the program. We are not
allowed to make MPI calls before MPI_Init() or after
MPI_Finalize(). We need to think in terms of a number of copies of
the same program all starting up at MPI_Init().
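Putting those calls together, the overall structure is roughly the following sketch (not necessarily identical to the mpihello handout):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int numprocs, rank;
  MPI_Init(&argc, &argv);   /* every copy of the program starts here */
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Hello from process %d of %d\n", rank, numprocs);
  MPI_Finalize();   /* no MPI calls are allowed after this */
  return 0;
}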
- MPI_Send/MPI_Recv - standard blocking calls (may have
system buffer)
- MPI_Isend/MPI_Irecv - standard nonblocking calls
- MPI_Ssend/MPI_Issend - synchronous blocking/nonblocking
send
- MPI_Bsend/MPI_Ibsend - buffered blocking/nonblocking
send - programmer allocates message buffer with MPI_Buffer_attach
- MPI_Rsend/MPI_Irsend - ready mode send - matching
receive must have been posted previously
- MPI_Sendrecv - combines a send and a receive into a single
blocking call, avoiding the ordering problems described above
- wait calls for nonblocking communications: MPI_Wait,
MPI_Waitall, MPI_Waitsome, MPI_Waitany
- also: MPI_Probe and MPI_Test calls
A simple MPI program that sends a single message:
mpimsg.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
- MPI_Status status - structure which contains additional info
following a receive
- MPI_Send(void *buf,int count,MPI_Datatype type,int
dest,int tag,MPI_Comm comm) - blocking send - does not return until
the send buffer is safe to reuse (the message may have been copied
into a system buffer; the matching receive need not have completed).
Sends count copies of data of type type located in buf to the
process with rank dest
- MPI_Recv(void *buf,int count,MPI_Datatype type,int
src,int tag,MPI_Comm comm,MPI_Status *status) - blocking receive - does
not return until the message has been received. src may be a
specific rank or MPI_ANY_SOURCE, which matches any sender.
- MPI_Datatype examples: MPI_CHAR, MPI_INT,
MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED
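A bare-bones use of these two calls, in the spirit of the mpimsg handout but not taken from it - rank 0 sends one int to rank 1:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    value = 17;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest 1, tag 0 */
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
  }
  MPI_Finalize();
  return 0;
}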
A slightly more interesting MPI program that sends one message from
each process:
mpiring.tgz Also available in /home/faculty/terescoj/shared/cs338/lect07.
- MPI_Request request - structure which contains info
needed by nonblocking calls
- MPI_Isend(void *buf,int count,MPI_Datatype type,int
dest,int tag,MPI_Comm comm,MPI_Request *req) - nonblocking send -
returns immediately. buf cannot be modified until a wait
function is called using this req
- MPI_Irecv(void *buf,int count,MPI_Datatype type,int
source,int tag,MPI_Comm comm, MPI_Request *req) - nonblocking
receive - returns immediately. buf cannot be used until a wait
function is called using this req
- MPI_Wait(MPI_Request *req, MPI_Status *status) - wait
for completion of the operation that had req as its request argument.
Additional info, such as the actual source of a message received with MPI_ANY_SOURCE, is returned in status
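For example, a receive posted with MPI_ANY_SOURCE and completed with MPI_Wait might look like the following sketch (assuming exactly two processes; the tag and values are made up):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value;
  MPI_Request req;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* post the receive, then wait; value is not safe to use until the wait returns */
    MPI_Irecv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, &status);
    printf("rank 0 got %d from rank %d\n", value, status.MPI_SOURCE);
  } else {
    value = rank + 10;
    MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}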