Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007

Assignment 4: Jacobi Iteration with OpenMP
Due: 9:00 AM, Thursday, October 18, 2007

Readings

This week's readings are your introduction to message passing, but you will not do any real programming in the message passing paradigm until the next assignment.

Read Quinn Ch. 4.

What about distributed memory?

So far we have seen three ways to create a parallel program:

Let the compiler do whatever it can completely automatically
Create threads explicitly using pthreads
Specify parallel sections using OpenMP

These all suffer from one significant limitation - the cooperating threads must be able to communicate through shared variables.

How can we design and run parallel programs to work when there is no shared memory available?

Message Passing

We will now consider the message passing paradigm.

Characteristics:
- Locality - each processor accesses only its local memory
- Explicit parallelism - messages are sent and received explicitly - programmer controls all parallelism. The compiler doesn't do it.
- Cooperation - every send must have a matching receive in order for the communication to take place. Beware of deadlock! One sided communication is possible, but doesn't really fit the pure message-passing model.
Advantages:
- Hardware - many current clusters of workstations and supercomputers fit well into the message passing model.
- Functionality - full access to parallelism. We don't have to rely on a compiler. The programmer can decide when parallelism makes sense. But this is also a disadvantage - the full burden is on the programmer! Advice: If the compiler can do it for you, let it!
- Performance - data locality is important - especially in a multi-level memory hierarchy which includes off-processor data. There is a chance for superlinear speedup with added cache as we talked about earlier in the course. Communication is often MUCH more expensive than computation.

Cooperating Processes

Unix programs can use fork() to create new processes.

The Unix system call fork() duplicates a process. The child is a copy of the parent - in execution at the same point, the statement after the return from fork().

The return value indicates if you are child or parent.

0 is child, >0 means parent, -1 means failure (limit reached, permission denied)

Example C program:

pid=fork();
if (pid) {
  parent stuff;
}
else {
  child stuff;
}

A more complete program that uses fork() along with three other system calls (wait(), getpid(), and getppid()) is here:

See: /cluster/examples/forking

Processes created using fork() do not share context, and must allocate shared memory explicitly, or rely on a form of message passing to communicate.

We will not consider the Unix fork() method of multiprocessing much, but here's an example to show you that in fact, even global variables are not shared by processes that are created with fork():

See: /cluster/examples/what_shared

Remember that the advantage of using processes such as these instead of threads is that the processes could potentially be running on different systems. But if they are going to cooperate, they will need to communicate:

Two processes on the same system can communicate through a named or an unnamed pipe.
Two processes on different systems must communicate across a network - most commonly this is done using sockets.

Sockets and pipes provide only a very rudimentary interprocess communication. Each "message" sent through a pipe or across a socket has a unique sender and unique receiver and is really nothing more than a stream of bytes. The sender and receiver must add any structure to these communcations.

Here's a very simplistic example of two processes that can communicate over raw sockets. It is included mainly to show you that you don't want to be doing this if you can help it:

See: /cluster/examples/socket

For many applications, this primitive interface is unreasonable. We want something at a higher level. Message passing libraries have evolved to meet this need.

Message Passing Libraries

Message passing is supported through a set of library routines. This allows programmers to avoid dealing with the hardware directly. Programmers want to concentrate on the problem they're trying to solve, not worrying about writing to special memory buffers or making TCP/IP calls or even creating sockets.
Examples: P4, PVM, MPL, MPI, MPI-2, etc. MPI and PVM are the most common.
Common Characteristics:
- Process Management - start and stop processes, query number of procs or PID.
- Point-to-Point Communications - send/receive between processes
- Collective Communication - broadcast, scatter, gather, synchronization
Terminology:
- Buffering - copy into a buffer somewhere (in library, hardware)
- Blocking communication - wait for some "event" to complete a communication routine
- Nonblocking communication - "post" a message and return immediately
- Synchronous communication - special case of blocking - send does not return until corresponding receive completes
- Asynchronous communication - pretty much nonblocking

Point-to-Point Communication

All message passing is based on the simple send and receive operations

   P0: send(addr,len,dest,tag)

   P1: receive(addr,max_len,src,tag,rec_len)

These are basic components in any message-passing implementation. There may be others introduced by a specific library.

addr is the address of the send/receive buffer
len is the length of the sent message
max_len is the size of the receive buffer (to avoid overflow)
rec_len is the length of the message actually received
dest identifies destination of a message being sent
src identifies desired sender of a message being received (or where it actually came from if "any source" is specified)
tag a user-defined identifier restricting receipt

Point-to-Point Communication - Blocking

Blocking communication has simple semantics:

send completes when send buffers are ready for reuse, after message received or at least copied into system buffers
receive completes when the receive buffer's value is ready to use

But beware of deadlock when using blocking routines!!

              Proc 0          Proc 1
        -------------------------------------
            bsend(to 1)    bsend(to 0)
           brecv(from 1)  brecv(from 0)

If both processors' send buffers cannot be copied into system buffers, or if the calls are strictly synchronous, the calls will block until the corresponding receive call is made... Neither can proceed... deadlock...

Possible solutions - reorder to guarantee matching send/receive pairs, or use nonblocking routines...

Point-to-Point Communication - Nonblocking

send or receive calls return immediately - but how do we know when it's done? When can we use the value?
can overlap computation and communication
must call a wait routine to ensure communication has completed before destroying send buffer or using receive buffer

Example:

              Proc 0          Proc 1
        -------------------------------------
            nbsend(to 1)    nbsend(to 0)
           nbrecv(from 1)  nbrecv(from 0)
           compute......   compute......
             waitall         waitall
            use result      use result

During the "compute......" phase, it's possible that the communication can be completed "in the background" while the computation proceeds, so when the "waitall" lines are reached, the program can just continue.

Deadlock is less likely but we still must be careful - the burden of avoiding deadlock is on the programmer in a message passing model of parallel computation.

Collective Communication

Some common operations don't fit well into a point-to-point communication scheme. Here, we may use collective communication routines.

collective communication occurs among a group of processors
the group can be all or a subset of the processors in a computation
collective routines are blocking
types of collective operations
- synchronization/barrier - wait until all processors have reached a given point
- data movement - broadcast (i.e. error condition, distribute read-in values), scatter/gather (exchange boundary on a finite element problem, for example), all-to-all (extreme case of scatter/gather)
- reductions - collect data from all participating processors and operate on it (i.e. add, multiply, min, max)

These kinds of operations can be achieved through a series of point-to-point communication steps, but operators are often provided.

Using collective communication operators provided is generally better than trying to do it yourself. In addition to providing convenience, the message passing library can often perform the operations more efficiently by taking advantage of low-level functionality.

Message Passing Interface (MPI)

The Message Passing Interface (MPI) was created by a standards committee in the early 1990's.

motivated by the lack of a good standard
- everyone had their own library
- PVM demonstrated that a portable library was feasible
- portablity and efficiency were conflicting goals
The MPI-1 standard was released in 1994, and many implementations (free and proprietary) have since become available
MPI specifies C and Fortran interfaces (125 functions in the standard), more recently C++ as well
parallelism is explicit - the programmer must identify parallelism and implement a parallel algorithm using MPI constructs
MPI-2 is an extention to the standard developed later in the 1990's and there are now some implementations

MPI Terminology

Rank - a unique identifier for a process
- values are 0...n-1 when n processes are used
- specify source and destination of messages
- used to control conditional execution
Group - a set of processes, associated with a communicator
- processes in a group can take part in a collective communication, for example
- we often use the predefined communicator specifying the group of all processors in a communication: MPI_COMM_WORLD
- the communicator ensures safe communication within a group - avoid potential conflicts with other messages
Application Buffer - application space containing data to send or received data
System Buffer - system space used to hold pending messages

MPI Simple Program - basic MPI functions

You already saw and exectuted a simple MPI "Hello, World" program back in the first week.

Remember that you will need to run this with mpirun (on bullpen) or mpiexec (on dhanni) if you want to get more than one process!

MPI calls and constructs in the "Hello, World" program:

#include <mpi.h> - the standard MPI header file
MPI_Init(int *argc, char *argv[]) - MPI Initialization
MPI_COMM_WORLD - the global communicator. Use for MPI_Comm args in most situations
MPI_Abort(MPI_Comm comm, int rc) - MPI Abort function
MPI_Comm_size(MPI_Comm comm, int *numprocs) - returns the number of processes in a given communicator in numprocs
MPI_Comm_rank(MPI_Comm comm, int *pid) - returns the rank of the current process in the given communicator
MPI_Get_processor_name(char *name, int *rc) - returns the name of the node on which the current process is running
MPI_Finalize() - clean up MPI

The model of parallelism is very different from what we have seen. All of our processes exist for the life of the program. We are not allowed to do anything before MPI_Init() or after MPI_Finalize(). We need to think in terms of a number of copies of the same program all starting up at MPI_Init().

MPI - Point-to-Point message types

MPI_Send/MPI_Recv - standard blocking calls (may have system buffer)
MPI_Isend/MPI_Irecv - standard nonblocking calls
MPI_Ssend/MPI_Issend - synchronous blocking/nonblocking send
MPI_Bsend/MPI_Ibsend - buffered blocking/nonblocking send - programmer allocates message buffer with MPI_Buffer_attach
MPI_Rsend/MPI_Irsend - ready mode send - matching receive must have been posted previously
MPI_Sendrecv - combine send/recv into one call before blocking
wait calls for nonblocking communications: MPI_Wait, MPI_Waitall, MPI_Waitsome, MPI_Waitany
also: MPI_Probe and MPI_Test calls

MPI One Message Program - Blocking Message functions

A simple MPI program that sends a single message:

See: /cluster/examples/mpimsg

MPI_Status status - structure which contains additional info following a receive
MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm) - blocking send - does not return until the corresponding receive is completed. sends count copies of data of type type located in buf to the processor with pid dest
MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status status) - blocking receive - does not return until the message has been received. src may be specific PID or MPI_ANY_SOURCE which matches any.
MPI_Datatype examples: MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED

MPI Ring Message Program - Nonblocking Message functions

A slightly more interesting MPI program that sends one message from each process:

See: /cluster/examples/mpiring

MPI_Request request - structure which contains info needed by nonblocking calls
MPI_Isend(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request *req) - nonblocking send - returns immediately. buf must not be modified until a wait function is called using this req
MPI_Irecv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Request *req) - nonblocking receive - returns immediately. buf must not be used until a wait function is called using this req
MPI_Wait(MPI_Request *req, MPI_Status *status) - wait for completion of message which had req as its request argument. additional info such as source of a message received as MPI_ANY_SOURCE is contained in status

MPI - Collective Communication functions

See: /cluster/examples/mpicoll

MPI_Barrier(MPI_Comm comm) - synchronize procs
MPI_Bcast(void *buf,int count,MPI_Datatype type,int root,MPI_Comm comm) - broadcast - sends count copies of data of type type located in buf on proc root to buf on all others.
MPI_Reduce(void *sendbuf,void *recvbuf,int count, MPI_Datatype type,MPI_Op op,int root,MPI_Comm comm) - combines data in sendbuf on each proc using operation op and stores the result in recvbuf on proc root
MPI_Allreduce() - same as reduce except result is stored in recvbuf on all procs
MPI_Op values - MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC plus user-defined
MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm) - parallel prefix scan operations

MPI - Scatter/Gather

See: /cluster/examples/mpiscatgath

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) - root sends sendcount items from sendbuf to each processor. Each processor receives recvcount items into recvbuf
MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) - each proc sends sendcount items from sendbuf to root. root receives recvcount items into recvbuf from each proc
MPI_Scatterv/MPI_Gatherv work with variable-sized chunks of data
MPI_Allgather/MPI_Alltoall variations of scatter/gather

To understand what is going on with the various broadcast and scatter/gather functions, consider this figure, taken from the MPI Standard, p.91

Sample Applications

Conway's Game of Life

The Game of Life was invented by John Conway in 1970. The game is played on a field of cells, each of which has eigh neighbors (adjacent cells). A cell is either occupied (by an organism) or not. The rules for deriving a generation from the previous one are:

Death: If an occupied cell has 0, 1, 4, 5, 6, 7, or 8 occupied neighbors, it dies (of either boredom or overcrowding, as the case may be)
Survival: If an occupied cell has 2 or 3 occupied neighbors, it survives to the next generation
Birth: If an unoccupied cell has 3 occupied neighbors, it becomes occupied.

The game is very interesting in that complex patterns and cycles arise. Do a google search to find plenty of Java applets you can try out.

Serial version: See: /cluster/examples/life

MPI version: See: /cluster/examples/mpilife

Computing the Mandelbrot Set

Some Mandelbrot Set information and pictures can be found here.

Java Threads version: See: /cluster/examples/mset_java

MPI version: See: /cluster/examples/mset_mpi

Improved MPI version: See: /cluster/examples/mset_mpi_better

Matrix-Matrix Multiplication

Matrix-matrix multiplication using message passing is not as straightforward as matrix-matrix multiplication using shared memory and threads. Why?

Since our memory is not shared, which processes have copies of the matrices?
Where does the data start out? Where do we want the answer to be in the end?
How much data do we replicate?
What are appropriate MPI calls to make all this happen?

Think about this example and we will discuss it in our meetings and will consider it further in the next assignment's readings.

Lab Tasks

There are several files to turn in for this assignment. They should all be submitted with the turnin utility as assignment4.

Write a C or C++ program using OpenMP that solves Laplace's equation on a two-dimensional, uniform, square grid, using Jacobi iteration. Don't worry if none of those terms make any sense - this document tells you what little you need to know about the math and physics.

Some background

Laplace's equation is an elliptic partial differential equation that governs physical phenomena such as heat. In two dimensions, it can be written

Phi_xx + Phi_yy = 0.

Given a spatial region and values for points on the boundaries of the region, the goal is to approximate the steady-state solution for points in the interior. We do this by covering the region with an evenly-spaced grid of points. A grid of 8 ×8 would look like this:

* * * * * * * * * *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* * * * * * * * * *

The 8 ×8 grid represented by the dots is surrounded by a layer of boundary points, represented by *'s. Each interior point is initialized to some value. Boundary points remain constant throughout the simulation. The steady-state values of interior points are calculated by repeated iterations. On each iteration, the new value of a point is set to a combination of the old values of neighboring points. The computation terminates either after a given number of iterations or when every new value is within some acceptable difference eps of every old value.

There are several iterative methods for solving Laplace's equation. Your program is to use Jacobi iteration, which is the simplest and easily parallelizable, though certainly not the most efficient in terms of convergence rate.

In Jacobi iteration, the new value for each grid point in the interior is set to the average of the old values of the four points left, right, above, and below it. This process is repeated until the program terminates. Note that some of the values used for the average will be boundary points.

What to do and how to do it

To avoid special cases at the boundaries, you should allocate your grid for an n ×n simulation to be (n+2) ×(n+2), so you can use row and column 0 and row and column n+1 to store the boundary values.
You will need two copies of the grid, one to store the "current" solution values and one to store the "next" solution values. Do not copy the "next" solution to the "current" solution at each step. Instead, always do two iterations. The first uses one grid as "current" and the other as "next," then the second swaps their roles. This is an example of a simple loop unrolling.
Each grid cell computation looks something like this:
```
  nextgrid[i][j] = (grid[i-1][j] + grid[i+1][j] +
                    grid[i][j-1] + grid[i][j+1]) * 0.25;
```
Doing * 0.25 is usually faster than / 4 since multiplication is an easier operation for a computer than division.
After each pair of iterations, the corresponding values in each cell of the two grids are compared, and the maximum such value is computed. If this value is less than eps, the computation terminates. Otherwise, it continues.
A maximum number of iterations is also specified, after which the computation stops even if the error tolerance has not been reached.
Your program should take two command-line arguments: the maximum number of iterations and the error tolerance eps.
Initialization of the boundary and interior will determine the exact problem being solved. I suggest initializing your interior to all 0's and the boundary to have two sides set to 1, two sides set to 0, as follows:
```
1 1 1 1 1 1 1 1 1 1
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 . . . . . . . . 0
1 0 0 0 0 0 0 0 0 0
```
This keeps the initialization simple, but still allows for some work during the solution phase.
At the end of the computation, print out the total number of iterations needed, and the time taken to achieve the solution.
Get the serial version working first. Choose values for the size of the grid, maximum number of iterations, and eps so that your program will run for a few minutes.
Parallelize your program. Your parallel program should create and destroy the OpenMP threads only once.
You may break down the computation any way you'd like. Justify your choice in the README file that you include in your submission.
You will need a number of OpenMP directives. Explain each directive (what you needed to do and why you chose that particular directive) in detailed comments in your source code.
Run your program using 1, 2, and 4 threads on the four-processor Solaris nodes of the cluster (wetteland or rivera). Include the timings and your analysis of those timings in the README file that you include with your submission.
Extend your program to include extra features, such as support for more interesting initial conditions and boundary conditions. Describe these features in your README file.

Your submission should include an appropriate Makefile, your C or C++ source code (including the timer code from class, if you choose to use it), a brief README file expaining how to run your program and with your timings and analysis. Please do not include object files or your executable.

Grading guidelines: Your grade for the programming will be

determined by correctness, design, documentation, and style, as well as the presentation of your timing results.

Computer Science 341-02 Parallel Processing

Mount Holyoke College Fall 2007

Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007