Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007


Assignment 4: Jacobi Iteration with OpenMP
Due: 9:00 AM, Thursday, October 18, 2007


Readings

This week's readings are your introduction to message passing, but you will not do any real programming in the message passing paradigm until the next assignment.

Read Quinn Ch. 4.

What about distributed memory?

So far we have seen three ways to create a parallel program:

  1. Let the compiler do whatever it can completely automatically
  2. Create threads explicitly using pthreads
  3. Specify parallel sections using OpenMP

These all suffer from one significant limitation - the cooperating threads must be able to communicate through shared variables.

How can we design and run parallel programs to work when there is no shared memory available?

Message Passing

We will now consider the message passing paradigm.

Cooperating Processes

Unix programs can use fork() to create new processes.

The Unix system call fork() duplicates a process. The child is a copy of the parent, in execution at the same point: the statement after the return from fork().

The return value indicates whether you are the child or the parent:

0 means you are the child, a value greater than 0 means you are the parent (the value is the child's process ID), and -1 means the fork failed (process limit reached, permission denied).

Example C code fragment:

pid_t pid = fork();
if (pid < 0) {
  /* fork failed */
}
else if (pid > 0) {
  /* parent stuff; pid is the child's process ID */
}
else {
  /* child stuff */
}

A more complete program that uses fork() along with three other system calls (wait(), getpid(), and getppid()) is here:

See: /cluster/examples/forking
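
That file is not reproduced here, but a program using those four calls might be structured roughly as follows (this is a sketch, not the actual forking example):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();

  if (pid < 0) {                  /* fork failed */
    perror("fork");
    exit(1);
  }
  if (pid == 0) {                 /* child */
    printf("child: pid=%d, parent's pid=%d\n", (int) getpid(), (int) getppid());
    exit(0);
  }
  /* parent */
  wait(NULL);                     /* wait for the child to finish */
  printf("parent: pid=%d, child's pid was %d\n", (int) getpid(), (int) pid);
  return 0;
}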

Processes created using fork() do not share context; they must allocate shared memory explicitly or rely on some form of message passing to communicate.

We will not consider the Unix fork() method of multiprocessing much, but here's an example to show you that in fact, even global variables are not shared by processes that are created with fork():

See: /cluster/examples/what_shared
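
The idea behind that example can be sketched as follows (this is not the actual what_shared code): after fork(), each process has its own copy of the global, so a change made in the child is never seen by the parent.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;                  /* a global variable */

int main()
{
  if (fork() == 0) {
    counter = 100;                /* the child changes its own copy */
    printf("child: counter = %d\n", counter);
    return 0;
  }
  wait(NULL);                     /* let the child finish first */
  printf("parent: counter = %d\n", counter);   /* still prints 0 */
  return 0;
}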

Remember that the advantage of using processes such as these instead of threads is that the processes could potentially be running on different systems. But if they are going to cooperate, they will need to communicate:

Sockets and pipes provide only very rudimentary interprocess communication. Each "message" sent through a pipe or across a socket has a unique sender and a unique receiver and is really nothing more than a stream of bytes. The sender and receiver must impose any structure on these communications themselves.
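
As a small illustration of just how unstructured this is, the following sketch (a generic example, not one of the course files) pushes a few raw bytes from a child to its parent through a pipe:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  int fd[2];
  char buf[64];
  int n;

  pipe(fd);                       /* fd[0] is the read end, fd[1] the write end */
  if (fork() == 0) {
    close(fd[0]);
    write(fd[1], "hello", 5);     /* nothing but a stream of bytes */
    close(fd[1]);
    return 0;
  }
  close(fd[1]);
  n = read(fd[0], buf, sizeof(buf) - 1);
  buf[n] = '\0';
  printf("parent received: %s\n", buf);
  close(fd[0]);
  wait(NULL);
  return 0;
}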

Here's a very simplistic example of two processes that can communicate over raw sockets. It is included mainly to show you that you don't want to be doing this if you can help it:

See: /cluster/examples/socket

For many applications, this primitive interface is unreasonable. We want something at a higher level. Message passing libraries have evolved to meet this need.

Message Passing Libraries

Point-to-Point Communication

All message passing is based on the simple send and receive operations:

   P0: send(addr,len,dest,tag)

   P1: receive(addr,max_len,src,tag,rec_len)

These are basic components in any message-passing implementation. There may be others introduced by a specific library.

Point-to-Point Communication - Blocking

Blocking communication has simple semantics: a blocking send does not return until the message buffer can safely be reused, and a blocking receive does not return until the requested message has arrived.

But beware of deadlock when using blocking routines!!
              Proc 0          Proc 1
        -------------------------------------
            bsend(to 1)    bsend(to 0)
           brecv(from 1)  brecv(from 0)

If the processes' send buffers cannot be copied into system buffers, or if the calls are strictly synchronous, each send blocks until the corresponding receive call is made... Neither process can proceed... deadlock...

Possible solutions - reorder to guarantee matching send/receive pairs, or use nonblocking routines...

Point-to-Point Communication - Nonblocking

Example:

              Proc 0          Proc 1
        -------------------------------------
            nbsend(to 1)    nbsend(to 0)
           nbrecv(from 1)  nbrecv(from 0)
           compute......   compute......
             waitall         waitall
            use result      use result

During the "compute......" phase, it's possible that the communication can be completed "in the background" while the computation proceeds, so when the "waitall" lines are reached, the program can just continue.

Deadlock is less likely but we still must be careful - the burden of avoiding deadlock is on the programmer in a message passing model of parallel computation.

Collective Communication

Some common operations don't fit well into a point-to-point communication scheme. Here, we may use collective communication routines.

These kinds of operations can be achieved through a series of point-to-point communication steps, but dedicated collective routines are usually provided.

Using the collective communication routines provided is generally better than trying to construct them yourself from point-to-point messages. In addition to being more convenient, the message passing library can often perform the operations more efficiently by taking advantage of low-level functionality.

Message Passing Interface (MPI)

The Message Passing Interface (MPI) was created by a standards committee in the early 1990s.

MPI Terminology

MPI Simple Program - basic MPI functions

You already saw and executed a simple MPI "Hello, World" program back in the first week.

Remember that you will need to run this with mpirun (on bullpen) or mpiexec (on dhanni) if you want to get more than one process!

MPI calls and constructs in the "Hello, World" program:

The model of parallelism is very different from what we have seen. All of our processes exist for the life of the program. We are not allowed to do anything before MPI_Init() or after MPI_Finalize(). We need to think in terms of a number of copies of the same program all starting up at MPI_Init().
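
As a reminder, the essential structure of such a program is sketched below (your copy from the first week may differ in its details):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);                  /* nothing MPI-related before this */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which copy am I? */
  MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many copies are there? */

  printf("Hello, World! from process %d of %d\n", rank, size);

  MPI_Finalize();                          /* nothing MPI-related after this */
  return 0;
}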

MPI - Point-to-Point message types

MPI One Message Program - Blocking Message functions

A simple MPI program that sends a single message:

See: /cluster/examples/mpimsg
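
The file itself is not reproduced here, but the essential pattern is a blocking MPI_Send matched by a blocking MPI_Recv, roughly as sketched below (the actual mpimsg code may differ; run with at least two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, value;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 42;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* to rank 1, tag 0 */
  }
  else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  /* from rank 0 */
    printf("process 1 received %d\n", value);
  }

  MPI_Finalize();
  return 0;
}
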
MPI Ring Message Program - Nonblocking Message functions

A slightly more interesting MPI program that sends one message from each process:

See: /cluster/examples/mpiring
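
The idea in a ring program is that each process sends to its successor and receives from its predecessor. With nonblocking calls, the sends and receives can be posted up front and completed later. A sketch of that pattern, assuming each process passes its rank around the ring (not the actual mpiring code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size, sendval, recvval;
  MPI_Request reqs[2];
  MPI_Status stats[2];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int next = (rank + 1) % size;          /* successor in the ring */
  int prev = (rank + size - 1) % size;   /* predecessor in the ring */
  sendval = rank;

  MPI_Isend(&sendval, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(&recvval, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[1]);

  /* ... other computation could proceed here ... */

  MPI_Waitall(2, reqs, stats);
  printf("process %d received %d from process %d\n", rank, recvval, prev);

  MPI_Finalize();
  return 0;
}
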
MPI - Collective Communication functions

See: /cluster/examples/mpicoll
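
To give a flavor of the collective calls, here is a sketch (illustrative only, not the contents of mpicoll) that broadcasts a value from process 0 to all processes and then sums a per-process value back to process 0 with a reduction:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size, n, sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) n = 10;
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* now everyone has n */

  int mine = rank * n;                             /* some per-process value */
  MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum across %d processes is %d\n", size, sum);

  MPI_Finalize();
  return 0;
}
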
MPI - Scatter/Gather

See: /cluster/examples/mpiscatgath

To understand what is going on with the various broadcast and scatter/gather functions, consider this figure, taken from the MPI Standard, p. 91.
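
A sketch of the scatter/gather pattern (illustrative only, not the actual mpiscatgath code): process 0 scatters equal-sized slices of an array, each process works on its slice, and the results are gathered back at process 0. The chunk size here is an arbitrary choice for the illustration.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define CHUNK 4   /* elements handled by each process */

int main(int argc, char *argv[])
{
  int rank, size, i;
  int *data = NULL;
  int mine[CHUNK];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {                       /* root sets up the full array */
    data = malloc(CHUNK * size * sizeof(int));
    for (i = 0; i < CHUNK * size; i++) data[i] = i;
  }

  /* each process receives its own CHUNK-element slice */
  MPI_Scatter(data, CHUNK, MPI_INT, mine, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

  for (i = 0; i < CHUNK; i++) mine[i] *= 2;   /* work on the local slice */

  /* the slices are collected back into the root's array */
  MPI_Gather(mine, CHUNK, MPI_INT, data, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (i = 0; i < CHUNK * size; i++) printf("%d ", data[i]);
    printf("\n");
    free(data);
  }

  MPI_Finalize();
  return 0;
}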

Sample Applications

Conway's Game of Life

The Game of Life was invented by John Conway in 1970. The game is played on a field of cells, each of which has eight neighbors (adjacent cells). A cell is either occupied (by an organism) or not. The rules for deriving a generation from the previous one are:

  1. An occupied cell with two or three occupied neighbors survives to the next generation.
  2. An occupied cell with fewer than two occupied neighbors dies (loneliness); one with more than three dies (overcrowding).
  3. An unoccupied cell with exactly three occupied neighbors becomes occupied in the next generation.

The game is very interesting in that complex patterns and cycles arise. Do a Google search to find plenty of Java applets you can try out.
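
To make the rules concrete, here is a sketch of the update for a single interior cell, assuming the field is stored in a 2-D array of ints with 1 meaning occupied and 0 meaning empty, padded by a border of empty cells (the array name, size, and function are made up for the illustration; this is not the code in the course examples):

#define N 8   /* field is N x N, stored with a border of empty cells */

/* compute the next state of interior cell (i,j) from the current grid */
int next_state(int grid[N+2][N+2], int i, int j)
{
  int neighbors = 0, di, dj;
  for (di = -1; di <= 1; di++)
    for (dj = -1; dj <= 1; dj++)
      if (di != 0 || dj != 0)
        neighbors += grid[i + di][j + dj];   /* count the eight neighbors */

  if (grid[i][j] == 1)
    return (neighbors == 2 || neighbors == 3) ? 1 : 0;   /* survive or die */
  else
    return (neighbors == 3) ? 1 : 0;                     /* birth */
}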

Serial version: See: /cluster/examples/life

MPI version: See: /cluster/examples/mpilife

Computing the Mandelbrot Set

Some Mandelbrot Set information and pictures can be found online.

Java Threads version: See: /cluster/examples/mset_java

MPI version: See: /cluster/examples/mset_mpi

Improved MPI version: See: /cluster/examples/mset_mpi_better

Matrix-Matrix Multiplication

Matrix-matrix multiplication using message passing is not as straightforward as matrix-matrix multiplication using shared memory and threads. Why?

Think about this example and we will discuss it in our meetings and will consider it further in the next assignment's readings.

Lab Tasks

There are several files to turn in for this assignment. They should all be submitted with the turnin utility as assignment4.

Write a C or C++ program using OpenMP that solves Laplace's equation on a two-dimensional, uniform, square grid, using Jacobi iteration. Don't worry if none of those terms make any sense - this document tells you what little you need to know about the math and physics.

Some background

Laplace's equation is an elliptic partial differential equation that governs physical phenomena such as the steady-state distribution of heat. In two dimensions, it can be written

Phi_xx + Phi_yy = 0.

Given a spatial region and values for points on the boundaries of the region, the goal is to approximate the steady-state solution for points in the interior. We do this by covering the region with an evenly-spaced grid of points. A grid of 8 × 8 would look like this:

* * * * * * * * * *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* * * * * * * * * *

The 8 × 8 grid represented by the dots is surrounded by a layer of boundary points, represented by *'s. Each interior point is initialized to some value. Boundary points remain constant throughout the simulation. The steady-state values of interior points are calculated by repeated iterations. On each iteration, the new value of a point is set to a combination of the old values of neighboring points. The computation terminates either after a given number of iterations or when every new value is within some acceptable difference eps of every old value.

There are several iterative methods for solving Laplace's equation. Your program is to use Jacobi iteration, which is the simplest and the most easily parallelized, though certainly not the most efficient in terms of convergence rate.

In Jacobi iteration, the new value for each grid point in the interior is set to the average of the old values of the four points left, right, above, and below it. This process is repeated until the program terminates. Note that some of the values used for the average will be boundary points.
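
In code, the heart of one Jacobi sweep looks something like the sketch below, where the arrays are (n+2) × (n+2) so that the outer layer holds the fixed boundary values (the array and function names are made up for the illustration; organizing the iterations, the convergence test, and the OpenMP parallelization are left for you):

/* one Jacobi sweep over the n x n interior points; the outer layer of
   each (n+2) x (n+2) array holds the fixed boundary values */
void jacobi_sweep(int n, double oldphi[n+2][n+2], double newphi[n+2][n+2])
{
  int i, j;
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
      newphi[i][j] = 0.25 * (oldphi[i-1][j] + oldphi[i+1][j] +
                             oldphi[i][j-1] + oldphi[i][j+1]);
}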

What to do and how to do it

Your submission should include an appropriate Makefile, your C or C++ source code (including the timer code from class, if you choose to use it), and a brief README file explaining how to run your program and containing your timings and analysis. Please do not include object files or your executable.

Grading guidelines: Your grade for the programming will be determined by correctness, design, documentation, and style, as well as the presentation of your timing results.