Computer Science 338
Parallel Processing

Williams College
Spring 2006

Tutorial Assignment 4: Jacobi Iteration with OpenMP
Due: Tuesday, February 28, 2006 at 9:00 AM


We'll be back to a regular meeting schedule on February 28 and March 1.


This week's readings are your introduction to message passing, but you will not do any real programming in the message passing paradigm until next week.

Read Quinn Ch. 4.

What about distributed memory?

So far we have seen three ways to create a parallel program:

  1. Let the compiler do whatever it can completely automatically
  2. Create threads explicitly using pthreads
  3. Specify parallel sections using OpenMP

These all suffer from one significant limitation - the cooperating threads must be able to communicate through shared variables.

How can we design and run parallel programs to work when there is no shared memory available?

Message Passing

We will now consider the message passing paradigm.

Cooperating Processes

Unix programs can use fork() to create new processes. CS 432 veterans remember this from their Cow Shell projects.

The Unix system call fork() duplicates a process. The child is a copy of the parent, executing at the same point: the statement after the return from fork().

The return value tells you whether you are the child or the parent:

0 means child, >0 means parent (the value is the child's pid), -1 means failure (process limit reached or permission denied)

Example C program:

pid = fork();
if (pid) {
  /* parent stuff */
} else {
  /* child stuff */
}
A more complete program that uses fork() along with three other system calls (wait(), getpid(), and getppid()) is here:

See: /usr/cs-local/share/cs338/examples/forking
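In case that file isn't at hand, here is a minimal sketch of the same idea (the function name and the exit value 42 are ours, not from the example file): fork a child that reports both pids with getpid() and getppid(), then reap it with wait().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: fork a child that reports both pids, then
   reap it with wait() and return the child's exit status. */
int fork_and_wait(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {                     /* child */
        printf("child %d, parent %d\n", (int)getpid(), (int)getppid());
        _exit(42);                      /* arbitrary exit value */
    }
    int status;                         /* parent */
    wait(&status);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```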

Processes created using fork() do not share context, and must allocate shared memory explicitly, or rely on a form of message passing to communicate.

We will not consider the Unix fork() method of multiprocessing much, but here's an example to show you that in fact, even global variables are not shared by processes that are created with fork():

See: /usr/cs-local/share/cs338/examples/what_shared
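The effect is easy to demonstrate in a few lines. In this sketch (the helper name is ours, not from the example file), the child changes a global variable and exits; the parent's copy of that global is unchanged:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;    /* a global variable: NOT shared across fork() */

/* Hypothetical sketch: returns 1 if the child's change to the global
   is invisible to the parent (which is what actually happens). */
int globals_unshared(void) {
    pid_t pid = fork();
    if (pid == 0) {              /* child: modifies its own copy */
        counter = 5;
        _exit(counter);          /* report the child's value */
    }
    int status;
    waitpid(pid, &status, 0);
    /* the child saw 5, but the parent's counter is still 0 */
    return WEXITSTATUS(status) == 5 && counter == 0;
}
```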

Remember that the advantage of using processes such as these instead of threads is that the processes could potentially be running on different systems. But if they are going to cooperate, they will need to communicate:

Sockets and pipes provide only very rudimentary interprocess communication. Each "message" sent through a pipe or across a socket has a unique sender and a unique receiver, and is really nothing more than a stream of bytes. The sender and receiver must impose any structure on these communications themselves.

Here's a very simplistic example of two processes that can communicate over raw sockets. It is included mainly to show you that you don't want to be doing this if you can help it:

See: /usr/cs-local/share/cs338/examples/socket
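The byte-stream point is easy to see with a pipe (the simpler cousin of a socket). In this sketch (names ours), what travels is just bytes: the receiver must already know that exactly five bytes are coming and what they mean.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: send five raw bytes from child to parent
   through a pipe. Returns 1 if the bytes arrive intact. */
int pipe_roundtrip(void) {
    int fd[2];
    char buf[6] = {0};
    if (pipe(fd) != 0)
        return 0;
    if (fork() == 0) {           /* child: the sender */
        write(fd[1], "hello", 5);
        _exit(0);
    }
    read(fd[0], buf, 5);         /* parent: the receiver */
    wait(NULL);
    return strcmp(buf, "hello") == 0;
}
```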

For many applications, this primitive interface is unreasonable. We want something at a higher level. Message passing libraries have evolved to meet this need.

Message Passing Libraries

Point-to-Point Communication

All message passing is based on two simple operations, send and receive:

   P0: send(addr,len,dest,tag)

   P1: receive(addr,max_len,src,tag,rec_len)

These are basic components in any message-passing implementation. There may be others introduced by a specific library.

Point-to-Point Communication - Blocking

Blocking communication has simple semantics: a blocking send does not return until the message has been copied out of the send buffer (so the buffer can safely be reused), and a blocking receive does not return until the message has arrived.

But beware of deadlock when using blocking routines!!

              Proc 0          Proc 1
            bsend(to 1)    bsend(to 0)
           brecv(from 1)  brecv(from 0)

If neither message can be copied into a system buffer, or if the calls are strictly synchronous, each send blocks until the corresponding receive call is made. Neither process can proceed: deadlock.

Possible solutions: reorder the calls to guarantee matching send/receive pairs, or use nonblocking routines. Reordered, the example above becomes:

              Proc 0          Proc 1
            bsend(to 1)    brecv(from 0)
           brecv(from 1)    bsend(to 0)

Point-to-Point Communication - Nonblocking


              Proc 0          Proc 1
            nbsend(to 1)    nbsend(to 0)
           nbrecv(from 1)  nbrecv(from 0)
           compute......   compute......
             waitall         waitall
            use result      use result

During the "compute......" phase, it's possible that the communication can be completed "in the background" while the computation proceeds, so when the "waitall" lines are reached, the program can just continue.

Deadlock is less likely but we still must be careful - the burden of avoiding deadlock is on the programmer in a message passing model of parallel computation.
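With MPI (introduced below), the nonblocking pattern above might be sketched like this. The function and buffer names are ours, and this requires an MPI installation to compile:

```c
#include <mpi.h>

/* Hypothetical sketch: start the send and the receive, compute while
   the communication (possibly) proceeds in the background, then wait
   for both requests before using the result. */
void exchange(int other, double *sendbuf, double *recvbuf, int n) {
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... compute ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* ... now it is safe to reuse sendbuf and to read recvbuf ... */
}
```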

Collective Communication

Some common operations don't fit well into a point-to-point communication scheme. Here, we may use collective communication routines.

These operations could be achieved through a series of point-to-point communication steps, but message passing libraries usually provide them directly.

Using the collective operations provided is generally better than rolling your own out of point-to-point messages. Besides being more convenient, the library can often perform the operations more efficiently by taking advantage of low-level functionality (for example, organizing a broadcast as a tree of messages rather than a linear sequence).
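For example, with MPI (introduced below), summing one value from every process takes a single collective call; the library chooses how to route the messages. The helper name here is ours, and an MPI installation is required:

```c
#include <mpi.h>

/* Hypothetical sketch: a global sum via one collective call;
   every process contributes one value and gets back the total. */
double global_sum(double local) {
    double total;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;           /* same result on every process */
}
```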

Message Passing Interface (MPI)

The Message Passing Interface (MPI) was created by a standards committee in the early 1990s.

MPI Terminology
MPI Simple Program - basic MPI functions

You already saw and executed a simple MPI "Hello, World" program back in the first week.

Remember that you will need to run this with mpirun (on bullpen) or mpiexec (on dhanni) if you want to get more than one process!

MPI calls and constructs in the "Hello, World" program:

The model of parallelism is very different from what we have seen. All of our processes exist for the life of the program. We are not allowed to do anything before MPI_Init() or after MPI_Finalize(). We need to think in terms of a number of copies of the same program all starting up at MPI_Init().
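A minimal program of this shape might look as follows (a sketch, not necessarily identical to the first week's example; it requires an MPI installation, compiled with mpicc and launched with mpirun or mpiexec):

```c
#include <stdio.h>
#include <mpi.h>

/* Every process runs this same code, and all MPI activity is
   bracketed by MPI_Init() and MPI_Finalize(). */
int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which copy am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many copies?  */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```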

MPI - Point-to-Point message types
MPI One Message Program - Blocking Message functions

A simple MPI program that sends a single message:

See: /usr/cs-local/share/cs338/examples/mpimsg

MPI Ring Message Program - Nonblocking Message functions

A slightly more interesting MPI program that sends one message from each process:

See: /usr/cs-local/share/cs338/examples/mpiring

MPI - Collective Communication functions

See: /usr/cs-local/share/cs338/examples/mpicoll

MPI - Scatter/Gather

See: /usr/cs-local/share/cs338/examples/mpiscatgath

To understand what is going on with the various broadcast and scatter/gather functions, consider the figure on p. 91 of the MPI Standard.

Sample Applications

Conway's Game of Life

The Game of Life was invented by John Conway in 1970. The game is played on a field of cells, each of which has eight neighbors (adjacent cells). A cell is either occupied (by an organism) or not. The rules for deriving a generation from the previous one are:

  1. An occupied cell with two or three occupied neighbors survives to the next generation.
  2. An occupied cell with fewer than two occupied neighbors dies (isolation); one with more than three occupied neighbors dies (overcrowding).
  3. An empty cell with exactly three occupied neighbors becomes occupied (birth).

The game is very interesting in that complex patterns and cycles arise. A Google search will turn up plenty of Java applets you can try out.
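The rules can be sketched directly in C. This is a serial, one-generation step on a small fixed-size grid; the grid size and function name are just for illustration, and cells outside the grid are treated as permanently dead:

```c
#define ROWS 5
#define COLS 5

/* One Game of Life generation (0 = dead, 1 = occupied). */
void life_step(int old[ROWS][COLS], int new_[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            int n = 0;                       /* count occupied neighbors */
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++) {
                    if (di == 0 && dj == 0) continue;
                    int ii = i + di, jj = j + dj;
                    if (ii >= 0 && ii < ROWS && jj >= 0 && jj < COLS)
                        n += old[ii][jj];
                }
            /* survive with 2 or 3 neighbors; birth with exactly 3 */
            new_[i][j] = old[i][j] ? (n == 2 || n == 3) : (n == 3);
        }
}
```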

Serial version: See: /usr/cs-local/share/cs338/examples/life

MPI version: See: /usr/cs-local/share/cs338/examples/mpilife

Computing the Mandelbrot Set

Some Mandelbrot Set Information and pictures

Java Threads version: See: /usr/cs-local/share/cs338/examples/mand-java

MPI version: See: /usr/cs-local/share/cs338/examples/mand-mpi

Improved MPI version: See: /usr/cs-local/share/cs338/examples/mand-mpi-better

Matrix-Matrix Multiplication

Matrix-matrix multiplication using message passing is not as straightforward as matrix-matrix multiplication using shared memory and threads. Why?

Think about this example and we will discuss it in the tutorial meeting and will consider it further in next week's readings.

Lab Tasks

You may work alone or in pairs on this week's program. Should you choose to work in a group, you may but are not required to work with your tutorial partner. Groups must be formed and confirmed by e-mail no later than Friday, February 24. If you work in a group, I recommend getting a Unix group set up to allow sharing of files without sharing of passwords. See Mary about this.

There are several files to turn in for this assignment. They should all be included in a file named tut04.tar that you submit using the turnin utility. Please use the filenames specified and be sure to include your name (and your partner's name, if you are working in a group) in each file.

Write a C or C++ program using OpenMP that solves Laplace's equation on a two-dimensional, uniform, square grid, using Jacobi iteration. Don't worry if none of those terms make any sense - this document tells you what little you need to know about the math and physics.

Some background

Laplace's equation is an elliptic partial differential equation that governs physical phenomena such as heat. In two dimensions, it can be written

Phi_xx + Phi_yy = 0.

Given a spatial region and values for points on the boundaries of the region, the goal is to approximate the steady-state solution for points in the interior. We do this by covering the region with an evenly-spaced grid of points. An 8 × 8 grid would look like this:

* * * * * * * * * *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* * * * * * * * * *

The 8 × 8 grid represented by the dots is surrounded by a layer of boundary points, represented by *'s. Each interior point is initialized to some value. Boundary points remain constant throughout the simulation. The steady-state values of interior points are calculated by repeated iterations. On each iteration, the new value of a point is set to a combination of the old values of neighboring points. The computation terminates either after a given number of iterations or when every new value is within some acceptable difference eps of the corresponding old value.

There are several iterative methods for solving Laplace's equation. Your program is to use Jacobi iteration, which is the simplest and most easily parallelized, though certainly not the most efficient in terms of convergence rate.

In Jacobi iteration, the new value for each grid point in the interior is set to the average of the old values of the four points left, right, above, and below it. This process is repeated until the program terminates. Note that some of the values used for the average will be boundary points.
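One Jacobi sweep, as described above, can be sketched as follows. This is only an illustration of the update rule, not a prescribed solution: the array layout, function name, and OpenMP pragma placement are one possible choice among many.

```c
#define N 8   /* interior points per side; the grid is (N+2) x (N+2) */

/* One Jacobi sweep: each interior point of new_ becomes the average
   of its four old neighbors. Returns the largest change at any
   interior point, for the convergence test. Boundary rows and
   columns are assumed to be set by the caller and are not touched. */
double jacobi_sweep(double old[N + 2][N + 2], double new_[N + 2][N + 2]) {
    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            new_[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j] +
                                 old[i][j - 1] + old[i][j + 1]);

    double maxdiff = 0.0;      /* convergence check, done serially */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            double d = new_[i][j] - old[i][j];
            if (d < 0) d = -d;
            if (d > maxdiff) maxdiff = d;
        }
    return maxdiff;
}
```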

What to do and how to do it

Your submitted tar file should include the tut04.txt file, an appropriate Makefile, your C or C++ source code (including the timer code from class, if you choose to use it), your PBS script(s), and a brief README file explaining how to run your program and giving your answers. Please do not include object files or your executable.

Honor code guidelines: While the program is to be done only by you (meaning your group, if you choose to work in a group), along the lines of a laboratory program, I want to encourage you to ask questions and discuss the program with me, our TA, and with classmates outside your group, as you develop it. However, no sharing of code between groups is permitted. If you have any doubts, please check first and avoid honor code problems later.

Grading guidelines: Your grade for the programming will be determined by correctness, design, documentation, and style, as well as the presentation of your timing results.