## Computer Science 338

**Tutorial Assignment 5: More MPI**

**Due: Tuesday, March 7, 2006 at 9:00 AM**

The majority of this week's readings are from the textbook. Please read Quinn Chapters 5, 6 and 8. These chapters use three major examples of parallel algorithms to demonstrate program design and implementation using the MPI model. For next week's tutorial meetings, I would like each of you to come with a list of the most important ideas covered in each chapter, and of course any questions you have about the readings.

There are also a few matrix-matrix multiplication examples that I would like you to look at.

The MPI version of Conway's Game of Life used a distributed data structure. Each process maintains its own subset of the computational domain, in this case just a number of rows of the grid. Other processes do not know about the data on a given process. Only the data needed to compute the next generation, a one-cell overlap, is exchanged between iterations.

The "slice by slice" method of distributing the grid was chosen only for its simplicity of implementation, both in determining which rows are assigned to which processes and in the straightforward communication patterns that can be used to exchange boundary data. We could partition in more complicated patterns, but there would be extra work involved.

The possibilities for the matrix-matrix multiply are numerous. The
absolute easiest way to do it is to distribute the matrix `A` by rows,
replicate `B` everywhere, and compute `C` by rows. If we distribute
our matrices this way in the first place, everything is simple:

**See:** `/usr/cs-local/share/cs338/examples/matmult_mpi_toosimple`

This program has very little MPI communication; this is by design, as we distributed our matrices so that each process would have exactly what it needs.

Unfortunately, this is not likely to be especially useful. More likely, we will want all three matrices distributed the same way.

To make the situation more realistic, but still straightforward, let's
assume that our initial matrices `A` and `B` are distributed
by rows, in the same fashion as the Life simulator. Further, the
result matrix `C` is also to be distributed by rows.

The process that owns each row will do the computation for that row. What information does each process have locally? What information will it need to request from other processes?

Matrix multiplication is a pretty "dense" operation, and we need to
send all the columns of `B` to all processes.

**See:** `/usr/cs-local/share/cs338/examples/matmult_mpi_simple`

Note that each row of `B` is initialized on only one process, but since
all of `B` is needed on every process, we need to broadcast those rows.

Can we do better? Can we get away without storing all of `B` on
each process? We know we need to send it, but can we do all the
computation that needs each row before continuing on to the next?

**See:** `/usr/cs-local/share/cs338/examples/matmult_mpi_better`

Yes, all we had to do was rearrange the loops that do the actual
computation of the entries of `C`. We can broadcast each row, use
it for everything it needs to be used for, then move on. We save
memory!

Even though we do the exact same amount of communication, our memory
usage per process goes from *O(n²)* to *O(n²/p)*: each process now
holds only its own rows of `A` and `C` plus a single row of `B` at a
time, instead of a full copy of `B`.

Next week, we will look at more complicated examples of distributed data structures and methods for determining the domain decomposition of those structures.

**Lab Tasks**

You may work alone or in pairs again on this week's programs. Any changes from last week's groups must be confirmed by an e-mail to me no later than Friday, March 3.

There are several files to turn in for this assignment. They should
all be included in a file named `tut05.tar` that you submit using
the `turnin` utility. *Please use the filenames specified*
and be sure to include your name (and your partner's name, if you are
working in a group) in each file.

- Modify one of the matrix-matrix multiply examples that uses MPI
for communication so that all three matrices exist only on the master
process at the start and finish of the computation. Use MPI functions
to distribute the entries of `A` and `B`, as needed, to the processes
(include the master in the computation), do the multiplication, then
use MPI functions to gather the result `C` back to the master.

  Your submitted tar file should include a subdirectory `matmult` that
includes your `Makefile`, your C source code, your PBS script(s), and
a brief `README` file that explains how to run your program, describes
how and why you chose to parallelize your program, and describes and
analyzes your timing results. Please do *not* include object files or
your executable in your tar file.

- Convert your OpenMP program that solves Laplace's equation using
Jacobi iteration to a distributed memory program using message passing
with the MPI library for communication. Most of the requirements for
the program are the same as those of the OpenMP version.

  Your submitted tar file should include a subdirectory `laplace` that
includes your `Makefile`, your C source code, your PBS script(s), and
a brief `README` file that explains how to run your program, describes
how and why you chose to parallelize your program, and describes and
analyzes your timing results. Please do *not* include object files or
your executable in your tar file.

**Honor code guidelines**: While the programs are to be done only by
you (meaning your group, if you choose to work in a group), along the
lines of a **laboratory program**, I want to encourage you to ask
questions and discuss the programs with me and with classmates outside
your group as you develop them. However, no sharing of code between
groups is permitted. If you have any doubts, please check first and
avoid honor code problems later.

**Grading guidelines**: Your grade for the programs will be determined
by correctness, design, documentation, and style, as well as the
required items in the detailed comments and `README` file.