# Computer Science 338 Parallel Processing

### Williams College Spring 2006

Tutorial Assignment 5: More MPI
Due: Tuesday, March 7, 2006 at 9:00 AM

The majority of this week's readings are from the textbook. Please read Quinn Chapters 5, 6 and 8. These chapters use three major examples of parallel algorithms to demonstrate program design and implementation using the MPI model. For next week's tutorial meetings, I would like each of you to come with a list of the most important ideas covered in each chapter, and of course any questions you have about the readings.

There are also a few matrix-matrix multiplication examples that I would like you to look at.

#### Distributed Data Structures

The MPI version of Conway's Game of Life used a distributed data structure. Each process maintains its own subset of the computational domain, in this case a contiguous block of rows of the grid. No process knows about the data held by any other process. Only the data needed to compute the next generation, a one-cell overlap of boundary rows, is exchanged between iterations.

The "slice by slice" method of distributing the grid was chosen only for its simplicity of implementation, both in determining which rows are assigned to each process and in the straightforward communication patterns that can be used to exchange boundary data. We could partition in more complicated patterns, but there would be extra work involved.

##### Matrix-Matrix Multiplication

The possibilities for the matrix-matrix multiply are numerous. The absolute easiest way to do it is to distribute the matrix A by rows, replicate B everywhere, and distribute C by rows. If we distribute our matrices this way in the first place, everything is simple:

See: /usr/cs-local/share/cs338/examples/matmult_mpi_toosimple

This program has very little MPI communication - this is by design, as we distributed our matrices so that each process would have exactly what it needs.

Unfortunately, this is not likely to be especially useful. More likely, we will want all three matrices distributed the same way.

To make the situation more realistic, but still straightforward, let's assume that our initial matrices A and B are distributed by rows, in the same fashion as the Life simulator. Further, the result matrix C is also to be distributed by rows.

The process that owns each row will do the computation for that row. What information does each process have locally? What information will it need to request from other processes?

Matrix multiplication is a pretty "dense" operation, and we need to send all the columns of B to all processes.

See: /usr/cs-local/share/cs338/examples/matmult_mpi_simple

Note that we only initialize rows of B on one process, but since all of B is needed on every process, we need to broadcast those rows.

Can we do better? Can we get away without storing all of B on each process? We know we need to send it, but can we do all the computation that needs each row before continuing on to the next?

See: /usr/cs-local/share/cs338/examples/matmult_mpi_better

Yes, all we had to do was rearrange the loops that do the actual computation of the entries of C. We broadcast each row of B, use it in every computation that needs it, then move on to the next row. We save memory!

Even though we do the exact same amount of communication, our memory usage per process goes from O(n²) to O(n²/p).

Next week, we will look at more complicated examples of distributed data structures and methods for determining the domain decomposition of those structures.

You may work alone or in pairs again on this week's programs. Any changes from last week's groups must be confirmed by an e-mail to me no later than Friday, March 3.

There are several files to turn in for this assignment. They should all be included in a file named tut05.tar that you submit using the turnin utility. Please use the filenames specified and be sure to include your name (and your partner's name, if you are working in a group) in each file.

1. Modify one of the matrix-matrix multiply examples that uses MPI for communication so that all three matrices exist only on the master process at the start and finish of the computation. Use MPI functions to distribute the entries of A and B, as needed, to the processes (include the master in the computation), do the multiplication, then use MPI functions to gather the result C back to the master.