Computer Science 341-02
Parallel Processing

Mount Holyoke College
Fall 2007


Assignment 5: More MPI
Due: 9:00 AM, Thursday, October 25, 2007, and 9:00 AM, Tuesday, October 30, 2007


Readings

The majority of this week's readings are from the textbook. Please read Quinn Chapters 5, 6 and 8. These chapters use three major examples of parallel algorithms to demonstrate program design and implementation using the MPI model. For next week's meetings, I would like each of you to come with a list of the most important ideas covered in each chapter, and of course any questions you have about the readings. We will discuss Chapter 5 on Tuesday, October 23, Chapter 6 on Thursday, October 25, and Chapter 8 on Tuesday, October 30.

There are also a few matrix-matrix multiplication examples that I would like you to look at.

Distributed Data Structures

The MPI version of Conway's Game of Life used a distributed data structure. Each process maintains its own subset of the computational domain, in this case a contiguous block of rows of the grid. No process knows about the data held by any other process. Only the data needed to compute the next generation, a one-cell overlap along each boundary, is exchanged between iterations.

The "slice by slice" method of distributing the grid was chosen only for its simplicity of implementation, both in the determination of what processes are given what rows, and the straightforward communication patterns that can be used to exchange boundary data. We could partition in more complicated patterns, but there would be extra work involved.

Matrix-Matrix Multiplication

The possibilities for the matrix-matrix multiply are numerous. The absolute easiest way to do it is to distribute the matrix A by rows, replicate B everywhere, and distribute C by rows. If we distribute our matrices this way in the first place, everything is simple:

See: /cluster/examples/matmult_mpi_toosimple

This program has very little MPI communication - this is by design, as we distributed our matrices so that each process would have exactly what it needs.
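To see why, consider what the multiply looks like once the data is in place. A minimal sketch of the computational core, with illustrative names rather than the actual code from the example:

  /* Hypothetical core of the "too simple" distribution: each process
     owns local_rows rows of A and C plus a full N x N copy of B, so
     the multiply itself needs no communication at all. */
  #define N 512

  void local_matmult(double A[][N], double B[][N], double C[][N],
                     int local_rows)
  {
      for (int i = 0; i < local_rows; i++)    /* my rows of A and C */
          for (int j = 0; j < N; j++) {
              C[i][j] = 0.0;
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
          }
  }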

Unfortunately, this is not likely to be especially useful. More likely, we will want all three matrices distributed the same way.

To make the situation more realistic, but still straightforward, let's assume that our initial matrices A and B are distributed by rows, in the same fashion as the Life simulator. Further, the result matrix C is also to be distributed by rows.

The process that owns each row will do the computation for that row. What information does each process have locally? What information will it need to request from other processes?

Matrix multiplication is a pretty "dense" operation: to compute even one of its rows of C, a process needs every column of B, so all of B must be sent to all processes.

See: /cluster/examples/matmult_mpi_simple

Note that each row of B is initialized on only one process, but since all of B is needed on every process, those rows must be broadcast to everyone.
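One way to express that broadcast, as a sketch: each row of B is broadcast by the rank that owns it, so after the loop every process holds a full copy. The helper row_owner and the block-row assumption are mine, not necessarily the example's:

  #include <mpi.h>

  #define N 512

  /* Hypothetical: with a block-row distribution and N divisible by
     the number of processes, row k belongs to rank k / rows_per_proc. */
  static int row_owner(int k, int rows_per_proc)
  {
      return k / rows_per_proc;
  }

  void replicate_B(double B[][N], int rows_per_proc)
  {
      /* after this loop, every process holds all of B */
      for (int k = 0; k < N; k++)
          MPI_Bcast(B[k], N, MPI_DOUBLE,
                    row_owner(k, rows_per_proc), MPI_COMM_WORLD);
  }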

Can we do better? Can we get away without storing all of B on each process? We know we need to send all of it, but can we do all the computation that needs each row before continuing on to the next?

See: /cluster/examples/matmult_mpi_better

Yes, all we had to do was rearrange the loops that do the actual computation of the entries of C. We broadcast each row of B, use it for everything it needs to be used for, then move on to the next row. We save memory!
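A sketch of the idea, again with illustrative names: make k (the row index of B) the outermost loop, broadcast that one row into a buffer, and apply it to every entry of C it contributes to before fetching the next row. Here B_local holds only this process's own block of rows of B:

  #include <mpi.h>
  #include <string.h>

  #define N 512

  /* Hypothetical rearranged multiply: only one row of B (brow) is
     resident at a time, beyond each process's own block of B.
     Assumes N divides evenly by the number of processes. */
  void matmult_better(double A[][N], double C[][N], double B_local[][N],
                      int local_rows, int rows_per_proc)
  {
      int rank;
      double brow[N];

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (int i = 0; i < local_rows; i++)
          for (int j = 0; j < N; j++)
              C[i][j] = 0.0;

      for (int k = 0; k < N; k++) {           /* k is now outermost */
          int owner = k / rows_per_proc;
          if (rank == owner)                  /* owner supplies row k */
              memcpy(brow, B_local[k % rows_per_proc],
                     N * sizeof(double));
          MPI_Bcast(brow, N, MPI_DOUBLE, owner, MPI_COMM_WORLD);

          /* use row k of B everywhere it contributes, then discard it */
          for (int i = 0; i < local_rows; i++)
              for (int j = 0; j < N; j++)
                  C[i][j] += A[i][k] * brow[j];
      }
  }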

Even though we do the exact same amount of communication, our memory usage per process goes from O(n^2) to O(n^2/p).

In the next assignment, we will look at more complicated examples of distributed data structures and methods for determining the domain decomposition of those structures.

Lab Tasks

There are several files to turn in for this assignment. They should all be submitted under a directory named assignment5 that you submit using the turnin utility. Please use the filenames specified and be sure to include your name in each file.

  1. Modify one of the matrix-matrix multiply examples that uses MPI for communication so that all three matrices exist only on the master process at the start and finish of the computation. Use MPI functions to distribute the entries of A and B, as needed, to the processes (include the master in the computation), do the multiplication, then use MPI functions to gather the result C back to the master (a sketch of one possible communication pattern appears after this list). Your submission should include a subdirectory matmult that includes your Makefile, your source code, and a brief README file that explains how to run your program, describes how and why you chose to parallelize your program, and describes and analyzes your timing results. Please do not include object files or your executable in your submission. This part is due by 4:00 PM, Thursday, October 25.
  2. Convert your OpenMP program that solves Laplace's equation using Jacobi iteration to a distributed memory program using message passing with the MPI library for communication. Most of the requirements for the program are the same as those of the OpenMP version. This part of your submission should be in a subdirectory jacobi that includes your Makefile, your source code, and a brief README file that explains how to run your program, describes how and why you chose to parallelize your program, and describes and analyzes your timing results. Please do not include object files or your executable. This part is due by 4:00 PM, Tuesday, October 30, 2007. Grading guidelines: Your grade for the programs will be determined by correctness, design, documentation, and style, as well as the required items in the detailed comments and README file.
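For task 1, the communication pattern (not the whole program) can be organized around MPI's collective operations. A rough skeleton, under the assumptions that the matrices are N x N, that N divides evenly by the number of processes, and with all names illustrative:

  #include <mpi.h>

  #define N 512

  /* Hypothetical skeleton for task 1: the master (rank 0) holds full
     copies of A, B, and C; rows of A are scattered, B is broadcast,
     and rows of C are gathered back.  Not a complete solution. */
  void distribute_multiply_gather(double A[][N], double B[][N],
                                  double C[][N], double A_local[][N],
                                  double C_local[][N], int local_rows)
  {
      /* rows of A go out to every process, master included */
      MPI_Scatter(A, local_rows * N, MPI_DOUBLE,
                  A_local, local_rows * N, MPI_DOUBLE,
                  0, MPI_COMM_WORLD);

      /* every process needs all of B */
      MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      /* ... each process computes its local_rows rows of C ... */

      /* result rows come back to the master */
      MPI_Gather(C_local, local_rows * N, MPI_DOUBLE,
                 C, local_rows * N, MPI_DOUBLE,
                 0, MPI_COMM_WORLD);
  }

If the number of rows does not divide evenly among the processes, MPI_Scatterv and MPI_Gatherv accept per-process counts and displacements instead.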